MobilityData / gtfs-realtime-validator

Java-based tool that validates General Transit Feed Specification (GTFS)-realtime feeds

Add tests for big feeds #30

Open barbeau opened 2 years ago

barbeau commented 2 years ago

Issue by barbeau Wednesday Apr 12, 2017 at 15:21 GMT Originally opened as


We need to make sure that as we add new rules, the validator can continue to run in real-time on production-sized feeds for major cities.

I posted a question on the GTFS-rt list asking for examples of very large feeds:!topic/gtfs-realtime/mM8cQIIV_-Y

These have been suggested to me so far, with largest coming first:

We should add some unit tests that do basic benchmarking to ensure we're not exceeding a given duration when processing feeds. I think 2 seconds may be reasonable, but we'll need to test. We'll also need to figure out how this works for CI, as Travis is significantly underpowered when compared to a typical desktop.
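
A minimal sketch of what such a timing check might look like, in plain Java with no test framework. The class name `FeedTimingCheck`, its methods, and the placeholder workload are all hypothetical and not part of the validator; the 2-second budget is only the figure floated above, and a real test would run one GTFS-rt validation iteration where the placeholder loop is.

```java
import java.time.Duration;

/** Hypothetical helper (not part of the validator): times a task against a budget. */
class FeedTimingCheck {

    /** Runs the task once and returns the elapsed wall-clock time. */
    static Duration time(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return Duration.ofNanos(System.nanoTime() - start);
    }

    /** Returns true if a single run of the task finishes within the budget. */
    static boolean withinBudget(Runnable task, Duration budget) {
        return time(task).compareTo(budget) <= 0;
    }

    public static void main(String[] args) {
        // Placeholder workload; a real test would validate one GTFS-rt iteration here.
        Runnable validateOneIteration = () -> {
            long sum = 0;
            for (int i = 0; i < 1_000_000; i++) {
                sum += i;
            }
            if (sum < 0) {
                throw new IllegalStateException("unexpected overflow");
            }
        };

        Duration budget = Duration.ofSeconds(2); // assumed budget, still to be tested
        Duration elapsed = time(validateOneIteration);
        System.out.println("elapsed ms: " + elapsed.toMillis()
                + ", within budget: " + (elapsed.compareTo(budget) <= 0));
    }
}
```

A single wall-clock measurement like this is noisy, especially on shared CI machines, so the budget on CI would probably need to be looser than on a desktop.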

barbeau commented 2 years ago

Comment by barbeau Wednesday Apr 12, 2017 at 19:15 GMT

If I try to run the Dutch feed with the -Xmx8g parameter on my machine (dual Xeon @ 2.5 GHz w/ 16GB RAM), I get this exception after it runs for a very long time (I left and came back an hour later):

javax.servlet.ServletException: org.glassfish.jersey.server.ContainerException: java.lang.OutOfMemoryError: GC overhead limit exceeded
    at org.glassfish.jersey.servlet.WebComponent.service(
    at org.glassfish.jersey.servlet.ServletContainer.service(
    at org.glassfish.jersey.servlet.ServletContainer.service(
    at org.glassfish.jersey.servlet.ServletContainer.service(
    at org.eclipse.jetty.servlet.ServletHolder.handle(
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(
    at org.eclipse.jetty.servlet.ServletHandler.doScope(
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(
    at org.eclipse.jetty.server.Server.handle(
    at org.eclipse.jetty.server.HttpChannel.handle(
    at org.eclipse.jetty.server.HttpConnection.onFillable(
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(
    at org.eclipse.jetty.util.thread.QueuedThreadPool$
Caused by: org.glassfish.jersey.server.ContainerException: java.lang.OutOfMemoryError: GC overhead limit exceeded
    at org.glassfish.jersey.servlet.internal.ResponseWriter.rethrow(
    at org.glassfish.jersey.servlet.internal.ResponseWriter.failure(
    at org.glassfish.jersey.server.ServerRuntime$Responder.process(
    at org.glassfish.jersey.server.ServerRuntime$
    at org.glassfish.jersey.internal.Errors$
    at org.glassfish.jersey.internal.Errors$
    at org.glassfish.jersey.internal.Errors.process(
    at org.glassfish.jersey.internal.Errors.process(
    at org.glassfish.jersey.internal.Errors.process(
    at org.glassfish.jersey.process.internal.RequestScope.runInScope(
    at org.glassfish.jersey.server.ServerRuntime.process(
    at org.glassfish.jersey.server.ApplicationHandler.handle(
    at org.glassfish.jersey.servlet.WebComponent.service(
    ... 17 more
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.lang.AbstractStringBuilder.<init>(
    at java.lang.StringBuilder.<init>(
    at org.onebusaway.csv_entities.DelimitedTextParser.parse(
    at org.onebusaway.csv_entities.CSVLibrary.parse(
    at org.onebusaway.csv_entities.CsvTokenizerStrategy.parse(
    at org.onebusaway.csv_entities.CsvEntityReader.readEntities(
    at org.onebusaway.csv_entities.CsvEntityReader.readEntities(
    at org.onebusaway.csv_entities.CsvEntityReader.readEntities(
    at com.conveyal.gtfs.validator.json.FeedProcessor.load(
    at edu.usf.cutr.gtfsrtvalidator.api.resource.GtfsFeed.postGtfsFeed(
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(
    at java.lang.reflect.Method.invoke(
    at org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory$1.invoke(
    at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$
    at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.invoke(
    at org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$ResponseOutInvoker.doDispatch(
    at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.dispatch(
    at org.glassfish.jersey.server.model.ResourceMethodInvoker.invoke(
    at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(
    at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(
    at org.glassfish.jersey.server.ServerRuntime$
    at org.glassfish.jersey.internal.Errors$
    at org.glassfish.jersey.internal.Errors$
    at org.glassfish.jersey.internal.Errors.process(
    at org.glassfish.jersey.internal.Errors.process(
    at org.glassfish.jersey.internal.Errors.process(
    at org.glassfish.jersey.process.internal.RequestScope.runInScope(

So it looks like it's getting hung up in the static GTFS validation using the Conveyal gtfs-validator.
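
One rough way to narrow this down short of a full profiler might be to log heap usage around the GTFS load step. The sketch below uses only the standard `java.lang.Runtime` API; the class name `HeapSnapshot` and the placeholder load step are hypothetical and not part of the validator.

```java
/** Hypothetical helper (not part of the validator): reports JVM heap usage. */
class HeapSnapshot {

    /** Bytes currently in use on the heap (allocated minus free). */
    static long usedBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    /** Maximum heap the JVM will attempt to use, i.e. the effect of -Xmx. */
    static long maxBytes() {
        return Runtime.getRuntime().maxMemory();
    }

    /** Formats a one-line summary such as "label: used 12 MB of max 8192 MB". */
    static String describe(String label) {
        long mb = 1024L * 1024L;
        return label + ": used " + (usedBytes() / mb) + " MB of max " + (maxBytes() / mb) + " MB";
    }

    public static void main(String[] args) {
        System.out.println(describe("before load"));
        // Placeholder for the static GTFS load step that ran out of memory above.
        byte[] placeholder = new byte[16 * 1024 * 1024];
        System.out.println(describe("after load (" + placeholder.length + " bytes allocated)"));
    }
}
```

These numbers are approximate because garbage collection can run at any point, but a used figure that climbs toward the maximum during the load would be consistent with the "GC overhead limit exceeded" error.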

If I run the Dutch GTFS-rt feed with random GTFS data (I used HART in Tampa), then it processes each GTFS-rt iteration in about 1.1 seconds.

Using MBTA data, it processes each GTFS-rt iteration in about 1.1 seconds as well.

barbeau commented 2 years ago

Comment by barbeau Wednesday May 03, 2017 at 21:16 GMT

Here's a good list of GTFS-rt feeds from

barbeau commented 2 years ago

Comment by barbeau Monday May 08, 2017 at 19:40 GMT

Transitland issue for adding support for GTFS-rt feeds -

barbeau commented 2 years ago

Comment by barbeau Tuesday Sep 19, 2017 at 17:55 GMT

We could use the batch processor to benchmark feed processing times - see README "Configuration options -> Batch processing":

barbeau commented 2 years ago

Comment by skjolber Sunday Mar 11, 2018 at 10:01 GMT

@barbeau did you try running the out-of-memory dataset using a profiler?

barbeau commented 2 years ago

Comment by barbeau Sunday Mar 11, 2018 at 21:17 GMT

No, not yet.

barbeau commented 2 years ago

Comment by barbeau Tuesday Dec 14, 2021 at 15:24 GMT

A good approach for this might be to graph performance on each PR instead of imposing hard limits via a unit test - that's what OpenTripPlanner is doing here:
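
As a rough illustration of recording timings so they can be graphed over time (this is not OpenTripPlanner's actual setup), each run could append one row to a CSV file. The class name `PerfLog`, the file name `perf-log.csv`, and the column layout (revision, feed name, elapsed milliseconds) are all assumptions made for this sketch.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.Collections;

/** Hypothetical helper (not part of the validator): logs one timing row per run. */
class PerfLog {

    /** Builds a CSV row: revision,feedName,elapsedMillis. */
    static String formatRow(String revision, String feedName, long elapsedMillis) {
        return revision + "," + feedName + "," + elapsedMillis;
    }

    /** Appends the row to the CSV file, creating the file if it does not exist. */
    static void append(Path csvFile, String row) throws IOException {
        Files.write(csvFile, Collections.singletonList(row), StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    public static void main(String[] args) throws IOException {
        long start = System.nanoTime();
        // Placeholder: a real run would process a feed here.
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
        append(Paths.get("perf-log.csv"), formatRow("local", "example-feed", elapsedMillis));
    }
}
```

Plotting the accumulated rows per revision would show trends without failing a build on a single slow CI run.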

barbeau commented 2 years ago

Comment by derhuerst Tuesday Dec 14, 2021 at 16:41 GMT

DELFI e.V. is a non-profit that aggregates the transit datasets of all the local transit authorities/providers to create a unified feed for Germany. Its official role is to publish NeTEx, as mandated by EU regulation.

But it also publishes a GTFS feed generated from the merged data, which is currently 333 MB in size. Its official site doesn't provide a direct & script-friendly URL for it (🙄), but @juliuste kindly mirrors it to

Currently, it is not much larger than the Dutch feed, but it will likely grow over the coming months & years as missing regions as well as lots of stop/station & pathways.txt topologies are added.

Edit: Unfortunately, to my knowledge, there are no realtime feeds available right now.