MobilityData / gtfs-realtime-validator

Java-based tool that validates General Transit Feed Specification (GTFS)-realtime feeds

Add tests for big feeds #30

Open barbeau opened 2 years ago

barbeau commented 2 years ago

Issue by barbeau, Wednesday Apr 12, 2017 at 15:21 GMT. Originally opened as https://github.com/CUTR-at-USF/gtfs-realtime-validator/issues/123


Summary:

We need to make sure that as we add new rules, the validator can continue to run in real-time on production-sized feeds for major cities.

I posted a question on the GTFS-rt list asking for examples of very large feeds: https://groups.google.com/forum/#!topic/gtfs-realtime/mM8cQIIV_-Y

These have been suggested to me so far, with largest coming first:

We should add some unit tests that do basic benchmarking to ensure we're not exceeding a given duration when processing feeds. I think 2 seconds may be reasonable, but we'll need to test. We'll also need to figure out how this works for CI, as Travis is significantly underpowered when compared to a typical desktop.
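As a rough illustration of the benchmarking idea above, here is a minimal sketch of a timing check against a fixed budget. The class name, the helper methods, and the placeholder workload are all hypothetical and are not part of the validator; a real test would call the validator's own processing code, and the 2-second budget is only the tentative figure mentioned above.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: none of these names exist in the validator.
class FeedBenchmarkSketch {
    /** Tentative budget from the discussion; would need tuning, especially on CI. */
    static final long MAX_MILLIS = 2000;

    /** Runs the task once and returns the elapsed wall-clock time in milliseconds. */
    static long timeMillis(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }

    /** Returns true if a single run of the task finishes within the budget. */
    static boolean withinBudget(Runnable task, long budgetMillis) {
        return timeMillis(task) <= budgetMillis;
    }

    /** Placeholder workload standing in for one GTFS-rt validation iteration. */
    static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        // Assumption: a real benchmark would validate a stored feed snapshot here.
        Runnable fakeIteration = () -> sleepQuietly(100);
        long elapsed = timeMillis(fakeIteration);
        System.out.println("elapsed ms: " + elapsed
                + ", within budget: " + (elapsed <= MAX_MILLIS));
    }
}
```

A single timed run is noisy, so a real test would probably warm up the JVM and take the median of several runs before comparing against the budget.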

barbeau commented 2 years ago

Comment by barbeau Wednesday Apr 12, 2017 at 19:15 GMT


If I try to run the Dutch feed with the -Xmx8g JVM parameter on my machine (dual Xeon @ 2.5 GHz w/ 16GB RAM), I get this exception after it runs for a very long time (I left and came back an hour later):

javax.servlet.ServletException: org.glassfish.jersey.server.ContainerException: java.lang.OutOfMemoryError: GC overhead limit exceeded
    at org.glassfish.jersey.servlet.WebComponent.service(WebComponent.java:423)
    at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:386)
    at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:334)
    at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:221)
    at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:800)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:587)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1125)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1059)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
    at org.eclipse.jetty.server.Server.handle(Server.java:497)
    at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:313)
    at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:248)
    at org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:626)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:546)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.glassfish.jersey.server.ContainerException: java.lang.OutOfMemoryError: GC overhead limit exceeded
    at org.glassfish.jersey.servlet.internal.ResponseWriter.rethrow(ResponseWriter.java:256)
    at org.glassfish.jersey.servlet.internal.ResponseWriter.failure(ResponseWriter.java:238)
    at org.glassfish.jersey.server.ServerRuntime$Responder.process(ServerRuntime.java:486)
    at org.glassfish.jersey.server.ServerRuntime$2.run(ServerRuntime.java:316)
    at org.glassfish.jersey.internal.Errors$1.call(Errors.java:271)
    at org.glassfish.jersey.internal.Errors$1.call(Errors.java:267)
    at org.glassfish.jersey.internal.Errors.process(Errors.java:315)
    at org.glassfish.jersey.internal.Errors.process(Errors.java:297)
    at org.glassfish.jersey.internal.Errors.process(Errors.java:267)
    at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:317)
    at org.glassfish.jersey.server.ServerRuntime.process(ServerRuntime.java:291)
    at org.glassfish.jersey.server.ApplicationHandler.handle(ApplicationHandler.java:1140)
    at org.glassfish.jersey.servlet.WebComponent.service(WebComponent.java:403)
    ... 17 more
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.lang.AbstractStringBuilder.<init>(AbstractStringBuilder.java:68)
    at java.lang.StringBuilder.<init>(StringBuilder.java:89)
    at org.onebusaway.csv_entities.DelimitedTextParser.parse(DelimitedTextParser.java:65)
    at org.onebusaway.csv_entities.CSVLibrary.parse(CSVLibrary.java:131)
    at org.onebusaway.csv_entities.CsvTokenizerStrategy.parse(CsvTokenizerStrategy.java:34)
    at org.onebusaway.csv_entities.CsvEntityReader.readEntities(CsvEntityReader.java:154)
    at org.onebusaway.csv_entities.CsvEntityReader.readEntities(CsvEntityReader.java:120)
    at org.onebusaway.csv_entities.CsvEntityReader.readEntities(CsvEntityReader.java:115)
    at org.onebusaway.gtfs.serialization.GtfsReader.run(GtfsReader.java:172)
    at org.onebusaway.gtfs.serialization.GtfsReader.run(GtfsReader.java:160)
    at com.conveyal.gtfs.validator.json.FeedProcessor.load(FeedProcessor.java:73)
    at com.conveyal.gtfs.validator.json.FeedProcessor.run(FeedProcessor.java:44)
    at edu.usf.cutr.gtfsrtvalidator.api.resource.GtfsFeed.postGtfsFeed(GtfsFeed.java:180)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory$1.invoke(ResourceMethodInvocationHandlerFactory.java:81)
    at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1.run(AbstractJavaResourceMethodDispatcher.java:144)
    at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.invoke(AbstractJavaResourceMethodDispatcher.java:161)
    at org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$ResponseOutInvoker.doDispatch(JavaResourceMethodDispatcherProvider.java:160)
    at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.dispatch(AbstractJavaResourceMethodDispatcher.java:99)
    at org.glassfish.jersey.server.model.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:389)
    at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:347)
    at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:102)
    at org.glassfish.jersey.server.ServerRuntime$2.run(ServerRuntime.java:308)
    at org.glassfish.jersey.internal.Errors$1.call(Errors.java:271)
    at org.glassfish.jersey.internal.Errors$1.call(Errors.java:267)
    at org.glassfish.jersey.internal.Errors.process(Errors.java:315)
    at org.glassfish.jersey.internal.Errors.process(Errors.java:297)
    at org.glassfish.jersey.internal.Errors.process(Errors.java:267)
    at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:317)

So it looks like it's getting hung up in the static GTFS validation using the Conveyal gtfs-validator.
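To narrow down where the memory goes, one option is to log heap usage around the loading step. The following is a minimal sketch using only the standard `Runtime` API; the class name and the placeholder allocation are hypothetical, and the real GTFS loading call would go where the comment indicates.

```java
// Hypothetical sketch: reports JVM heap figures around a loading step.
class HeapReportSketch {
    static final long MB = 1024L * 1024L;

    /** Maximum heap the JVM will try to use, in MB (reflects -Xmx when set). */
    static long maxHeapMb() {
        return Runtime.getRuntime().maxMemory() / MB;
    }

    /** Approximate heap currently in use, in MB. */
    static long usedHeapMb() {
        Runtime rt = Runtime.getRuntime();
        return (rt.totalMemory() - rt.freeMemory()) / MB;
    }

    public static void main(String[] args) {
        System.out.println("max heap: " + maxHeapMb() + " MB");
        long before = usedHeapMb();
        // Placeholder allocation; the real static GTFS load would run here instead.
        long[] filler = new long[1_000_000];
        long after = usedHeapMb();
        System.out.println("used before: " + before + " MB, after: " + after
                + " MB (allocated " + filler.length + " longs)");
    }
}
```

These figures are approximate because garbage collection can run at any point, so a profiler would still give a more reliable picture.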

If I run the Dutch GTFS-rt feed with unrelated GTFS data (I used HART in Tampa), then it processes each GTFS-rt iteration in about 1.1 seconds.

Using MBTA data, it processes each GTFS-rt iteration in about 1.1 seconds as well.

barbeau commented 2 years ago

Comment by barbeau Wednesday May 03, 2017 at 21:16 GMT


Here's a good list of GTFS-rt feeds from Transitfeeds.com: http://transitfeeds.com/search?q=gtfsrt

barbeau commented 2 years ago

Comment by barbeau Monday May 08, 2017 at 19:40 GMT


Transitland issue for adding support for GTFS-rt feeds - https://github.com/transitland/transitland/issues/77.

barbeau commented 2 years ago

Comment by barbeau Tuesday Sep 19, 2017 at 17:55 GMT


We could use the batch processor for benchmarking feed processing times - see README "Configuration options -> Batch processing": https://github.com/CUTR-at-USF/gtfs-realtime-validator#configuration-options

barbeau commented 2 years ago

Comment by skjolber Sunday Mar 11, 2018 at 10:01 GMT


@barbeau did you try running the out-of-memory dataset using a profiler?

barbeau commented 2 years ago

Comment by barbeau Sunday Mar 11, 2018 at 21:17 GMT


No, not yet.

barbeau commented 2 years ago

Comment by barbeau Tuesday Dec 14, 2021 at 15:24 GMT


A good approach for this might be to graph performance on each PR instead of imposing hard limits via a unit test - that's what OpenTripPlanner is doing here: https://github.com/opentripplanner/OpenTripPlanner/pull/3783
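For graphing instead of enforcing a hard limit, each run would need to record its timing somewhere a plotting step can read. Below is a minimal sketch that appends one CSV row per measurement. The class, the column layout (revision, feed name, milliseconds), and the file name are assumptions for illustration only, and this is not a description of how OpenTripPlanner implements it.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.Collections;

// Hypothetical sketch: appends timing rows that a later step could plot per PR.
class TimingLogSketch {
    /** Formats one measurement as "revision,feedName,millis". */
    static String formatRow(String revision, String feedName, long millis) {
        return revision + "," + feedName + "," + millis;
    }

    /** Appends one measurement to the CSV file, creating the file if needed. */
    static void appendRow(Path csv, String revision, String feedName, long millis) {
        try {
            Files.write(csv,
                    Collections.singletonList(formatRow(revision, feedName, millis)),
                    StandardCharsets.UTF_8,
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Assumed file name and sample values, for illustration only.
        Path csv = Paths.get("benchmark-timings.csv");
        appendRow(csv, "abc123", "sample-feed", 1100);
        System.out.println("appended a row to " + csv.toAbsolutePath());
    }
}
```

Keeping the raw numbers rather than a pass/fail result also sidesteps the problem of picking one threshold that works on both CI and desktop hardware.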

barbeau commented 2 years ago

Comment by derhuerst Tuesday Dec 14, 2021 at 16:41 GMT


DELFI e.V. is a non-profit that aggregates the transit datasets of all the local transit authorities/providers to create a unified feed for Germany. Its official role is to publish NeTEx, as mandated by the EU regulation.

But it also publishes a GTFS feed generated from the merged data, which is currently 333 MB in size. Its official site doesn't provide a direct & script-friendly URL for it (🙄), but @juliuste kindly mirrors it to https://de.data.public-transport.earth/gtfs-germany.zip.

Currently, it is not much larger than the Dutch feed, but it will likely grow over the coming months & years as missing regions and lots of stop/station & pathways.txt topologies are added.

Edit: Unfortunately, to my knowledge, there are no realtime feeds available right now.