etalab / transport-validator

GTFS validator
https://transport.data.gouv.fr/validation/
MIT License

Memory efficient validation #172

Closed AntoineAugusti closed 9 months ago

AntoineAugusti commented 1 year ago

The IDFM GTFS is getting quite big. Today, 2023-08-16, the ZIP archive is 103 MB.

Line counts by file:

      85 agency.txt
    1151 calendar.txt
    2736 calendar_dates.txt
    4434 pathways.txt
    1761 routes.txt
  230555 stop_extensions.txt
 12970342 stop_times.txt
   51052 stops.txt
  189757 transfers.txt
  567651 trips.txt
 14019524 total

Validating the file requires more than 4 GB of RAM.

thbar commented 1 year ago

I wonder if there is a way to make the validator work with more or less constant RAM usage. I believe it's probably not easy given that everything is loaded in RAM, but maybe there is a possibility?

Chances are that this specific GTFS will keep growing, if the public transportation offer continues improving.

AntoineAugusti commented 1 year ago

if the public transportation offer continues improving.

the file is bigger now because the IDFM scope is bigger and integrates more agencies 😏

AntoineAugusti commented 1 year ago

if the public transportation offer continues improving

we all wish it 🤞🍀

antoine-de commented 1 year ago

I'll try to do a memory profile to see if there are some quick wins, but the unzipped data weighs more than 1 GB, so it will always take a fair amount of RAM to validate this, especially since we need to maintain some temporary structures to speed up validation.

antoine-de commented 1 year ago

A first quick analysis with heaptrack (I can share the .zst file if someone also wants to dig):

So no quick win in the validator crate, the allocation is mostly done in gtfs-structure :confused:

Tristramg commented 1 year ago

The largest memory consumer is probably the StopTime list. Right now a RawStopTime is 136 bytes, so it’s not really worth optimizing here (the way we did at some point in the past, when the time was stored in a 16-bit integer instead of 32 bits to save some memory).

Using i8 instead of i32 for PickupDropOffType::Unknown brings the RawStopTime down to 112 bytes. Not sure that the memory gain is worth a fatal error if some user types "12345". With i16, it is 120 bytes. I think we experimented with SmartStrings to avoid allocation on Ids, but I can’t remember the result.

Maybe with both combined we could save 100 MB?
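To make the per-record reasoning above concrete, here is a rough sketch. The structs are hypothetical, trimmed-down stand-ins and not the real gtfs-structure types, so the sizes they report will not match the 136/112/120 figures quoted above. The `inline_bytes` helper and the assumption that stop_times.txt has exactly one header line are also mine. The estimate only counts the inline size of each record; heap data such as string contents and any spare Vec capacity come on top.

```rust
use std::mem::size_of;

// Hypothetical stand-in: unknown values kept as a 32-bit integer.
#[allow(dead_code)]
enum PickupDropOffWide {
    Regular,
    NotAvailable,
    Unknown(i32),
}

// Same enum, but unknown values kept as an 8-bit integer.
#[allow(dead_code)]
enum PickupDropOffNarrow {
    Regular,
    NotAvailable,
    Unknown(i8),
}

// Hypothetical, heavily simplified record using the wide enum.
#[allow(dead_code)]
struct StopTimeWide {
    trip_id: String,
    stop_id: String,
    stop_sequence: u32,
    pickup_type: PickupDropOffWide,
    drop_off_type: PickupDropOffWide,
}

// Same record using the narrow enum.
#[allow(dead_code)]
struct StopTimeNarrow {
    trip_id: String,
    stop_id: String,
    stop_sequence: u32,
    pickup_type: PickupDropOffNarrow,
    drop_off_type: PickupDropOffNarrow,
}

/// Inline bytes needed to hold `records` values of `bytes_per_record` each.
/// Heap data owned by each record is not counted.
fn inline_bytes(records: u64, bytes_per_record: u64) -> u64 {
    records * bytes_per_record
}

/// Builds a small report: sizes of the simplified structs, then the estimate
/// for the stop_times.txt line count quoted in this issue.
fn report() -> String {
    // 12,970,342 lines in the issue, minus one assumed header line.
    let records: u64 = 12_970_342 - 1;
    format!(
        "simplified structs: wide = {} B, narrow = {} B\n\
         at 136 B per record: {} bytes inline",
        size_of::<StopTimeWide>(),
        size_of::<StopTimeNarrow>(),
        inline_bytes(records, 136),
    )
}
```

With the 136-byte figure, the roughly 13 million stop times come to about 1.76 GB of inline data. Passing 112 or 120 to `inline_bytes` gives the matching estimate for the i8 and i16 variants.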

Maybe there is something to be gained in the conversion. Right now, StopTime::from(stop_time_gtfs: &RawStopTime, stop: Arc<Stop>) uses a reference. Could we consume the StopTime to avoid having two objects alive at the same time? (unrelated, but as I was looking into the code, if during the conversion we sort the rawStopTimes by trip_id, we could maybe have a performance gain, without having to lookup
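A minimal sketch of the borrowing versus consuming conversion mentioned above, using hypothetical trimmed-down types (the field set and the `from_ref` / `from_owned` names are assumptions for illustration, not the actual gtfs-structure API). The point is only that taking the raw record by value lets owned fields be moved instead of cloned, so the two copies of a string never coexist.

```rust
use std::sync::Arc;

// Hypothetical, simplified types; the real ones have many more fields.
#[allow(dead_code)]
struct Stop {
    id: String,
}

#[allow(dead_code)]
struct RawStopTime {
    trip_id: String,
    stop_id: String,
    stop_headsign: Option<String>,
    stop_sequence: u32,
}

#[allow(dead_code)]
struct StopTime {
    stop: Arc<Stop>,
    stop_headsign: Option<String>,
    stop_sequence: u32,
}

impl StopTime {
    /// Borrowing conversion: `raw` stays alive afterwards, so every owned
    /// field has to be cloned, which allocates a second copy.
    #[allow(dead_code)]
    fn from_ref(raw: &RawStopTime, stop: Arc<Stop>) -> Self {
        StopTime {
            stop,
            stop_headsign: raw.stop_headsign.clone(),
            stop_sequence: raw.stop_sequence,
        }
    }

    /// Consuming conversion: owned fields are moved out of `raw` without
    /// allocating, and the fields that are not reused (trip_id, stop_id)
    /// are dropped when this function returns.
    #[allow(dead_code)]
    fn from_owned(raw: RawStopTime, stop: Arc<Stop>) -> Self {
        StopTime {
            stop,
            stop_headsign: raw.stop_headsign,
            stop_sequence: raw.stop_sequence,
        }
    }
}
```

Whether this helps in practice depends on how the raw records are stored: moving them out of a `Vec` with `into_iter()` releases each record's heap data as it is converted, but the vector's own buffer is only freed once the iterator is dropped.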

Tristramg commented 1 year ago

I would be curious to know how much memory other validators consume, do you have an order of magnitude?

AntoineAugusti commented 1 year ago

@Tristramg The amazing @thbar did the test yesterday with MobilityData's validator, v4.1.

It took 24s to validate on his machine with a cap at 3 GB of RAM.

$ time java -jar -Xmx3072M gtfs-validator-4.1.0-cli.jar --input IDFM-gtfs.zip -o .

thbar commented 1 year ago

@Tristramg to be clear I have no idea how the validation reports compare :smile: I only looked at the consumption so far.

thbar commented 1 year ago

It took 24s to validate on his machine with a cap at 3 GB of RAM.

I compared both on my machine, and both validators perform similarly in terms of duration.

antoine-de commented 1 year ago

The problem is that we need, during the create_trips method, to have both the old trips/stop_times vec and the new one, so the memory is doubled. I don't see an obvious way to better handle that (without sacrificing too much performance) :thinking:

antoine-de commented 1 year ago

The way I see it we have several choices:

antoine-de commented 11 months ago

We did a quick POC to try using a String pool to reduce the memory usage (here)

The idea is to use a shared ID to limit the memory cost of having the same stop ID in a great number of stop times.

The huge downside to this approach is that the Strings are never released, creating an unbounded memory leak when GTFS structures is used in a long-running application (like a web service).

A quick benchmark on the IDFM dataset (for the dataset available on 20230915) shows a 10 % memory reduction (and a small speedup, likely due to cheaper comparison operations).

Note: the cache takes 23M for this dataset (using https://docs.rs/ustr/latest/ustr/fn.total_allocated.html)
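For readers unfamiliar with string pools, here is a minimal, self-contained sketch of the idea using only the standard library. It is not how ustr or the POC is implemented. The `StringPool` type and its methods are hypothetical names chosen for illustration. Each distinct ID is stored once and handed out as a cheap shared handle.

```rust
use std::collections::HashSet;
use std::sync::Arc;

/// Hypothetical minimal string pool: each distinct string is allocated once
/// and every caller gets a shared handle to that single allocation.
#[derive(Default)]
struct StringPool {
    strings: HashSet<Arc<str>>,
}

impl StringPool {
    /// Returns a shared handle for `s`, allocating only the first time
    /// this exact string is seen.
    fn intern(&mut self, s: &str) -> Arc<str> {
        if let Some(existing) = self.strings.get(s) {
            return Arc::clone(existing);
        }
        let shared: Arc<str> = Arc::from(s);
        self.strings.insert(Arc::clone(&shared));
        shared
    }

    /// Number of distinct strings currently held by the pool.
    fn len(&self) -> usize {
        self.strings.len()
    }
}

/// Example: the same stop ID seen in many stop times is stored once.
#[allow(dead_code)]
fn distinct_ids(stop_ids: &[&str]) -> usize {
    let mut pool = StringPool::default();
    let _handles: Vec<Arc<str>> = stop_ids.iter().map(|id| pool.intern(id)).collect();
    pool.len()
}
```

In this sketch the pool keeps a strong reference to every string, so nothing is released until the pool itself is dropped. With a process-wide pool that is never dropped, that is the same never-released behaviour as the downside described above. Comparing two handles from the same pool can be done by pointer (`Arc::ptr_eq`), which may be where a speedup from cheaper comparisons would come from.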

Tristramg commented 9 months ago

Did you test how the validator behaves now? Should we investigate further memory optimizations?

AntoineAugusti commented 9 months ago

@Tristramg @antoine-de It seems to be enough for now! We managed to scale down the instance type from L to M after the changes on Clever Cloud and we did not see reboots/OOM.

Great job 👍

Fixed for now?

antoine-de commented 9 months ago

yes, we can close this, and reopen if needed.