Memory efficient validation

AntoineAugusti commented 1 year ago

The IDFM GTFS is getting quite big. Today, 2023-08-16, the ZIP archive is 103 MB.

Line counts by file:

      85 agency.txt
    1151 calendar.txt
    2736 calendar_dates.txt
    4434 pathways.txt
    1761 routes.txt
  230555 stop_extensions.txt
 12970342 stop_times.txt
   51052 stops.txt
  189757 transfers.txt
  567651 trips.txt
 14019524 total

Validating the file requires more than 4 GB of RAM.

Would it be possible to lower this amount easily or would it be a big project?
Which steps require high RAM usage?

thbar commented 1 year ago

I wonder if there is a way to make the validator work with a more or less constant RAM usage. It's probably not easy given that everything is loaded in RAM, but maybe there is a possibility?

Chances are that this specific GTFS will keep growing if the public transportation offer continues improving.

AntoineAugusti commented 1 year ago

> if the public transportation offer continues improving.

the file is bigger now because the IDFM scope is bigger and integrates more agencies 😏

AntoineAugusti commented 1 year ago

> if the public transportation offer continues improving

we all wish it 🤞🍀

antoine-de commented 1 year ago

I'll try to do a memory profile to see if there are some quick wins, but the unzipped data weighs more than 1 GB, so it will always take a fair bit of RAM to validate this, especially since we need to maintain some temporary structures to speed up validation.

antoine-de commented 1 year ago

A first quick analysis with heaptrack (I can share the .zst file if someone else wants to dig):

4.4G taken as a whole
- 2.3G to read the RawGtfs
- 2.1G for the validation, but it's almost exclusively the conversion between RawGTFS and GTFS (and in the create_trips method)

So no quick win in the validator crate, the allocation is mostly done in gtfs-structure :confused:

Tristramg commented 1 year ago

The largest memory consumer is probably the StopTime list. Right now a RawStopTime is 136 bytes, so it’s not really worth optimizing here (at some point in the past, the time was stored in a 16-bit integer rather than a 32-bit one to save some memory).

Using i8 instead of i32 for PickupDropOffType::Unknown brings the RawStopTime down to 112 bytes. Not sure that the memory gain is worth a fatal error if some user types "12345". With i16, it is 120 bytes.

I think we experimented with SmartStrings to avoid allocation on Ids, but I can’t remember the result.

Maybe with both combined we could save 100 MB?
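
To make the size discussion concrete, here is a minimal sketch for measuring how the integer width of the `Unknown` payload changes the size of such an enum. The three enums are hypothetical stand-ins (the variant names other than `Unknown` are assumptions, not copied from gtfs-structure), and `inline_bytes` is an illustrative helper that only counts the inline part of each record, not heap-allocated strings.

```rust
use std::mem::size_of;

// Hypothetical stand-ins for the enum discussed above: same variants,
// only the integer width of the `Unknown` payload differs.
#[allow(dead_code)]
pub enum PickupI32 { Regular, NotAvailable, ArrangeByPhone, CoordinateWithDriver, Unknown(i32) }
#[allow(dead_code)]
pub enum PickupI16 { Regular, NotAvailable, ArrangeByPhone, CoordinateWithDriver, Unknown(i16) }
#[allow(dead_code)]
pub enum PickupI8 { Regular, NotAvailable, ArrangeByPhone, CoordinateWithDriver, Unknown(i8) }

/// In-memory size in bytes of the three enums, in the order (i32, i16, i8).
pub fn payload_sizes() -> (usize, usize, usize) {
    (size_of::<PickupI32>(), size_of::<PickupI16>(), size_of::<PickupI8>())
}

/// Bytes taken by the inline part of `rows` records of `bytes_per_row` bytes each.
/// Heap allocations owned by the records (e.g. `String` buffers) are not counted.
pub fn inline_bytes(rows: u64, bytes_per_row: u64) -> u64 {
    rows * bytes_per_row
}
```

Taking the figures quoted in this thread at face value (12,970,342 lines in stop_times.txt, so 12,970,341 rows after the header, at 136 versus 112 bytes per RawStopTime), `inline_bytes` gives a difference of 311,288,184 bytes, roughly 311 MB, for the inline part alone. How much of that shows up in the peak depends on how many copies of the list are alive at once.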

Maybe there is something to be gained in the conversion. Right now, StopTime::from(stop_time_gtfs: &RawStopTime, stop: Arc<Stop>) uses a reference. Could we consume the StopTime to avoid having two objects alive at the same time? (unrelated, but as I was looking into the code, if during the conversion we sort the rawStopTimes by trip_id, we could maybe have a performance gain, without having to lookup
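
A minimal sketch of what a consuming conversion could look like, assuming heavily simplified, hypothetical versions of `Stop`, `RawStopTime` and `StopTime` (the field lists, `from_raw` and `convert` are illustrative and are not the gtfs-structure API). The raw record is taken by value so its owned fields are moved rather than cloned.

```rust
use std::collections::HashMap;
use std::sync::Arc;

// Simplified, hypothetical stand-ins for the real structures.
pub struct Stop {
    pub id: String,
}

pub struct RawStopTime {
    pub trip_id: String,
    pub stop_id: String,
    pub stop_sequence: u32,
    pub stop_headsign: Option<String>,
}

pub struct StopTime {
    pub stop: Arc<Stop>,
    pub stop_sequence: u32,
    pub stop_headsign: Option<String>,
}

impl StopTime {
    /// Takes the raw record by value: owned fields are moved, not cloned,
    /// and whatever is not reused is dropped when this function returns.
    pub fn from_raw(raw: RawStopTime, stop: Arc<Stop>) -> Self {
        StopTime {
            stop,
            stop_sequence: raw.stop_sequence,
            stop_headsign: raw.stop_headsign,
        }
    }
}

/// Consumes the raw list and groups the converted stop times by trip id.
/// Rows pointing at an unknown stop are silently skipped in this sketch.
pub fn convert(
    raw: Vec<RawStopTime>,
    stops: &HashMap<String, Arc<Stop>>,
) -> HashMap<String, Vec<StopTime>> {
    let mut by_trip: HashMap<String, Vec<StopTime>> = HashMap::new();
    for mut r in raw {
        let Some(stop) = stops.get(&r.stop_id) else { continue };
        // Move the trip id out of the raw record instead of cloning it.
        let trip_id = std::mem::take(&mut r.trip_id);
        by_trip
            .entry(trip_id)
            .or_default()
            .push(StopTime::from_raw(r, Arc::clone(stop)));
    }
    by_trip
}
```

One caveat: iterating a `Vec` by value keeps its backing buffer allocated until the loop ends, so this only avoids duplicating the heap-owned fields (strings), not the inline part of every record.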

Tristramg commented 1 year ago

I would be curious to know how much memory other validators consume, do you have an order of magnitude?

AntoineAugusti commented 1 year ago

@Tristramg The amazing @thbar did the test yesterday with MobilityData's validator, v4.1.

It took 24s to validate on his machine with a cap at 3 GB of RAM.

$ time java -jar -Xmx3072M gtfs-validator-4.1.0-cli.jar --input IDFM-gtfs.zip -o .

thbar commented 1 year ago

@Tristramg to be clear I have no idea how the validation reports compare :smile: I only looked at the consumption so far.

thbar commented 1 year ago

> It took 24s to validate on his machine with a cap at 3 GB of RAM.

I compared both on my machine, and both validators perform similarly in terms of duration.

antoine-de commented 1 year ago

The problem is that we need, during the create_trips method, to have both the old trips/stop_times vec and the new one, so the memory is doubled. I don't see an obvious way to handle that better (without sacrificing too much performance) :thinking:

antoine-de commented 1 year ago

The way I see it, we have several choices:

1. Leave it as is and accept the memory footprint.
2. Use RawGTFS everywhere in the validator. The max memory would be ~50% of the current peak, and more rules would be run (fewer Fatal errors). But the ergonomics for some rules would be worse, and the RawGTFS -> Gtfs conversion brings some checks that it would be a pity to lose (but maybe we can skip this conversion for big datasets :man_shrugging:?).
3. Find some way to improve things as they are:
   - Maybe if we change the RawGtfs.stop_times container to something like the C++ deque, we would be able to release part of RawGtfs.stop_times while allocating the new Gtfs.stop_time? (But Rust's VecDeque is not the same, it's just a ring buffer.)
   - Maybe we can find an unsafe way to change the Vec<RawStopTime> into a Vec<StopTime>, but I checked and it does not seem possible (especially since RawStopTime is bigger than a StopTime).
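
The "release part of the raw list while allocating the new one" idea can be approximated in safe Rust with a chunked container. This is only a sketch under assumptions: it presumes the reader could produce the raw stop times as a `Vec<Vec<RawStopTime>>` in the first place, and the record types, `parse_hms`, `convert_one` and `convert_chunked` are hypothetical names, not part of gtfs-structure.

```rust
// Hypothetical, simplified records.
pub struct RawStopTime {
    pub trip_id: String,
    pub stop_sequence: u32,
    pub arrival: Option<String>,
}

pub struct StopTime {
    pub stop_sequence: u32,
    pub arrival_secs: Option<u32>,
}

/// Parses "HH:MM:SS" into seconds since midnight (hours may exceed 23 in GTFS).
fn parse_hms(s: &str) -> Option<u32> {
    let mut parts = s.split(':').map(|p| p.trim().parse::<u32>().ok());
    let (h, m, sec) = (parts.next()??, parts.next()??, parts.next()??);
    Some(h * 3600 + m * 60 + sec)
}

fn convert_one(raw: RawStopTime) -> StopTime {
    StopTime {
        stop_sequence: raw.stop_sequence,
        arrival_secs: raw.arrival.as_deref().and_then(parse_hms),
    }
}

/// Converts chunk by chunk. Each chunk's buffer is freed as soon as it has been
/// consumed, so at most one raw chunk overlaps with the growing output.
pub fn convert_chunked(chunks: Vec<Vec<RawStopTime>>) -> Vec<StopTime> {
    let total: usize = chunks.iter().map(Vec::len).sum();
    let mut out = Vec::with_capacity(total);
    for chunk in chunks {
        // `chunk` is owned here; it is dropped (and deallocated) right after `extend`.
        out.extend(chunk.into_iter().map(convert_one));
    }
    out
}
```

The trade-off is that the raw data is no longer one contiguous slice, so any code that indexes or sorts the whole raw list would have to be adapted.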

antoine-de commented 11 months ago

We did a quick POC to try using a String pool to reduce the memory usage (here)

The idea is to use a shared ID to limit the memory cost of having the same stop ID in a great number of stop times.

The huge downside to this approach is that the Strings are never released, creating an unbounded memory leak when GTFS structures is used in a long-running application (like a web service).

A quick benchmark on the IDFM dataset (the one available on 20230915) shows a 10 % memory reduction (and a small speedup, likely due to cheaper comparison operations).

Note: the cache takes 23M for this dataset (using https://docs.rs/ustr/latest/ustr/fn.total_allocated.html)
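
For readers unfamiliar with the technique, here is a minimal std-only sketch of a string pool. It is not how ustr is implemented and not the code of the POC: `StringPool`, `intern`, `unique_strings` and `pooled_bytes` are hypothetical names, and unlike a global cache this pool releases its strings when it is dropped.

```rust
use std::collections::HashSet;
use std::sync::Arc;

/// A tiny string pool: equal strings share one heap allocation.
#[derive(Default)]
pub struct StringPool {
    strings: HashSet<Arc<str>>,
}

impl StringPool {
    /// Returns a shared handle to `s`, allocating only the first time `s` is seen.
    pub fn intern(&mut self, s: &str) -> Arc<str> {
        if let Some(existing) = self.strings.get(s) {
            return Arc::clone(existing);
        }
        let shared: Arc<str> = Arc::from(s);
        self.strings.insert(Arc::clone(&shared));
        shared
    }

    /// Number of distinct strings held by the pool.
    pub fn unique_strings(&self) -> usize {
        self.strings.len()
    }

    /// Total length in bytes of the distinct strings (handle overhead excluded).
    pub fn pooled_bytes(&self) -> usize {
        self.strings.iter().map(|s| s.len()).sum()
    }
}
```

Interning the same stop ID twice returns two handles to the same allocation, and comparing handles can be done by pointer with `Arc::ptr_eq`, which is one plausible reason for the small speedup mentioned above.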

Tristramg commented 9 months ago

Did you test how the validator behaves now? Should we investigate further memory optimizations?

AntoineAugusti commented 9 months ago

@Tristramg @antoine-de It seems to be enough for now! We managed to scale down the instance type from L to M after the changes on Clever Cloud and we did not see reboots/OOM.

Great job 👏

Fixed for now?

antoine-de commented 9 months ago

yes, we can close this, and reopen if needed.

etalab / transport-validator

Memory efficient validation #172