Closed AntoineAugusti closed 9 months ago
I wonder if there would be a way to make the validator work with a roughly constant RAM usage. I believe it's probably not easy given that everything is loaded in RAM, but maybe there is a possibility?
There are chances that this specific GTFS will keep increasing, if the public transportation offer continues improving.
the file is bigger now because the IDFM scope is bigger and integrates more agencies

> if the public transportation offer continues improving

we all wish it!
I'll try to do a memory profile to see if I spot some quick wins, but the unzipped data weighs more than 1 GB, so it will always take a fair amount of RAM to validate, especially since we need to maintain some temporary structures to speed up validation.
A first quick analysis with heaptrack (I can share the .zst file if someone else wants to dig):
So no quick win in the validator crate, the allocation is mostly done in gtfs-structure :confused:
The largest memory consumer is probably the StopTime list. Right now a `RawStopTime` is 136 bytes, so it's not really worth micro-optimizing here (at some point in the past, the time was stored in a 16-bit integer instead of 32 bits to save some memory).
Using `i8` instead of `i32` for `PickupDropOffType::Unknown` brings the `RawStopTime` down to 112 bytes. Not sure that the memory gain is worth a fatal error if some user types "12345". With `i16`, it's 120 bytes.
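To make the alignment effect concrete, here is a minimal sketch with hypothetical structs (not the real gtfs-structure types), showing how narrowing two 4-byte fields to 1 byte shrinks a struct once padding is accounted for:

```rust
// Hypothetical simplified structs, only to illustrate field width vs padding.
#[allow(dead_code)]
struct WithI32 {
    arrival_time: u32,
    pickup_type: i32,   // 4-byte representation for an enum-like field
    drop_off_type: i32, // total: 4 + 4 + 4 = 12 bytes
}

#[allow(dead_code)]
struct WithI8 {
    arrival_time: u32,
    pickup_type: i8,   // narrowed to 1 byte
    drop_off_type: i8, // 4 + 1 + 1 = 6 bytes, padded up to u32 alignment = 8
}
```

The same mechanism is what takes the real 136-byte `RawStopTime` down to 112 bytes when its enum-like fields are narrowed; `std::mem::size_of` is an easy way to check each variant.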
I think we experimented with SmartStrings to avoid allocations on IDs, but I can't remember the result.
Maybe with both combined we could save 100 MB?
Maybe there is something to be gained in the conversion. Right now, `StopTime::from(stop_time_gtfs: &RawStopTime, stop: Arc<Stop>)` takes a reference. Could we consume the `RawStopTime` to avoid having two objects alive at the same time?

(unrelated, but as I was looking into the code: if during the conversion we sorted the `RawStopTime`s by `trip_id`, we could maybe get a performance gain, without having to do lookups)
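A sketch of the borrow-vs-consume difference, with hypothetical simplified types (the real signature also takes an `Arc<Stop>`):

```rust
// Hypothetical simplified types; the real ones live in gtfs-structure.
struct RawStopTime {
    stop_id: String,
    arrival_time: u32,
}

struct StopTime {
    stop_id: String,
    arrival_time: u32,
}

// Borrowing conversion: the String must be cloned, and the RawStopTime
// stays alive in the caller, so both heap allocations coexist.
fn stop_time_from_ref(raw: &RawStopTime) -> StopTime {
    StopTime {
        stop_id: raw.stop_id.clone(),
        arrival_time: raw.arrival_time,
    }
}

// Consuming conversion: the String is moved out, its heap allocation is
// reused, and the RawStopTime is gone as soon as the call returns.
fn stop_time_from_owned(raw: RawStopTime) -> StopTime {
    StopTime {
        stop_id: raw.stop_id,
        arrival_time: raw.arrival_time,
    }
}
```

The consuming version avoids one `String` clone per stop time; whether that is compatible with the current API is exactly the open question above.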
I would be curious to know how much memory other validators consume, do you have an order of magnitude?
@Tristramg The amazing @thbar did the test yesterday with MobilityData's validator, v4.1.
It took 24s to validate on his machine with a cap at 3 GB of RAM.
$ time java -jar -Xmx3072M gtfs-validator-4.1.0-cli.jar --input IDFM-gtfs.zip -o .
@Tristramg to be clear I have no idea how the validations reports compare :smile: I only looked at the consumption so far.
> It took 24s to validate on his machine with a cap at 3 GB of RAM.
I compared both on my machine, and the two validators take a similar amount of time.
The problem is that we need, during the `create_trips` method, to have both the old trips/stop_times vecs and the new ones, so the memory is doubled.
I don't see an obvious way to better handle that (without sacrificing too much performance) :thinking:
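A minimal sketch (hypothetical simplified types) of why the peak doubles: even a consuming conversion keeps the source vector's backing buffer alive until the whole `collect` finishes.

```rust
// Hypothetical simplified types, not the real gtfs-structure API.
struct RawStopTime {
    stop_id: String,
}

struct StopTime {
    stop_id: String,
}

impl From<RawStopTime> for StopTime {
    fn from(raw: RawStopTime) -> Self {
        StopTime { stop_id: raw.stop_id }
    }
}

// Each RawStopTime is consumed one by one, but the source Vec's backing
// buffer is only freed once `collect` returns (when the IntoIter drops),
// so both buffers coexist at the memory peak.
fn create_stop_times(raw: Vec<RawStopTime>) -> Vec<StopTime> {
    raw.into_iter().map(StopTime::from).collect()
}
```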
The way I see it we have several choices:

- keep using `RawGTFS` everywhere in the validator: the max memory would be ~50% of the current peak, and more rules would be run (fewer `Fatal` errors). But the ergonomics for some rules would be worse, and the `RawGTFS` -> `Gtfs` conversion brings some checks of its own, so it would be a pity not to have them (but maybe we can skip this conversion for the big datasets :man_shrugging:?)
- change the `RawGtfs.stop_times` container to something like the C++ deque: we'd be able to release parts of `RawGtfs.stop_times` while allocating the new `Gtfs.stop_times` (but Rust's `VecDeque` is not the same, it's just a ring buffer)
- convert the `Vec<RawStopTime>` to a `Vec<StopTime>` in place, but I checked and it does not seem possible (especially since a `RawStopTime` is bigger than a `StopTime`)
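As a rough illustration of the second option, here is a hedged sketch (hypothetical types, not the real crate code) that drains a plain `Vec` in chunks and shrinks it as it goes. It bounds how much raw data coexists with the output, at the cost of an O(n) memmove per chunk that a real C++-style deque would avoid:

```rust
// Hypothetical simplified types.
struct RawStopTime {
    stop_id: String,
}

struct StopTime {
    stop_id: String,
}

fn convert_chunked(mut raw: Vec<RawStopTime>) -> Vec<StopTime> {
    const CHUNK: usize = 1024;
    let mut out = Vec::new();
    while !raw.is_empty() {
        let n = CHUNK.min(raw.len());
        // drain(..n) preserves order but shifts the remaining elements
        // forward, which is O(len) per chunk; a chunked deque would not
        // pay that cost.
        out.extend(raw.drain(..n).map(|r| StopTime { stop_id: r.stop_id }));
        // Ask the allocator to take back the freed part of the buffer.
        raw.shrink_to_fit();
    }
    out
}
```

This only caps the extra memory held in `RawStopTime` values themselves; whether the repeated `shrink_to_fit` reallocations are acceptable performance-wise would need benchmarking.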
I did a quick POC to try using a String pool to reduce the memory usage (here).
The idea is to use a shared ID to limit the memory cost of having the same stop ID in a great number of stop times.
The huge downside to this approach is that the Strings are never released, creating an unbounded memory leak when gtfs-structure is used in a long-lived application (like a web service).
A quick benchmark on the IDFM dataset (the one available on 2023-09-15) shows a 10% memory reduction (and a small speedup, likely due to cheaper comparison operations).
Note: the cache takes 23 MB for this dataset (measured with https://docs.rs/ustr/latest/ustr/fn.total_allocated.html)
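For illustration, here is a std-only sketch of the same string-pool idea (the POC itself used the `ustr` crate, which keeps a process-global cache):

```rust
use std::collections::HashMap;
use std::rc::Rc;

// Std-only interning sketch. Identical stop IDs share a single heap
// allocation via Rc<str>, but the cache itself holds one reference to
// every interned string forever, which is exactly the leak described
// above for long-lived processes.
#[derive(Default)]
struct StringPool {
    cache: HashMap<Box<str>, Rc<str>>,
}

impl StringPool {
    fn intern(&mut self, s: &str) -> Rc<str> {
        if let Some(shared) = self.cache.get(s) {
            return Rc::clone(shared);
        }
        let shared: Rc<str> = Rc::from(s);
        self.cache.insert(Box::from(s), Rc::clone(&shared));
        shared
    }
}
```

With this scheme, each stop time stores a pointer-sized handle instead of its own `String`; the saving grows with how often the same stop ID repeats across stop times.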
Did you test how the validator behaves now? Should we investigate further memory optimizations?
@Tristramg @antoine-de It seems to be enough for now! We managed to scale down the instance type from `L` to `M` on Clever Cloud after the changes, and we did not see reboots/OOMs.
Great job!
Fixed for now?
yes, we can close this, and reopen if needed.
The IDFM GTFS is starting to get quite big. Today, 2023-08-16, the ZIP archive is 103 MB.
Line counts by file:
Validating the file requires > 4 GB of RAM.