afimb / gtfslib-python

An open source library in python for reading GTFS files and computing various stats and indicators about Public Transport networks
GNU General Public License v3.0

Extremely high memory usage when loading a big GTFS feed #55

Closed: elad661 closed this issue 7 years ago

elad661 commented 7 years ago

I'm trying to load this feed ftp://gtfs.mot.gov.il/israel-public-transportation.zip using gtfsdbloader. The zip file is approximately 1GB when decompressed.

gtfsdbloader works okay until it gets to loading shape points, at which point it uses all the memory I have available (16 GB) and the system starts swapping. If I let it run, memory usage keeps growing, which would probably invoke the OOM killer once I run out of swap.

Is there any way to fix it apart from getting a machine with 32GB of RAM?
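For reference, the equivalent through the library's Python API (as shown in the project README) would look roughly like the sketch below; the database file name and feed path are placeholders:

```python
# Rough equivalent of the gtfsdbloader run, using the library API directly.
# "israel.sqlite" and the feed path are placeholders for this example.
from gtfslib.dao import Dao

dao = Dao("israel.sqlite")                         # SQLite database file to create/use
dao.load_gtfs("israel-public-transportation.zip")  # loading the feed is the step that exhausts memory
```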

laurentg commented 7 years ago

Where exactly does it consume memory? During shape import? You can try the following PR, which is meant to optimize memory usage: https://github.com/afimb/gtfslib-python/pull/46.

elad661 commented 7 years ago

I just tested, and that PR does indeed fix the problem when loading shapes (and makes that step faster, too). However, once it gets to normalizing them, memory usage grows rapidly again...

laurentg commented 7 years ago

@elad661 You can try the branch https://github.com/afimb/gtfslib-python/tree/fix-55, it's meant to optimize memory usage.

With lots of shapes and points, we use a lot of memory during the normalization process, because we keep track of the shape_point.shape_dist "old vs new" mapping for every shape in order to re-normalize stop_times.shape_dist_traveled. In this branch we process shape and trip normalization at the same time, keeping only the mapping for the current shape. It's slower because we need to load trips shape by shape, but memory usage should stay much lower.

We still keep a stop distance cache; if needed, we can also disable it.
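A self-contained sketch of that "one shape at a time" idea is below; data access is faked with plain Python dicts rather than the DAO/SQLAlchemy layer the library actually uses, and all names here are illustrative, not the code in the fix-55 branch:

```python
import math

def haversine_m(p1, p2):
    """Great-circle distance in meters between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p1, *p2))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371000 * math.asin(math.sqrt(a))

def normalize(shapes, trips_by_shape):
    """shapes: {shape_id: [(old_dist, lat, lon), ...]} in point-sequence order.
       trips_by_shape: {shape_id: [[old stop_time shape_dist, ...], ...]}.
       Shapes are processed one at a time, so only a single old->new mapping
       is alive in memory at any moment (the point of the fix)."""
    for shape_id, points in shapes.items():
        old_to_new, total, prev = {}, 0.0, None
        for old_dist, lat, lon in points:
            if prev is not None:
                total += haversine_m(prev, (lat, lon))
            old_to_new[old_dist] = total        # old shape_dist -> distance in meters
            prev = (lat, lon)
        # Re-map the stop_times of the trips using this shape, then let
        # old_to_new be garbage-collected before moving to the next shape.
        for trip_dists in trips_by_shape.get(shape_id, []):
            trip_dists[:] = [old_to_new.get(d, d) for d in trip_dists]
```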

elad661 commented 7 years ago

I might give it a try later, although I'm not sure it's worth slowing it down for everyone just for my (uncommon) use case of loading a really huge feed.

laurentg commented 7 years ago

No, we have to be able to handle arbitrarily sized GTFS feeds anyway. I do not think the slowdown is very significant; data loading is not the most expensive part of the process. You can try the latest commits on the same branch: I solved another issue (paging was not being used when loading shapes and points in the normalization process).

laurentg commented 7 years ago

Be aware that the normalization takes a lot of time. It could be worth adding an option to disable it when the data is already normalized (shape_dist in meters and all stop times present).
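Purely as an illustration of what "already normalized" could mean in practice, a heuristic check might look like the sketch below; the column names are standard GTFS, but the "distances look metric" threshold is only a guess and is not anything the library implements:

```python
import csv
import io
import zipfile

def looks_normalized(gtfs_zip_path):
    """Heuristic: are all stop times filled in, and do shape_dist_traveled
    values plausibly look like meters? Illustration only, not gtfslib code."""
    with zipfile.ZipFile(gtfs_zip_path) as z:
        with z.open("stop_times.txt") as f:
            reader = csv.DictReader(io.TextIOWrapper(f, "utf-8-sig"))
            max_dist = 0.0
            for row in reader:
                if not row.get("arrival_time") or not row.get("departure_time"):
                    return False          # missing times would need interpolation
                d = row.get("shape_dist_traveled")
                if d:
                    max_dist = max(max_dist, float(d))
    # A metric feed of realistic size should reach thousands of meters.
    return max_dist > 1000
```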

elad661 commented 7 years ago

I tried the branch, and it does seem to solve the memory issue. However, you're right about it taking "lots of time": it ran for more than 6 hours... until I had a power outage. I don't have time to run it to completion today, but I might try again tomorrow.

Perhaps a switch to disable normalization would be a good idea. (I'm still using your library, by the way; I worked around the lengthy import by writing a script that filters out the agencies I don't care about from the GTFS file. But that workaround only works for my current project, and won't work in the future if I need more than one agency.)
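A minimal sketch of that kind of agency filter is below; the agency_id values are placeholders, and a complete filter would also have to prune trips.txt, stop_times.txt, and shapes.txt for the removed routes:

```python
import csv
import io
import zipfile

KEEP_AGENCIES = {"2", "3"}   # placeholder agency_id values

def keep_rows(rows, keep):
    """Keep only rows whose agency_id is in the keep set."""
    return (row for row in rows if row.get("agency_id") in keep)

def filter_feed(src_path, dst_path, keep=KEEP_AGENCIES):
    with zipfile.ZipFile(src_path) as src, \
         zipfile.ZipFile(dst_path, "w", zipfile.ZIP_DEFLATED) as dst:
        for name in src.namelist():
            data = src.read(name)
            if name in ("agency.txt", "routes.txt"):
                reader = csv.DictReader(io.StringIO(data.decode("utf-8-sig")))
                out = io.StringIO()
                writer = csv.DictWriter(out, fieldnames=reader.fieldnames)
                writer.writeheader()
                writer.writerows(keep_rows(reader, keep))
                data = out.getvalue().encode("utf-8")
            dst.writestr(name, data)
    # NOTE: trips, stop_times and shapes belonging to the removed routes are
    # left in place here; a real filter would prune them as well.
```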

laurentg commented 7 years ago

If you do not need to interpolate missing stop times and do not rely on stop_times.shape_dist_traveled, then stop_times / shape normalization can be skipped. I'll add the option; in the meantime you can simply disable the corresponding code.

laurentg commented 7 years ago

@elad661 The option to disable normalization is implemented, see #60.