BlinkTagInc / node-gtfs

Import GTFS transit data into SQLite and query routes, stops, times, fares and more.

Out of memory during import #55

Closed: homtec closed this issue 7 years ago

homtec commented 8 years ago

Importing a GTFS file always ends with:

ns=0x16b2fe47a1e1 <String[14]: gtfs.stoptimes>, ops=0x24d80f184...
FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - JavaScript heap out of memory
Abort trap: 6

I also switched to a GTFS file of only 18 MB, with the same result.

derhuerst commented 8 years ago

See #45.

gerlacdt commented 8 years ago

should be fixed with: https://github.com/brendannee/node-gtfs/pull/56

derhuerst commented 8 years ago

#56 is not a permanent fix. The proper solution would be to rewrite node-gtfs to fully use streams in a non-blocking way. Splitting the data into chunks seems like a workaround.

balmoovel commented 8 years ago

@derhuerst : can you explain in more detail how using streams would solve the memory overflow issue?

My understanding is that using file streams instead of parsing the whole file at once would only reduce the amount of memory consumed by the file contents. However, in our tests this did not seem to be what caused the memory overflow; that rather seems to be caused by the callbacks attached to each insert operation.

Thus the solution implemented by @gerlacdt is not actually so much the division into chunks, but the use of insertMany. The division into chunks is only done because the Node.js driver internally runs into the same issue if a chunk is too big.
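
As an illustration, a minimal sketch of that chunked insertMany approach (not the actual PR code; `collection` is assumed to be a MongoDB collection and `lines` the already-parsed rows of one file):

```js
// Hypothetical sketch: insert rows in fixed-size chunks via insertMany so only
// one bulk operation (and one pending promise/callback) is in flight at a time,
// instead of one insert callback per row.
const CHUNK_SIZE = 10000;

async function insertInChunks(collection, lines) {
  for (let i = 0; i < lines.length; i += CHUNK_SIZE) {
    const chunk = lines.slice(i, i + CHUNK_SIZE);
    await collection.insertMany(chunk); // resolves before the next chunk starts
  }
}
```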

derhuerst commented 8 years ago

Node streams are basically just event emitters that "hold" data. Streams can be connected to other streams (see .pipe()). One can also read from streams manually (data event). This alone is nothing special when dealing with large amounts of data (larger than memory). Streams have a backpressure mechanism, which keeps the amount of data "held" in memory low. If one uses streams properly (no blocking operations, no non-flowing stream usage), Node.js scripts can deal with any amount of data. See the Stream Handbook for best practices.

In the build script, sync operations are being done. Also, all the data "held" in a stream is read into memory. This is why I think the Node process runs out of memory.
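
To make that concrete, here is a rough sketch (not node-gtfs code) of a fully stream-based import. It assumes the csv-parse module's transform-stream API and a hypothetical insertRow(row, cb) helper; because the Writable only acknowledges a row once the database write finishes, .pipe()'s backpressure keeps the number of buffered rows small:

```js
const fs = require('fs');
const { Writable } = require('stream');
const parse = require('csv-parse'); // assumed: csv-parse exposing a Transform stream

function importFile(path, insertRow, done) {
  // insertRow(row, cb) is a hypothetical helper that writes one row to the
  // database and calls cb once the write has been acknowledged.
  const sink = new Writable({
    objectMode: true,
    write(row, _encoding, callback) {
      // pipe() throttles the source while writes are pending (bounded by the
      // stream's highWaterMark), so only a handful of rows sit in memory.
      insertRow(row, callback);
    }
  });

  fs.createReadStream(path)
    .pipe(parse({ columns: true }))
    .pipe(sink)
    .on('finish', done)
    .on('error', done);
}
```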

gerlacdt commented 8 years ago

I think @balmoovel is right. The out-of-memory error was not caused by loading the files completely into memory. It was caused by the number of callbacks created in a loop.

Pseudo-Algorithm:

loop files (agency.txt, stops.txt, stop_times.txt)
    load file
    get csv lines
    loop lines
        mongodb.insert-callback(line)

For stop_times.txt this opened 2.6 million callbacks asynchronously! Too many for Node...
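
In a simplified sketch (not the actual node-gtfs code), the pattern looks roughly like this: every insert is fired immediately and nothing waits for it to complete, so all the pending callbacks and row objects accumulate at once:

```js
// All ~2.6 million inserts for stop_times.txt are started in one synchronous
// loop; each row plus its pending callback stays in memory until the database
// driver eventually catches up.
lines.forEach((line) => {
  collection.insert(line, (err) => {
    if (err) console.error(err);
  });
});
```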

So switching to "streams" is really a nice-to-have and more Node-like, but it will not solve the problem.

By the way, inserting millions of entries one by one is always a worse solution than inserting in batch mode (10,000 at once)!

derhuerst commented 8 years ago

> I think @balmoovel is right. The out-of-memory error was not caused by loading the files completely into memory. It was caused by the number of callbacks created in a loop.

I see this as a problem. For every file, the script loads the whole file (or the part of it that could be loaded from disk) into memory. The data gets transformed and then written into the db. But as long as the communication with the db isn't done, it's still all in memory.

Since the script loads the data synchronously (via a while loop), there's no chance for the db layer to actually write stuff to the db while it receives more and more data.

> For stop_times.txt this opened 2.6 million callbacks asynchronously! Too many for Node...

AFAIK that's not a problem. (Again, the data that is kept in memory is the problem.)

> So switching to "streams" is really a nice-to-have and more Node-like, but it will not solve the problem.

As I said, the script already uses streams, but not properly. Streams would solve the problem since they limit the amount of data being read from the files. Therefore, Node only needs to keep track of a small amount of data & callbacks.

> By the way, inserting millions of entries one by one is always a worse solution than inserting in batch mode (10,000 at once)!

With very limited memory (try running the script on a Raspberry Pi or a small VPS), properly implemented streams (working one by one) are still better suited than batch operations. But nevertheless, one can combine streams and batch operations. (;
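
A possible way to combine the two (a sketch under the same assumptions as above: `collection` is a MongoDB collection and rows arrive from a CSV parser stream in object mode):

```js
const { Writable } = require('stream');

// Buffer rows and flush them with insertMany. Most writes are acknowledged
// immediately; the write that completes a batch only calls back after
// insertMany resolves, so backpressure allows at most one batch in flight
// and at most `batchSize` rows buffered.
function batchedSink(collection, batchSize = 1000) {
  let batch = [];
  return new Writable({
    objectMode: true,
    write(row, _enc, callback) {
      batch.push(row);
      if (batch.length < batchSize) return callback();
      const rows = batch;
      batch = [];
      collection.insertMany(rows).then(() => callback(), callback);
    },
    final(callback) {
      if (batch.length === 0) return callback();
      collection.insertMany(batch).then(() => callback(), callback);
    }
  });
}

// usage sketch: csvParserStream.pipe(batchedSink(collection)).on('finish', done)
```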

balmoovel commented 8 years ago

As I have said before:

> However, in our tests this did not seem to be what caused the memory overflow; that rather seems to be caused by the callbacks attached to each insert operation.

We have tested this, and to our understanding reading the file into memory is not a problem on systems with gigabytes of memory. The problem is the millions of callbacks. If you think our assumption is wrong, or you can provide a faster and more stable fix yourself, then please do.

derhuerst commented 8 years ago

> We have tested this, and to our understanding reading the file into memory is not a problem on systems with gigabytes of memory.

V8 has a default memory limit of about 1.5 GB. Also, this script should work on a small VPS with just 512 MB of memory.
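
For reference, that limit can be raised per process, e.g. with something like the following (import.js stands in for whatever script runs the import); it buys headroom on a big machine but does nothing for a 512 MB VPS:

```
node --max-old-space-size=4096 import.js
```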

> If you think our assumption is wrong, or you can provide a faster and more stable fix yourself, then please do.

I'm not willing to rewrite this build script, but I'd offer help doing so. I can also recommend things like promises and co for making the script more readable.

You may also want to have a look at how my Berlin-specific GTFS build script works.

brendannee commented 8 years ago

@gerlacdt I tested your pull request and it worked.

I'm down to rewrite the download script with streams - it's about time it had an overhaul to be more readable.

gerlacdt commented 8 years ago

@brendannee

thx for merging!

Just for information: after the fix we discovered that we again had out-of-memory problems on Node versions < 4.4.x. (async.queue is really memory-hungry...)

So the best option would be rewriting the script with streams, although I don't know if this will solve all problems, because as far as I understand csv.parser.on("readable") already uses a stream.

Happy coding!

balmoovel commented 8 years ago

@brendannee: if you still run into memory issues with streams, have a look at the latest https://github.com/moovel/node-gtfs; it works a bit more stably using recursion. Good luck and thanks a lot for your work!

nlambert commented 8 years ago

For what it's worth, none of these solutions are working for me

http://www.stm.info/sites/default/files/gtfs/gtfs_stm.zip

FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - process out of memory

melisoner2006 commented 8 years ago

Hi, the proposed solutions above are not working for me either. I keep getting the same error: FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - process out of memory

balmoovel commented 8 years ago

@ those who cannot import their data using https://github.com/moovel/node-gtfs : can you post a link to the GTFS data you are trying to import?

melisoner2006 commented 8 years ago

Here is the link to the Chicago Transit Agency: http://www.transitchicago.com/downloads/sch_data/google_transit.zip

I found the link here: http://www.gtfs-data-exchange.com/agency/chicago-transit-authority/ (uploaded by cta-archiver on Apr 16, 2016).

I excluded calendar dates, fare attributes, fare rules, shapes, frequencies, and feed_info; it threw the error when processing the stop_times file.

Here is the link to the Bay Area Rapid Transit: http://www.bart.gov/dev/schedules/google_transit.zip

Found here: http://www.gtfs-data-exchange.com/agency/bay-area-rapid-transit/ (uploaded by bart-archiver on Apr 04, 2016 02:31). I excluded calendar dates, fare attributes, fare rules, shapes, frequencies, and feed_info; it threw the error when processing transfers.

balmoovel commented 8 years ago

I can confirm the issue - I did not find any super quick fix so far, sorry. I promise to look into it when I have some time but that can be weeks. :(

nlambert commented 8 years ago

Here is the gtfs file I was working with

http://www.amt.qc.ca/xdata/trains/google_transit.zip

senpai-notices commented 7 years ago

@brendannee @balmoovel Fatal error while processing a 3-million-line file (source: https://api.transport.nsw.gov.au/v1/publictransport/timetables/complete/gtfs). My experience is the same as @nlambert's:

> For what it's worth, none of these solutions are working for me

- #56
- https://github.com/moovel/node-gtfs
- --optimize_for_size --max_old_space_size=2000

brendannee commented 7 years ago

I just pushed an update that may help handle importing very large GTFS files.

Try out your large GTFS files and let me know what errors, if any, you get.

senpai-notices commented 7 years ago

Thanks @brendannee, it's working great!

brendannee commented 7 years ago

Closing this issue - please comment if you still have memory issues.