homtec closed this issue 7 years ago
See #45.
should be fixed with: https://github.com/brendannee/node-gtfs/pull/56
`node-gtfs` should fully use streams in a non-blocking way. Splitting into chunks seems like a workaround.

@derhuerst: can you explain in more detail how using streams would solve the memory overflow issue?
My understanding is that if we use file streams instead of `parse`, this will only reduce the amount of memory consumed by the file contents. However, in our tests this did not seem to be what caused the memory overflow. It rather seems to be caused by the callbacks attached to each `insert` operation.
Thus the solution implemented by @gerlacdt is not actually so much the division into chunks, but the use of `insertMany`. The division into chunks is only done because the Node.js driver internally runs into the same issue if a chunk is too big.
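A minimal sketch of that approach (the helper names are made up for illustration; `collection` stands for an already-connected MongoDB collection from the official driver):

```javascript
// Split an array of parsed csv rows into fixed-size chunks.
function chunk (rows, size) {
  const chunks = []
  for (let i = 0; i < rows.length; i += size) {
    chunks.push(rows.slice(i, i + size))
  }
  return chunks
}

// One insertMany() call (one callback/promise) per 10,000 rows,
// instead of one insert() callback per row.
async function insertInBatches (collection, rows, size = 10000) {
  for (const batch of chunk(rows, size)) {
    await collection.insertMany(batch)
  }
}
```

The chunk size keeps each `insertMany` below the driver's internal limits, as described above.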
Node streams are basically just event emitters that "hold" data. Streams can be connected to other streams (see `.pipe()`). One can also read from streams manually (the `data` event). This alone is nothing special when dealing with large amounts of data (larger than memory), but streams have a backpressure mechanism, which keeps the amount of data "held" in memory low. If one uses streams properly (no blocking operations, no non-flowing stream usage), Node.js scripts can deal with any amount of data. Consider the Stream Handbook for best practices.
The build script, however, performs synchronous operations, and all the data "held" in a stream is read into memory at once. This is why I think the Node process runs out of memory.
I think @balmoovel is right. The out-of-memory error was not caused by loading the files completely into memory. It was caused by the number of callbacks created in a loop.
Pseudo-algorithm:

```
for each file in (agency.txt, stops.txt, stop_times.txt):
    load file
    parse csv lines
    for each line:
        mongodb.insert(line, callback)
```

For stop_times.txt this opened 2.6 million callbacks asynchronously! Too many for Node...
So switching to "streams" is really nice-to-have and more node-like but will not solve the problem.
By the way, inserting millions of entries one by one is always worse than inserting in batch mode (e.g. 10,000 at once)!
> I think @balmoovel is right. The out-of-memory error was not caused by loading the files completely into memory. It was caused by the number of callbacks created in a loop.
I see this as a problem. For every file, the script loads the whole file (or the part of it that could be loaded from disk) into memory. The data gets transformed and then written into the db. But as long as the communication with the db isn't done, it's all still in memory.
Since the script loads the data synchronously (via a `while` loop), there's no chance for the db layer to actually write stuff to the db while it receives more and more data.
> For stop_times.txt this opened 2.6 million callbacks asynchronously! Too many for Node...
AFAIK that's not a problem. (Again, the data that is kept in-memory is a problem.)
> So switching to "streams" is really nice-to-have and more node-like but will not solve the problem.
As I said, the script already uses streams, but not properly. Streams would solve the problem since they limit the amount of data being read from the files. Therefore, Node only needs to keep track of a small amount of data & callbacks.
> By the way inserting millions of entries one-by-one is always a worse solution than inserting in batch-mode (10,000 at once)!
With very limited memory (try running the script on a Raspberry Pi or a small VPS), properly implemented streams (working one by one) are still better suited than batch operations. But nevertheless, one can combine streams and batch operations. (;
As I have said before:

> However in our tests this did not seem to be what caused the memory overflow. That rather seems to be caused by the callbacks attached to each insert operation.
We have tested this, and to our understanding reading the file into memory is not a problem on systems with gigabytes of memory. The problem is the millions of callbacks. If you think that either our assumption is wrong or that you can provide a faster and more stable fix yourself, then please do.
> We have tested this, and to our understanding reading the file into memory is not a problem on systems with gigabytes of memory.
V8 has a default memory limit of 1.5GB. Also, this script should work on a small VPS with just 512MB of memory.
> If you think that either our assumption is wrong or that you can provide a faster and more stable fix yourself then please do.
I'm not willing to rewrite this build script, but I'd offer to help doing so. I can also recommend things like promises and `co` for making the script more readable.
You may also want to have a look at how my Berlin-specific GTFS build script works.
@gerlacdt I tested your pull request and it worked.
I'm down to rewrite the download script with streams; it's about time it had an overhaul to be more readable.
@brendannee thx for merging!
Just for information: after the fix, we discovered that we again had out-of-memory errors on Node versions < 4.4.x. (`async.queue` is really memory-hungry...)
So the best solution would be rewriting the script with streams, although I don't know whether this will solve all problems, because as far as I understand, `csv.parser.on("readable")` already uses a stream.
Happy coding!
@brendannee: if you should still run into memory issues with streams, have a look at the latest https://github.com/moovel/node-gtfs; it works a bit more stably using recursion. Good luck and thanks a lot for your work!
For what it's worth, none of these solutions are working for me. With http://www.stm.info/sites/default/files/gtfs/gtfs_stm.zip I get:

```
FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - process out of memory
```
Hi, the proposed solutions above are not working for me either. I keep getting the same error:

```
FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - process out of memory
```
To those who cannot import their data using https://github.com/moovel/node-gtfs: can you post a link to the GTFS data you are trying to import?
Here is the link to the Chicago Transit Agency: http://www.transitchicago.com/downloads/sch_data/google_transit.zip
I found the link here: http://www.gtfs-data-exchange.com/agency/chicago-transit-authority/ (uploaded by cta-archiver on Apr 16 2016).
I excluded calendar dates, fare attributes, fare rules, shapes, frequencies, and feed_info; the error was thrown while processing the stop times file.
Here is the link to the Bay Area Rapid Transit: http://www.bart.gov/dev/schedules/google_transit.zip
Found here: http://www.gtfs-data-exchange.com/agency/bay-area-rapid-transit/ (uploaded by bart-archiver on Apr 04 2016 02:31). I excluded calendar dates, fare attributes, fare rules, shapes, frequencies, and feed_info; the error was thrown while processing transfers.
I can confirm the issue - I did not find any super quick fix so far, sorry. I promise to look into it when I have some time but that can be weeks. :(
Here is the gtfs file I was working with
@brendannee @balmoovel I get the same fatal error while processing a 3-million-line file. Source: https://api.transport.nsw.gov.au/v1/publictransport/timetables/complete/gtfs

My experience is the same as @nlambert's:

> For what it's worth, none of these solutions are working for me

Things I have tried: #56, https://github.com/moovel/node-gtfs, and running node with --optimize_for_size --max_old_space_size=2000.
I just pushed an update that may help handle importing very large GTFS files.
Try out your large GTFS files and let me know what errors, if any, you get.
Thanks @brendannee, it's working great!
Closing this issue - please comment if you still have memory issues.
Importing a GTFS file always ends up with:

```
FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - JavaScript heap out of memory
ns=0x16b2fe47a1e1 <String[14]: gtfs.stoptimes>, ops=0x24d80f184...
Abort trap: 6
```

I also switched to a GTFS file with only 18MB, same result.