ContentMine / quickscrape

A scraping command line tool for the modern web
MIT License

FATAL ERROR: CALL_AND_RETRY_2 Allocation failed - process out of memory #42

Open rossmounce opened 9 years ago

rossmounce commented 9 years ago

A recurrence of https://github.com/ContentMine/quickscrape/issues/9?

I fed it a list of ~3000 PNAS full-text URLs last night and it choked after just 255. No crash file was generated in /var/crash/.

(quickscrape 0.4.2)

info: processing URL: http://www.pnas.org/content/100/19/10866.full
info: [scraper]. URL rendered. http://www.pnas.org/content/100/19/10866.full.
FATAL ERROR: CALL_AND_RETRY_2 Allocation failed - process out of memory
Aborted (core dumped)
$

input URLs:

http://www.pnas.org/content/100/19/10860.full (262nd)
http://www.pnas.org/content/100/19/10866.full (263rd)
http://www.pnas.org/content/100/19/10872.full (264th)

The output folder for 10860 was created and is full of files as expected; the folder for 10866 was created but is empty.

The other slightly worrying thing is that it progressed to the 263rd URL, yet there are only 255 output folders in the output directory. Is there a built-in option to keep a logfile of the scrape?

input command:

$ quickscrape --urllist 3000PNAS.txt --scraper journal-scrapers/scrapers/pnas.json --output cm --outformat bibjson --ratelimit 20

It's nothing to do with that particular URL: it works fine when quickscraping that one individually (I've tried it).

blahah commented 9 years ago

Hmm, we are leaking memory somewhere when running with --urllist. I'll do some tracing to track it down.

Regarding keeping logs: just redirect the output to a file. Different log levels are written to different handles: info and data events go to STDOUT, while debug, warn and error go to STDERR. That means you can save the different streams separately if you want:

quickscrape --urllist blah.txt --scraper blah.json > log.txt # saves info and data to log.txt, writes error, debug and warn to terminal
quickscrape --urllist blah.txt --scraper blah.json 2> log.txt # saves error, debug and warn to log.txt, writes info and data to terminal
quickscrape --urllist blah.txt --scraper blah.json &> log.txt # saves all logs to log.txt

and if you want to both see and save the logs, you can use tee:

quickscrape --urllist blah.txt --scraper blah.json 2>&1 | tee log.txt

blahah commented 9 years ago

I'm tracing this leak now.

Using memwatch, I've identified that a pretty serious leak is happening:

Memory leak detected:  { start: Sun May 17 2015 18:12:33 GMT+0100 (BST),
  end: Sun May 17 2015 18:15:30 GMT+0100 (BST),
  growth: 40908664,
  reason: 'heap growth over 5 consecutive GCs (2m 57s) - 793.5 mb/hr' }
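
For reference, that report comes from memwatch's leak event, which fires after the heap has grown over several consecutive GCs. A minimal sketch of wiring it up (this assumes the node-memwatch package; the handlers shown are illustrative, not quickscrape's actual instrumentation):

var memwatch = require('memwatch');

// Fires when the heap has grown over several consecutive GCs,
// with an info object like the one pasted above.
memwatch.on('leak', function (info) {
  console.error('Memory leak detected:', info);
});

// Per-GC heap statistics, useful for watching the trend as URLs are processed.
memwatch.on('stats', function (stats) {
  console.error('heap after GC:', stats.current_base, 'bytes');
});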

Now the task is to figure out what's causing it. Profiling wasn't much help, since pretty much every variable created anywhere in the code is failing to be garbage collected. That suggests something high-level, perhaps the way I iterate through the URLs using recursive callbacks. I'm going to try a different looping approach (not trivial because of the rate limiting).
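
For what it's worth, one shape that keeps the rate limit but lets each iteration's stack and closures be released before the next starts is to schedule every URL from a timer callback rather than recursing directly. A rough, hypothetical sketch (none of these names come from quickscrape; scrapeOne is assumed to scrape a single URL and call back when finished):

// Hypothetical sketch, not quickscrape's actual code.
function scrapeAll(urls, ratelimit, scrapeOne, done) {
  var delay = 60000 / ratelimit; // assuming ratelimit = requests per minute
  var i = 0;
  function next() {
    if (i >= urls.length) return done();
    var url = urls[i];
    i += 1;
    scrapeOne(url, function (err) {
      if (err) console.error('error scraping', url, err);
      // Schedule the next URL on a timer instead of recursing synchronously,
      // so the current call stack unwinds before the next scrape begins.
      setTimeout(next, delay);
    });
  }
  next();
}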