ContentMine / quickscrape

A scraping command line tool for the modern web

Process stuck on a URL #44

Closed by rossmounce 9 years ago

rossmounce commented 9 years ago

The previous issue arose from trying to diagnose this problem (the --url / -url thing wasn't the problem here). Here's my bash:

while read i ; do quickscrape  --url $i --ratelimit 20 --scraper jscrapers/scrapers/ijsem.json  --output ./ijsem --outformat bibjson | tee log.log ; done <ijsemarticles.txt
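
For reference, here is a sketch of the same loop with the URL quoted and the log appended to rather than truncated on each pass (tee without -a rewrites log.log at the start of every iteration, which would explain why tail only shows the most recent URL); the quickscrape flags are unchanged:

# Same loop, sketched with the URL quoted and the log appended to
# instead of overwritten on each iteration.
while read -r i ; do
  quickscrape --url "$i" --ratelimit 20 \
              --scraper jscrapers/scrapers/ijsem.json \
              --output ./ijsem --outformat bibjson \
    | tee -a log.log
done < ijsemarticles.txt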

It basically hung, without crashing or exiting, at around the 1000th line of the ijsemarticles URL list. I'll upload the list and link to it here in a bit.

$ tail log.log 
info: quickscrape launched with...
info: - URL: http://ijs.sgmjournals.org/content/64/Pt_5/1775.full
info: - Scraper: jscrapers/scrapers/ijsem.json
info: - Rate limit: 20 per minute
info: - Log level: info
info: urls to scrape: 1
info: processing URL: http://ijs.sgmjournals.org/content/64/Pt_5/1775.full
info: [scraper]. URL rendered. http://ijs.sgmjournals.org/content/64/Pt_5/1775.full.
info: [scraper]. URL rendered. http://ijs.sgmjournals.org/content/64/Pt_5/1775/suppl/DC1.

I used Ctrl-C to stop the process. It would be good to print the current time to screen in one of the loglevel settings so I know when it hung; as it stands I have no idea when it stopped chugging away.
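
In the meantime, a shell-side workaround (just a sketch, not a quickscrape option) is to stamp every log line with the current time before it reaches the file; the timestamps are only as accurate as output buffering allows:

# Sketch: prefix each line of quickscrape output with a timestamp from the
# shell, assuming GNU date is available; quickscrape itself is unchanged.
while read -r i ; do
  quickscrape --url "$i" --ratelimit 20 \
              --scraper jscrapers/scrapers/ijsem.json \
              --output ./ijsem --outformat bibjson \
    | while IFS= read -r line ; do
        printf '%s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$line"
      done \
    | tee -a log.log
done < ijsemarticles.txt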

rossmounce commented 9 years ago

articles list here: https://github.com/rossmounce/quickscrapeijsem/blob/master/ijsemarticles.txt

It's interesting that the log file only goes up to 64/1775, yet the output directory suggests it got one or two entries further down the list: subfolders for 64/1782 and 64/1802 are present.

rossmounce commented 9 years ago

The problem definitely arose at 64/Pt_5/1775: that output folder is empty except for a damaged/incomplete 83 kB PDF, 56978.pdf (no .json files or anything else).

blahah commented 9 years ago

It seems there is no particular problem with scraping that URL; it worked fine for me. So there is a less well-defined problem where quickscrape can hang, most likely to do with how network issues are handled, which we could do better at.

I've made issues to track that (see above), so I'll close this one.
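
As a stopgap against indefinite hangs, the per-URL call can be wrapped in coreutils timeout so a stuck URL is killed and the loop moves on. This is only a sketch: the 300-second limit is an arbitrary assumption and failed-urls.log is a made-up filename.

# Sketch: kill any single quickscrape run after 5 minutes (assumes GNU
# coreutils timeout). failed-urls.log is a hypothetical file collecting
# URLs that timed out or otherwise exited with an error.
while read -r i ; do
  if ! timeout 300 quickscrape --url "$i" --ratelimit 20 \
         --scraper jscrapers/scrapers/ijsem.json \
         --output ./ijsem --outformat bibjson ; then
    echo "$(date '+%Y-%m-%d %H:%M:%S') gave up on $i" >> failed-urls.log
  fi
done < ijsemarticles.txt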