ContentMine / quickscrape

A scraping command line tool for the modern web
MIT License

Hanging seemingly randomly when downloading a list of URLs #74

Open robintw opened 8 years ago

robintw commented 8 years ago

I am trying to run quickscrape to download a large number of papers from PLOS One. I've got a list of URLs to download and have run quickscrape as:

quickscrape -r /mnt/cm-volume/content-mine/PLOS_DOIs_2015.txt -s /mnt/cm-volume/content-mine/journal-scrapers/scrapers/plos.json -o /mnt/cm-volume/content-mine/plos-2015-new2/ -l debug

This seems to work fine for a while, but then the process just hangs after downloading a fulltext.xml file. For example, the end of the output (with debug logging turned on) looks like this:

data: [scraper]. element captured. figures_image. article/figure/image?download&size=large&id=info:doi/10.1371/journal.pone.0114250.g001.
data: [scraper]. element captured. figures_image. article/figure/image?download&size=large&id=info:doi/10.1371/journal.pone.0114250.g002.
data: [scraper]. element captured. figures_image. article/figure/image?download&size=large&id=info:doi/10.1371/journal.pone.0114250.g003.
data: [scraper]. element captured. figures_image. article/figure/image?download&size=large&id=info:doi/10.1371/journal.pone.0114250.g004.
data: [scraper]. element captured. figures_image. article/figure/image?download&size=large&id=info:doi/10.1371/journal.pone.0114250.g005.
data: [scraper]. element captured. figures_image. article/figure/image?download&size=large&id=info:doi/10.1371/journal.pone.0114250.g006.
data: [scraper]. element captured. figures_image. article/figure/image?download&size=large&id=info:doi/10.1371/journal.pone.0114250.g007.
debug: [scraper]. element results. figures_image. article/figure/image?download&size=large&id=info:doi/10.1371/journal.pone.0114250.g001,article/figure/image?download&size=large&id=info:doi/10.1371/journal.pone.0114250.g002,article/figure/image?download&size=large&id=info:doi/10.1371/journal.pone.0114250.g003,article/figure/image?download&size=large&id=info:doi/10.1371/journal.pone.0114250.g004,article/figure/image?download&size=large&id=info:doi/10.1371/journal.pone.0114250.g005,article/figure/image?download&size=large&id=info:doi/10.1371/journal.pone.0114250.g006,article/figure/image?download&size=large&id=info:doi/10.1371/journal.pone.0114250.g007.
data: [scraper]. element capture failed. license.
debug: [scraper]. selector had no results. //span[contains(concat(' ', normalize-space(@class), ' '), ' license-p ')]. license.
debug: [scraper]. element results. license. .
data: [scraper]. element capture failed. copyright.
debug: [scraper]. selector had no results. //span[starts-with(@itemprop, 'copyright')]/... copyright.
debug: [scraper]. element results. copyright. .
info: [scraper]. download started. fulltext.xml.

I can't see any errors here, and if I try running that particular URL (http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0114250) by itself, it works fine and downloads both the fulltext.xml and the fulltext.pdf.

Does anyone have any idea what might be going on here? It is making it really hard to get a large corpus of articles to mine.

blahah commented 8 years ago

If you're downloading from PLOS, you should really use getpapers instead, or download the bulk archive. Scraping is a last resort - it's more server-intensive, less reliable, and much slower!

But ignoring that, I'm not sure exactly what's happening from the output you've given. Is it always particular URLs that it hangs on, or is it seemingly random?

robintw commented 8 years ago

It is seemingly random, and all of the URLs that it seems to hang on then work fine if I run them individually.
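
In the meantime I might drive it one URL at a time from the shell with a hard kill, so that a single hang can't stall the whole batch. An untested sketch (the 300-second limit is arbitrary, and timeout here is the GNU coreutils tool):

#!/bin/bash
# Run quickscrape once per URL; kill any run that exceeds 5 minutes
# and log that URL so it can be retried later.
while read -r url; do
  timeout 300 quickscrape \
    -u "$url" \
    -s /mnt/cm-volume/content-mine/journal-scrapers/scrapers/plos.json \
    -o /mnt/cm-volume/content-mine/plos-2015-new2/ \
    || echo "$url" >> failed-urls.txt
done < /mnt/cm-volume/content-mine/PLOS_DOIs_2015.txt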

I wasn't aware that getpapers could grab large volumes of PLOS papers - as far as I could see from the documentation it could only search sources like EuropePMC, and I'm interested in getting non-biomedical papers from PLOS too (basically I'm trying to get all PLOS papers from 2015). Is there a way of doing this with getpapers?

Also, I hadn't heard of the PLOS bulk archive, and can't seem to find much about it on Google. Do you know where I could download a bulk archive from?

blahah commented 8 years ago

EuropePMC is not only for biomedical articles (the name is misleading). All of PLOS is there: http://europepmc.org/search?query=%28PUBLISHER:%22Public+Library+of+Science%22%29&page=1.

To get all PLOS papers from 2015 you would do:

--query '(PUBLISHER:"Public Library of Science") AND (FIRST_PDATE:[2015-01-01 TO 2015-12-31])'
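
So the full invocation would be something like (flags as in the getpapers README; pick your own output directory):

getpapers --query '(PUBLISHER:"Public Library of Science") AND (FIRST_PDATE:[2015-01-01 TO 2015-12-31])' --outdir plos-2015 -x -p

where -x fetches the fulltext XML and -p the PDFs.
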
robintw commented 8 years ago

Ah that's great - thank you. I have that running on my server now :-)

It'd still be good to work out what is going on with quickscrape at some point, as most of the journals I'm trying to scrape aren't available as easily as PLOS... I just have no idea where to start with the debugging... maybe I need to drop print statements throughout the code and see whether I can pin down where it hangs.
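
Something like this is what I have in mind (a rough sketch using plain Node http; the names here are made up for illustration and won't match quickscrape's actual download code, which I haven't read yet):

// hypothetical instrumentation: log every stage of a single download
// so that whichever stage hangs shows up in the output
var http = require('http');
var fs = require('fs');

function loggedDownload(url, dest) {
  console.log('download starting:', url);
  http.get(url, function (res) {
    console.log('response received:', url, res.statusCode);
    res.pipe(fs.createWriteStream(dest));
    res.on('end', function () {
      console.log('download finished:', url);
    });
  }).on('error', function (err) {
    console.log('download failed:', url, err.message);
  });
}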

robintw commented 8 years ago

Also, I'm struggling to download very large numbers of papers with getpapers (I've commented on an issue, and think I may have found a workaround) - so I'm intrigued: what was the bulk archive you mentioned?

blahah commented 8 years ago

I don't have time to debug today, I'm afraid. If you go to the PubMed FTP, you can find a bunch of archives called A-C...tar.gz and so on. The one whose range covers P will contain all the PLOS papers, one archive per journal.
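
If memory serves you can browse the listing with something like the following (the host is NCBI's; the exact paths and file names change, so check what's actually there):

curl ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/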

petermr commented 8 years ago

The following URL hangs quickscrape:

http://www.tandfonline.com/doi/full/10.13039/501100005071

It's an unresolvable URL ("The requested article is not currently available on this site."), but quickscrape should time out and move on.
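
A per-request idle timeout would presumably let it do that - an illustrative sketch in plain Node http (the 30-second value and the function name are made up, and quickscrape's real download code may look quite different):

// hypothetical guard: abort any download whose socket goes idle for 30s,
// surfacing an error instead of hanging forever
var http = require('http');
var fs = require('fs');

function downloadWithTimeout(url, dest, cb) {
  var done = false;
  function finish(err) {
    if (!done) { done = true; cb(err); }
  }
  var req = http.get(url, function (res) {
    var out = fs.createWriteStream(dest);
    res.pipe(out);
    out.on('finish', function () { finish(null); });
  });
  // 30s of socket inactivity triggers abort; the abort surfaces as 'error'
  req.setTimeout(30000, function () { req.abort(); });
  req.on('error', finish);
}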