robintw opened this issue 8 years ago
If you're downloading from PLOS you should really use getpapers instead, or download the bulk archive. Scraping is a last resort - it's more server-intensive and less reliable. And much slower!
But ignoring that, I'm not sure exactly what's happening from the output you've given. Is it always particular URLs that it hangs on, or is it seemingly random?
It is seemingly random, and all of the URLs that it seems to hang on then work fine if I run them individually.
I wasn't aware that getpapers could grab large volumes of PLOS papers - as far as I could see from the documentation it could only search things like EuropePMC, and I'm interested in getting non-biomedical-related papers from PLOS too (basically I'm trying to get all PLOS papers from 2015). Is there a way of doing this with getpapers?
Also, I hadn't heard of the PLOS bulk archive, and can't seem to find much about it on Google. Do you know where I could download a bulk archive from?
EuropePMC is not only for biomedical articles (the name is misleading). All of PLOS is there: http://europepmc.org/search?query=%28PUBLISHER:%22Public+Library+of+Science%22%29&page=1.
To get all PLOS papers from 2015 you would do:
--query '(PUBLISHER:"Public Library of Science") AND (FIRST_PDATE:[2015-01-01 TO 2015-12-31])'
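If you end up scripting around getpapers for several years, the date-range part of that query can be built programmatically; a minimal Python sketch (the `PUBLISHER` and `FIRST_PDATE` field names come straight from the query above):

```python
def plos_year_query(year):
    """Build a EuropePMC query for all PLOS papers first published in `year`."""
    publisher = '(PUBLISHER:"Public Library of Science")'
    dates = f'(FIRST_PDATE:[{year}-01-01 TO {year}-12-31])'
    return f'{publisher} AND {dates}'

print(plos_year_query(2015))
# (PUBLISHER:"Public Library of Science") AND (FIRST_PDATE:[2015-01-01 TO 2015-12-31])
```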
Ah that's great - thank you. I have that running on my server now :-)
It'd still be good to try and work out what is going on with quickscrape sometime, as most of the journals I'm trying to scrape aren't available as easily as PLOS... I just have no idea where to start with the debugging... maybe I just need to drop print statements everywhere in the code and see if I can work out where it hangs.
Also, I'm struggling to download very large numbers of papers with getpapers (I've commented on an issue, and think I may have found a workaround) - so I'm intrigued: what was the bulk archive you mentioned?
I don't have time to debug today, I'm afraid. If you go to the PubMed FTP, you can find a bunch of archives called A-C...tar.gz
and so on. The one with a range that covers P will contain all PLOS papers, one archive per journal.
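If those bulk archives group papers by journal internally (an assumption about the layout; the archive filename below is illustrative only, not a real path), you can pull out just the PLOS material without unpacking the whole thing. A sketch using Python's standard tarfile module:

```python
import tarfile

def extract_journal(archive_path, prefix, dest="."):
    """Extract only the members of a .tar.gz whose path starts with `prefix`."""
    with tarfile.open(archive_path, "r:gz") as tar:
        members = [m for m in tar.getmembers() if m.name.startswith(prefix)]
        tar.extractall(path=dest, members=members)
    return [m.name for m in members]

# Hypothetical usage, assuming per-journal directories inside the archive:
# extract_journal("articles.O-Z.tar.gz", "PLoS_ONE/", dest="plos")
```

Filtering `members` before `extractall` avoids writing thousands of unrelated files to disk.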
The following URL hangs quickscrape:
http://www.tandfonline.com/doi/full/10.13039/501100005071
It's an unresolvable URL ("The requested article is not currently available on this site."), but quickscrape should time out and move on.
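quickscrape is a Node tool, so this is only a language-agnostic sketch of the behaviour being asked for: wrap each per-URL scrape in a hard deadline and move on when it fires (Python purely for illustration; `scrape_one` is a placeholder for whatever does the fetch):

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

def scrape_with_timeout(scrape_one, url, timeout):
    """Call scrape_one(url), giving up after `timeout` seconds."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(scrape_one, url)
    try:
        return future.result(timeout=timeout)
    except FutureTimeout:
        return None  # hung or too slow: record nothing and move to the next URL
    finally:
        pool.shutdown(wait=False)  # don't block on a wedged worker thread
```

Note the caveat: a truly wedged request still occupies its worker thread until the process exits, which is why batch scrapers often prefer a separate process per fetch.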
I am trying to run quickscrape to download a large number of papers from PLOS One. I've got a list of URLs to download and have run quickscrape as:
This seems to work fine for a while, but then the process just hangs after downloading a fulltext.xml file. For example, the end of the output (with debug logging turned on) looks like this:

I can't see any errors here, and if I try to run that particular URL (http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0114250) by itself it works fine and downloads both the fulltext.xml and fulltext.pdf.

Does anyone have any idea what might be going on here? It is making it really hard to get a large corpus of articles to mine.
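One pragmatic workaround while the hang is undiagnosed: drive quickscrape one URL at a time from a wrapper that kills and skips any run exceeding a deadline. A sketch (the quickscrape flags in the commented loop are assumptions from its README; check `quickscrape --help`):

```python
import subprocess

def run_with_deadline(cmd, seconds):
    """Run a command; return True on clean exit, False if it failed or overran."""
    try:
        proc = subprocess.run(cmd, timeout=seconds)
        return proc.returncode == 0
    except subprocess.TimeoutExpired:
        return False  # the run hung on this URL: kill it, skip, and carry on

# Hypothetical batch loop over a URL list file:
# for url in open("urls.txt"):
#     ok = run_with_deadline(["quickscrape", "--url", url.strip(),
#                             "--scraper", "plos.json", "--output", "out"], 120)
```

`subprocess.run` kills the child process when the timeout expires, so one bad URL can no longer stall the whole batch.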