fsingletonthorn / EffectSizeScraping


Look for PDF when XML doesn't get returned from PubMed #20

Closed fsingletonthorn closed 5 years ago

fsingletonthorn commented 5 years ago

Probably use the FTP service: download the full .tar.gz and then delete it afterwards? It doesn't seem to be possible to download the file on its own. The other way to do it would be to use the bulk download section; the only issue is that the XML schema there seems to be somewhat different (I'm 95% sure that the current setup is robust to that issue, but it would need to be tested).

The `articles` object from reading_oa_file_list.R has the required information to download the packages from the FTP service, e.g. `articles$File[1]` gives "oa_package/d0/51/PMC29100.tar.gz". The FTP service URL is ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/, which means "ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_package/d0/51/PMC29100.tar.gz" downloads the full package with the images and .nxml files. A sketch of this approach is below.

It would be helpful to also figure out first whether the labelling function can work.
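A minimal sketch of the download-and-clean-up step, assuming the `articles` object from reading_oa_file_list.R is already in scope (the base URL and example path come from the comment above; the temp-file handling is just one way to do it):

```r
# Build the full FTP URL from the package path in the OA file list
base_url <- "ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/"
package_url <- paste0(base_url, articles$File[1])  # e.g. ".../oa_package/d0/51/PMC29100.tar.gz"

# Download the .tar.gz to a temporary file, extract it, then delete the archive
archive <- tempfile(fileext = ".tar.gz")
download.file(package_url, destfile = archive, mode = "wb")

extract_dir <- tempfile()
untar(archive, exdir = extract_dir)
unlink(archive)  # remove the archive once the contents are extracted
```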

fsingletonthorn commented 5 years ago

Taking a look at a few, it should be possible to download and extract the package, find the .pdf for those articles that do not have XML text, and do a text-scrape pass through it. It may not be possible to get section headers reliably, though.
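A rough sketch of locating and scraping the PDF once the package is extracted, assuming the `extract_dir` from the download step above. `pdftools::pdf_text()` is one option for the text pass; note it returns plain page text, which is consistent with section headers being hard to recover reliably:

```r
library(pdftools)

# Find any .pdf inside the extracted package directory
pdf_path <- list.files(extract_dir, pattern = "\\.pdf$",
                       recursive = TRUE, full.names = TRUE)

if (length(pdf_path) > 0) {
  # pdf_text() returns one character string per page; collapse into one string
  full_text <- paste(pdf_text(pdf_path[1]), collapse = "\n")
} else {
  full_text <- NA_character_  # no PDF in the package either
}
```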

fsingletonthorn commented 5 years ago

This is built, but I still need to build a test for it.