Investigate programmatic download of PMC full-text articles

jankrepl commented 3 years ago

PubMed Central (PMC) has a huge (6 million+) collection of full-text records (articles?). They seem to have an FTP server that one can use to download the articles.

This issue should answer the following questions.

[ ] Is using the FTP the recommended way of bulk downloading articles from PMC?
[ ] Would web scraping make sense (could it give us access to more articles than the FTP)?
[ ] What portion of the PMC records can be downloaded via the FTP? Potentially useful link
[ ] Are there any limitations (number of downloads per day, legal restrictions, ...)? From the FAQ:

Although access to the material in PMC is free, the use of the material still is subject to the copyright and/or related license terms of the respective authors or publishers. See the PMC Copyright Notice for more information.

You may NOT use any kind of automated process to download articles in bulk from the main PMC site. PMC will block the access of any user who is found to be violating this policy.

However, there are a few Text Mining Collections within PMC where bulk retrieval of files for text mining and other purposes is permitted. License terms may vary by collection or even within a collection. To download a collection in PMC, you must use a designated service, such as the PMC FTP service. See the full listing of APIs that you can use for accessing PMC data on the PMC Developer Resources page.

[ ] How did the creators of the Kaggle Covid-19 dataset download data from PMC? And can we fully recreate the full-text part of the Kaggle dataset ourselves?
[ ] Are there some existing tools (e.g. on github) that simplify the FTP download + parsing?
- https://github.com/billgreenwald/Pubmed-Batch-Download
- https://gitlab.com/ncbipy/entrezpy
[ ] How much custom code would need to be written (both for the download and for the potential parsing of the files, presumably XML)? Can we reuse existing bluesearch code (custom-made for Kaggle data) to deal with raw PMC data?
[ ] Should we make it a part of our source code?
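
To give a rough feel for the amount of custom code on the download side, here is a minimal sketch that turns lines of an FTP file list into download URLs. The FTP base URL, the tab-separated column layout (package path, citation, PMC ID, optional PMID, license) and the sample rows are all assumptions for illustration; they should be checked against the real file list before relying on this.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator

# Assumption: base URL of the PMC FTP service (verify before use).
FTP_BASE = "ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/"


@dataclass(frozen=True)
class OAEntry:
    path: str      # package path relative to FTP_BASE
    citation: str  # journal citation string
    pmcid: str     # accession ID, e.g. "PMC0000001"
    license: str   # license label, empty if the column is missing


def parse_file_list(lines: Iterable[str]) -> Iterator[OAEntry]:
    """Yield one entry per data row, skipping the header and malformed rows."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Assumed layout: the third column holds the PMC ID.
        if len(fields) < 3 or not fields[2].startswith("PMC"):
            continue
        license_label = fields[-1] if len(fields) > 3 else ""
        yield OAEntry(fields[0], fields[1], fields[2], license_label)


def download_url(entry: OAEntry) -> str:
    """Build the full URL of the package for one entry."""
    return FTP_BASE + entry.path


# Made-up sample rows in the assumed layout (not real PMC records).
SAMPLE = [
    "2021-03-01 12:00:00\n",
    "oa_package/aa/bb/PMC0000001.tar.gz\tJ Example. 2020; 1(1):1-2\tPMC0000001\tPMID:1\tCC BY\n",
]

entries = list(parse_file_list(SAMPLE))
for entry in entries:
    print(entry.pmcid, download_url(entry))
```

The actual transfer could then be done with any FTP client; only the parsing is shown here because it is the part that depends on the file format.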

EmilieDel commented 3 years ago

How did the creators of the Kaggle Covid-19 dataset download data from PMC? And can we fully recreate the full-text part of the Kaggle dataset ourselves?

Here is a paper that could be of interest. Some interesting takeaways (directly from the paper):

Papers in CORD-19 are sourced from PubMedCentral (PMC), PubMed, the World Health Organization’s Covid-19 Database, and preprint servers bioRxiv, medRxiv, and arXiv.
Papers that match on given keywords (coronavirus, ...) in their title, abstract, or body text are included in the dataset.
They clustered papers if they overlap on any of the following identifiers: {doi, pmc_id, pubmed_id, arxiv_id, who_covidence_id, mag_id}.
Regarding text processing, they converted PDFs to TEI XML using Grobid, and then to S2ORC JSON (a format for representing scientific paper full text). They also parsed JATS XML into that same format using a custom parser.
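
The clustering on overlapping identifiers can be sketched with a union-find structure. This is only an illustration of the idea, not the CORD-19 authors' implementation; the function name `cluster_papers` and the dict-based paper records are hypothetical.

```python
from collections import defaultdict

# Identifier fields listed in the paper.
ID_FIELDS = ("doi", "pmc_id", "pubmed_id", "arxiv_id", "who_covidence_id", "mag_id")


def cluster_papers(papers):
    """Group paper indices that share at least one identifier value."""
    parent = list(range(len(papers)))

    def find(i):
        # Follow parent links to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[max(root_i, root_j)] = min(root_i, root_j)

    first_seen = {}  # (field, value) -> index of the first paper carrying it
    for idx, paper in enumerate(papers):
        for field in ID_FIELDS:
            value = paper.get(field)
            if not value:
                continue
            key = (field, value)
            if key in first_seen:
                union(idx, first_seen[key])
            else:
                first_seen[key] = idx

    clusters = defaultdict(list)
    for idx in range(len(papers)):
        clusters[find(idx)].append(idx)
    return sorted(clusters.values())


# Toy records: the first three are linked through pmc_id and pubmed_id.
papers = [
    {"doi": "10.1/a", "pmc_id": "PMC1"},
    {"pmc_id": "PMC1", "pubmed_id": "11"},
    {"pubmed_id": "11"},
    {"doi": "10.1/b"},
]
print(cluster_papers(papers))
```

Note that the overlap is transitive here: papers 0 and 2 share no identifier directly but end up in the same cluster through paper 1. Handling of conflicting identifiers within a cluster is not covered by this sketch.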

PMC page for covid-19 results.
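
For the JATS XML side, a custom parser does not have to be large. Below is a minimal sketch using only the standard library, assuming the usual JATS layout with `front/article-meta` and `body` elements. The function names and the toy document are hypothetical, and real files may need extra handling (sections, figures, tables, references, entities).

```python
import xml.etree.ElementTree as ET


def text_of(element):
    """Return the whitespace-normalized text of an element, or '' if missing."""
    if element is None:
        return ""
    return " ".join("".join(element.itertext()).split())


def parse_jats(xml_string):
    """Extract title, abstract paragraphs and body paragraphs from JATS XML."""
    root = ET.fromstring(xml_string)
    meta = "front/article-meta"
    return {
        "title": text_of(root.find(meta + "/title-group/article-title")),
        "abstract": [text_of(p) for p in root.findall(meta + "/abstract//p")],
        "body": [text_of(p) for p in root.findall("body//p")],
    }


# Toy document following the assumed structure (not a real article).
SAMPLE_XML = """
<article>
  <front><article-meta>
    <title-group><article-title>A <italic>toy</italic> title</article-title></title-group>
    <abstract><p>Short abstract.</p></abstract>
  </article-meta></front>
  <body><sec><title>Intro</title>
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
  </sec></body>
</article>
"""

parsed = parse_jats(SAMPLE_XML)
print(parsed["title"])
print(parsed["body"])
```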

EmilieDel commented 3 years ago

Collected information about PMC: https://bbpteam.epfl.ch/project/spaces/pages/viewpage.action?spaceKey=BBS&title=PMC

Stannislav commented 3 years ago

Are you OK with the Confluence page above? ^^^^

[x] @jankrepl
[x] @pafonta
[x] @FrancescoCasalegno
[x] @Stannislav

Stannislav commented 3 years ago

Thanks Emilie, very nice summary.

Here are some numbers on article counts I came across

PubMed: > 26 million (source)
PMC: > 3 million (source)
PMC OA Subset: ~ 2.75 million (source)
wc -l oa_file_list.txt = 3571537

From these numbers it seems that the OA subset covers most of PMC, but the PMC FAQ says that the majority of PMC articles are not Open Access:

The majority of the articles in PMC are subject to traditional copyright restrictions. They are free to access, but they are not Open Access articles in the specialized sense of that term.

So I'm not sure I understand.
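
One way to dig into these numbers could be to tally the file list by distinct PMC ID and by license label instead of only counting lines. A minimal sketch, assuming tab-separated rows with the PMC ID in the third column and the license in the last one (this layout and the sample rows are assumptions):

```python
from collections import Counter


def summarize_file_list(lines):
    """Return (number of distinct PMC IDs, Counter of license labels)."""
    pmcids = set()
    licenses = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Skip the header line and rows without a PMC ID in the third column.
        if len(fields) < 3 or not fields[2].startswith("PMC"):
            continue
        pmcids.add(fields[2])
        licenses[fields[-1] if len(fields) > 3 else "UNKNOWN"] += 1
    return len(pmcids), licenses


# Made-up rows in the assumed layout (not real PMC records).
SAMPLE = [
    "2021-03-01 12:00:00\n",
    "oa_package/aa/bb/PMC0000001.tar.gz\tJ Example. 2020\tPMC0000001\tPMID:1\tCC BY\n",
    "oa_package/aa/cc/PMC0000002.tar.gz\tJ Example. 2021\tPMC0000002\tPMID:2\tNO-CC CODE\n",
    "oa_package/aa/dd/PMC0000003.tar.gz\tJ Example. 2021\tPMC0000003\tCC BY\n",
]

n_articles, license_counts = summarize_file_list(SAMPLE)
print(n_articles, dict(license_counts))
```

On the real file this would be called as `summarize_file_list(open("oa_file_list.txt"))`, which could show how the line count relates to the quoted size of the OA subset.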

It's interesting that they provide tools for syncing new articles. Do you know if these can be combined with the bulk download tools? For example:

download the currently available articles with bulk download tools
sync new files with the sync tools (setting up a cron job sounds interesting too)

But maybe downloading all files one-by-one can be done in reasonable time too, I don't know.
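
The "bulk download first, then sync" idea could look roughly like the sketch below: compare the PMC IDs already on disk with the latest file list and fetch only what is missing. The function name `plan_sync` and the data shapes are hypothetical, and this does not use the official sync tools.

```python
def plan_sync(local_ids, remote_entries):
    """Return the package paths that still need to be downloaded.

    local_ids: set of PMC IDs already downloaded.
    remote_entries: dict mapping PMC ID -> package path from the latest file list.
    """
    return sorted(
        path for pmcid, path in remote_entries.items() if pmcid not in local_ids
    )


# Toy state: one article is already present, two are new.
local_ids = {"PMC0000001"}
remote_entries = {
    "PMC0000001": "oa_package/aa/bb/PMC0000001.tar.gz",
    "PMC0000002": "oa_package/aa/cc/PMC0000002.tar.gz",
    "PMC0000003": "oa_package/aa/dd/PMC0000003.tar.gz",
}

to_download = plan_sync(local_ids, remote_entries)
print(to_download)
```

A script built around this could then be scheduled from a cron job. It only detects new IDs, so articles that were updated or withdrawn upstream would need separate handling.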

EmilieDel commented 3 years ago

Thanks for the feedback @Stannislav!

I have the same numbers in mind and the same question as you. It is not really clear from their website, to be honest. Also, I do not know whether their bulk download contains the entire OA subset (which could maybe explain the FAQ extract).

I think these are really good suggestions. To better answer those questions, I think it is important to understand what we can get out of this bulk download. Do we get the entire OA subset? If I understand correctly, the bulk download gives you access to huge .tar.gz files containing papers from different journals, but that is all I could find out. If needed, I can investigate this part further today!

pafonta commented 3 years ago

Regarding the PMC vs PMC OA subset:

The home page of PMC says there are 7 million records in PMC.

There are therefore many more records in PMC than in its OA subset, as expected.

FrancescoCasalegno commented 3 years ago

Great job! 👍

BlueBrain / Search

Investigate programmatic download of PMC full-text articles #360
