BlueBrain / Search

Blue Brain text mining toolbox for semantic search and structured information extraction
https://blue-brain-search.readthedocs.io
GNU Lesser General Public License v3.0

Investigate programmatic download of PMC full-text articles #360

Closed jankrepl closed 3 years ago

jankrepl commented 3 years ago

PubMed Central (PMC) has a huge (6 million+) collection of full-text records (articles?). They seem to have an FTP server that one can use to download the articles.

This issue should answer the following questions.

Although access to the material in PMC is free, the use of the material still is subject to the copyright and/or related license terms of the respective authors or publishers. See the PMC Copyright Notice for more information.

You may NOT use any kind of automated process to download articles in bulk from the main PMC site. PMC will block the access of any user who is found to be violating this policy.

However, there are a few Text Mining Collections within PMC where bulk retrieval of files for text mining and other purposes is permitted. License terms may vary by collection or even within a collection. To download a collection in PMC, you must use a designated service, such as the PMC FTP service. See the full listing of APIs that you can use for accessing PMC data on the PMC Developer Resources page.
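As a rough illustration of what going through a designated service could look like, here is a minimal sketch that lists a directory on an FTP server with Python's standard `ftplib` and keeps only the archive files. The host name and directory below are assumptions that should be checked against the PMC FTP service documentation, and `select_archives` is a hypothetical helper, not part of any PMC tooling.

```python
from ftplib import FTP
from typing import Iterable, List

# Assumptions: verify both values against the PMC FTP service documentation.
PMC_FTP_HOST = "ftp.ncbi.nlm.nih.gov"
PMC_FTP_DIR = "/pub/pmc"


def list_remote_dir(host: str = PMC_FTP_HOST, directory: str = PMC_FTP_DIR) -> List[str]:
    """Return the file names found in `directory` on the FTP server `host`."""
    with FTP(host) as ftp:
        ftp.login()  # anonymous login
        ftp.cwd(directory)
        return ftp.nlst()


def select_archives(names: Iterable[str], suffix: str = ".tar.gz") -> List[str]:
    """Keep only the names that look like bulk archives (hypothetical helper)."""
    return sorted(name for name in names if name.endswith(suffix))
```

Calling `select_archives(list_remote_dir())` would then give the candidate bulk packages, provided the server layout matches the assumptions above.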

EmilieDel commented 3 years ago

How did the creators of the Kaggle Covid-19 dataset download data from PMC? And can we fully recreate the full-text part of the Kaggle dataset ourselves?

Here is a paper that could be of interest. Some interesting takeaways (directly from the paper):

PMC page for covid-19 results.

EmilieDel commented 3 years ago

Collected information about PMC: https://bbpteam.epfl.ch/project/spaces/pages/viewpage.action?spaceKey=BBS&title=PMC

Stannislav commented 3 years ago

Are you OK with the Confluence page above? ^^^^

Stannislav commented 3 years ago

Thanks Emilie, very nice summary.

Here are some numbers on article counts I came across

From these numbers it seems that the OA subset covers most of PMC, but the PMC FAQ says that the majority of PMC articles are not open access:

The majority of the articles in PMC are subject to traditional copyright restrictions. They are free to access, but they are not Open Access articles in the specialized sense of that term.

So I'm not sure I understand.

It's interesting that they provide tools for syncing new articles. Do you know if this can be combined with the bulk download tools? So

But maybe downloading all files one by one can also be done in a reasonable time, I don't know.
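To make the idea of combining a one-off bulk download with later syncing more concrete, here is a minimal sketch. It assumes a file listing is available in which each entry carries a last-updated timestamp. The `FileRecord` layout and the `files_to_fetch` helper are hypothetical and do not describe the actual PMC tools.

```python
from datetime import datetime
from typing import Iterable, List, NamedTuple, Optional


class FileRecord(NamedTuple):
    """One entry of a (hypothetical) remote file listing."""

    path: str
    last_updated: datetime


def files_to_fetch(
    records: Iterable[FileRecord], last_sync: Optional[datetime]
) -> List[FileRecord]:
    """Select the files that still need to be downloaded.

    With no previous sync everything is selected (the bulk case);
    otherwise only entries updated strictly after `last_sync` are kept.
    """
    if last_sync is None:
        return list(records)
    return [record for record in records if record.last_updated > last_sync]
```

The first run would pass `last_sync=None`, and later runs would pass the time of the previous sync so that only new or updated articles are fetched.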

EmilieDel commented 3 years ago

Thanks for the feedback @Stannislav!

I have the same numbers in mind and the same question as you. To be honest, it is not really clear from their website. Also, I do not know whether their bulk download contains the entire OA subset (which could maybe explain the FAQ extract).

I think these are really good suggestions. To better answer those questions, I think it is important to understand what we can get out of this bulk download. Do we have the entire OA subset? If I understand correctly, the bulk download gives you access to huge .tar.gz files containing papers from different journals, but that is all I could find out. If needed, I can investigate this part further today!
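One way to check what a bulk archive actually contains is to walk through it without extracting it. The sketch below uses Python's standard `tarfile` module to count article files per top-level directory. It assumes that each journal has its own top-level directory and that articles are stored as `.nxml` files. Both are assumptions about the archive layout that would need to be verified on a real file.

```python
import tarfile
from collections import Counter
from typing import BinaryIO


def count_articles_per_journal(archive: BinaryIO, suffix: str = ".nxml") -> Counter:
    """Count files ending in `suffix` per top-level directory of a .tar.gz.

    Assumption: the top-level directory name identifies the journal.
    """
    counts: Counter = Counter()
    with tarfile.open(fileobj=archive, mode="r:gz") as tar:
        for member in tar:  # streams through the archive, nothing is extracted
            if member.isfile() and member.name.endswith(suffix):
                journal = member.name.split("/", 1)[0]
                counts[journal] += 1
    return counts
```

For a downloaded archive one could open it with `open(path, "rb")` and pass the file object in. Summing the counts over all archives would give a number to compare with the reported size of the OA subset.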

pafonta commented 3 years ago

Regarding the PMC vs PMC OA subset:

The home page of PMC says there are 7 million records in PMC.

There are therefore many more records in PMC than in its OA subset, as expected.

FrancescoCasalegno commented 3 years ago

Great job! 👍