Closed jankrepl closed 3 years ago
Here is a paper that could be of interest. Some interesting takeaways (directly from the paper):
coronavirus
, ..) in their title, abstract, or body text are included in the dataset. {doi, pmc_id, pubmed_id, arxiv_id, who_covidence_id, mag_id}
.PMC page for covid-19 results.
Collected information about PMC: https://bbpteam.epfl.ch/project/spaces/pages/viewpage.action?spaceKey=BBS&title=PMC
Are you OK with the Confluence page above? ^^^^
Thanks Emilie, very nice summary.
Here are some numbers on article counts I came across:
wc -l oa_file_list.txt
= 3571537
From these numbers it seems that the OA subset covers most of PMC, but the PMC FAQ says that the majority of PMC articles are not open access:
The majority of the articles in PMC are subject to traditional copyright restrictions. They are free to access, but they are not Open Access articles in the specialized sense of that term.
So I'm not sure I understand.
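For what it's worth, the `wc -l` count above can also be reproduced in Python, with a tally per license on top. This is only a rough sketch: it assumes the file list is tab-separated, that the last column holds the license, and that any line without tabs (for example a header) is not an article. The sample rows and the name `summarize_file_list` are made up for illustration.

```python
from collections import Counter
from typing import Iterable, Tuple


def summarize_file_list(lines: Iterable[str]) -> Tuple[int, Counter]:
    """Count article rows and tally them by the last tab-separated column."""
    n_articles = 0
    by_last_column: Counter = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Assumption: lines without any tab (header, blank line) are not articles.
        if len(fields) < 2:
            continue
        n_articles += 1
        # Assumption: the last column is the license.
        by_last_column[fields[-1]] += 1
    return n_articles, by_last_column


# Made-up sample rows; the real file would be read with open("oa_file_list.txt").
sample_lines = [
    "2020-01-01 00:00:00\n",
    "path/to/PMC0000001.tar.gz\tJ Example. 2020\tPMC0000001\tCC BY\n",
    "path/to/PMC0000002.tar.gz\tJ Example. 2020\tPMC0000002\tCC BY\n",
    "path/to/PMC0000003.tar.gz\tJ Other. 2019\tPMC0000003\tNO-CC CODE\n",
]

n_articles, license_counts = summarize_file_list(sample_lines)
print(n_articles, dict(license_counts))
```

Passing an open file handle instead of `sample_lines` works the same way, since the function only iterates over lines.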
It's interesting they provide tools for syncing new articles. Do you know if this can be combined with the bulk download tools? So
But maybe downloading all files one-by-one can be done in reasonable time too, I don't know.
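A quick back-of-envelope estimate for the one-by-one option, using the 3571537 entries counted in `oa_file_list.txt`. The download rates are pure assumptions (I have not measured anything), and `estimate_days` is just a helper name for this sketch.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def estimate_days(n_files: int, files_per_second: float) -> float:
    """Return how many days it takes to fetch n_files at a constant rate."""
    if files_per_second <= 0:
        raise ValueError("files_per_second must be positive")
    return n_files / files_per_second / SECONDS_PER_DAY


N_FILES = 3_571_537  # line count of oa_file_list.txt reported above

# Hypothetical sustained rates; the real rate depends on server and network.
for rate in (1.0, 5.0, 20.0):
    print(f"{rate:>5.1f} files/s -> {estimate_days(N_FILES, rate):.1f} days")
```

At one file per second this comes out at roughly 41 days, so downloading one-by-one only looks reasonable if several files per second can be sustained.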
Thanks for the feedback @Stannislav!
I have the same numbers in mind and the same question as you. To be honest, it is not really clear from their website. Also, I do not know whether their bulk download contains the entire OA subset (which could maybe explain the FAQ extract).
I think these are really good suggestions. To better answer those questions, I think it is important to understand what we can get out of this bulk download. Do we have the entire OA subset? If I understand correctly, the bulk download gives you access to huge .tar.gz files containing papers from different journals, but that is all I could find out. If needed, I can investigate this part further today!
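To get a first idea of what one of these .tar.gz files contains without unpacking it, something like the sketch below could help. It assumes that the archive has one top-level folder per journal with the article files inside; I have not verified this layout. The demo archive is built in memory with made-up names, and `count_files_per_top_dir` is a hypothetical helper.

```python
import io
import tarfile
from collections import Counter
from typing import BinaryIO


def count_files_per_top_dir(archive: BinaryIO) -> Counter:
    """Count regular files per top-level directory of a .tar.gz archive."""
    counts: Counter = Counter()
    # "r|gz" reads the archive as a stream, so nothing is extracted to disk.
    with tarfile.open(fileobj=archive, mode="r|gz") as tar:
        for member in tar:
            if member.isfile():
                # Assumption: the first path component is the journal name.
                top_dir = member.name.split("/", 1)[0]
                counts[top_dir] += 1
    return counts


def build_demo_archive() -> io.BytesIO:
    """Build a tiny in-memory .tar.gz with made-up journal folders."""
    buffer = io.BytesIO()
    names = ["Journal_A/PMC1.nxml", "Journal_A/PMC2.nxml", "Journal_B/PMC3.nxml"]
    with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
        for name in names:
            data = b"<article/>"
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    buffer.seek(0)
    return buffer


demo_counts = count_files_per_top_dir(build_demo_archive())
print(dict(demo_counts))
```

For a real bulk file, `open(path, "rb")` can be passed in place of the demo archive. Summing the counts over all bulk files and comparing with the file list count would tell us whether the bulk download covers the entire OA subset.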
Regarding the PMC vs PMC OA subset:
The home page of PMC says there are 7 million records in PMC. There are therefore many more records in PMC than in its OA subset, as expected.
Great job! 👍
PubMed Central (PMC) has a huge (6 million+) collection of full-text records (articles?). They seem to have an FTP server that one can use to download the articles.
This issue should answer the following questions.
bluesearch code (custom made for Kaggle data) to deal with raw PMC data?