BlueBrain / Search

Blue Brain text mining toolbox for semantic search and structured information extraction
https://blue-brain-search.readthedocs.io
GNU Lesser General Public License v3.0

Download / Parse / Load PubMed abstracts #460

Closed jankrepl closed 3 years ago

jankrepl commented 3 years ago

🚀 Feature

Motivation

More articles (just abstracts though)

Pitch

More articles are better

Alternatives

?

Additional context

By PubMed we mean the database of article abstracts + metadata. See https://pubmed.ncbi.nlm.nih.gov/

PubMed® comprises more than 32 million citations for biomedical literature...

pafonta commented 3 years ago

Licence

CHECKED Several conditions should be fulfilled to use the data. They are described [here](https://www.nlm.nih.gov/databases/download/terms_and_conditions_pubmed.html).

> _Please note some PubMed/MEDLINE abstracts may be protected by copyright._ ([source](https://www.nlm.nih.gov/databases/download/terms_and_conditions_pubmed.html))

> _When using NLM Web sites, you may encounter documents, illustrations, photographs, or other content contributed by or licensed from private individuals, companies, or organizations that may be protected by U.S. and international copyright laws. You can sometimes tell if content is copyrighted if it has the copyright symbol, the name of the copyright holder, or the statement "All rights reserved." However, a copyright notice is not required by law and therefore not all copyrighted content is necessarily marked in this way. Transmission or reproduction of copyrighted items (beyond that allowed by fair use as defined in the U.S. copyright laws) requires the written permission of the copyright holders._ ([source](https://www.nlm.nih.gov/web_policies.html#copyright))

Download

DONE The bulk download of PubMed data is available through an [FTP server](https://ftp.ncbi.nlm.nih.gov/pubmed/baseline/). To download everything:

```
wget -m --no-parent --show-progress -o download.log ftp://ftp.ncbi.nlm.nih.gov/pubmed/baseline/
```

➡️ Command executed on 05.10.2021. Downloaded 31 GB in around 2h45.
➡️ Number of articles downloaded: 31,850,052.
➡️ MD5 sums checked (method described below).

To check the MD5 sums of all downloaded files:

```python
import hashlib
import re
from pathlib import Path

from tqdm import tqdm

regex = re.compile("[a-z0-9]{32}$")
dirpath = Path("baseline/")

for filepath in tqdm(dirpath.glob("*.gz")):
    # Compute the MD5 sum of the downloaded archive.
    with filepath.open("rb") as f:
        data = f.read()
    md5 = hashlib.md5(data).hexdigest()

    # Read the expected MD5 sum from the accompanying .md5 file.
    md5path = f"{filepath}.md5"
    with open(md5path) as f:
        data = f.readline()
    md5_expected = regex.search(data).group(0)

    # Report any mismatch.
    if md5 != md5_expected:
        print(filepath.stem)
```

Parsing

MERGED PubMed data conform to a common [XML schema](http://dtd.nlm.nih.gov/ncbi/pubmed/out/pubmed_190101.dtd). The parsing logic is implemented in #465.
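The parsing itself lives in #465. Purely as an illustration of what extracting fields from that XML schema involves, here is a minimal sketch using the standard library; the element names (`PMID`, `ArticleTitle`, `AbstractText`) come from the schema, but the helper and the sample snippet are hypothetical, not the code from #465:

```python
from xml.etree import ElementTree

# Minimal PubmedArticleSet snippet mimicking the baseline XML layout.
xml = """
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>12345</PMID>
      <Article>
        <ArticleTitle>A study of neurons</ArticleTitle>
        <Abstract>
          <AbstractText>Neurons are interesting.</AbstractText>
        </Abstract>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>
"""


def parse_articles(root):
    """Yield (pmid, title, abstract) tuples from a PubmedArticleSet element."""
    for article in root.iter("PubmedArticle"):
        pmid = article.findtext(".//PMID")
        title = article.findtext(".//ArticleTitle")
        # An abstract can have several AbstractText sections; join their text.
        abstract = " ".join(
            "".join(elem.itertext()) for elem in article.iter("AbstractText")
        )
        yield pmid, title, abstract


root = ElementTree.fromstring(xml)
records = list(parse_articles(root))
```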

Loading

MERGED The loading logic is implemented in #468.
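The actual loading is in #468. As a rough sketch of the general idea only — the table name, columns, and in-memory SQLite backend below are invented for illustration and are not the schema used in #468:

```python
import sqlite3

# Hypothetical records as produced by a parser: (pmid, title, abstract).
records = [
    ("12345", "A study of neurons", "Neurons are interesting."),
    ("67890", "Glial cells revisited", "Glia matter too."),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE articles (pmid TEXT PRIMARY KEY, title TEXT, abstract TEXT)"
)
conn.executemany("INSERT INTO articles VALUES (?, ?, ?)", records)
conn.commit()

n_rows = conn.execute("SELECT COUNT(*) FROM articles").fetchone()[0]
```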

EmilieDel commented 3 years ago

Findings about PubMed

ESearch

From https://dataguide.nlm.nih.gov/eutilities/utilities.html#esearch:

> ESearch (esearch.fcgi) searches a database and returns a list of unique identifiers (UIDs) for records in that database which meet the search criteria. You can specify the search query, sort results, filter results by date, or combine multiple searches with Boolean AND/OR/NOT by adjusting the parameters. Remember, ESearch only returns UIDs, not full records. To retrieve the full records for each of the UIDs in your result set, consider using the EFetch utility.

```python
import json

import requests

term = "neuroscience"
url = f"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term={term}&retmax=100000&retmode=json"
response = requests.get(url)
rep = json.loads(response.content.decode())

n_results = int(rep["esearchresult"]["count"])  # retrieve the total number of results
first_ids = rep["esearchresult"]["idlist"]  # retrieve some/all IDs
```

Notes:
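One caveat worth noting: the NCBI E-utilities documentation caps `retmax` at 10,000 per ESearch request, so larger result sets have to be paged with `retstart`. A sketch of building the paged URLs (no request is sent here; the helper is hypothetical):

```python
BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def paged_urls(term, n_results, page_size=10_000):
    """Build one ESearch URL per page of results, stepping retstart."""
    for retstart in range(0, n_results, page_size):
        yield (
            f"{BASE}?db=pubmed&term={term}"
            f"&retstart={retstart}&retmax={page_size}&retmode=json"
        )


urls = list(paged_urls("neuroscience", 25_000))
```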

EFetch

From https://dataguide.nlm.nih.gov/eutilities/utilities.html#efetch:

> EFetch (efetch.fcgi) returns full data records for a list of unique identifiers (UIDs) in a format specified in the parameters. The list of UIDs is either provided in the parameters, or is retrieved from the History server.

```python
import requests
from defusedxml import ElementTree

article_id = first_ids[0]  # first PMID from the ESearch call above
url = f"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id={article_id}&retmode=xml"
response = requests.get(url)
rep = response.content.decode()

article_set = ElementTree.fromstring(rep)
```

Notes:
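EFetch also accepts a comma-separated list of IDs, so abstracts can be fetched in batches rather than one request per PMID. A hedged sketch of building batched URLs (the batch size is an arbitrary choice and no request is actually sent):

```python
def batched_efetch_urls(ids, batch_size=200):
    """Build one EFetch URL per batch of PMIDs, comma-joining the IDs."""
    base = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
    for i in range(0, len(ids), batch_size):
        batch = ids[i : i + batch_size]
        yield f"{base}?db=pubmed&id={','.join(batch)}&retmode=xml"


urls = list(batched_efetch_urls([str(n) for n in range(500)]))
```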

pafonta commented 3 years ago

#465 and #468 implement what this issue requires. These two pull requests have been merged, so closing this issue.