BlueBrain / Search

Blue Brain text mining toolbox for semantic search and structured information extraction
https://blue-brain-search.readthedocs.io
GNU Lesser General Public License v3.0

Download / Parse / Load PubMed abstracts #460

Closed jankrepl closed 3 years ago

jankrepl commented 3 years ago

🚀 Feature

Motivation

More articles (just abstracts though)

Pitch

More articles are better

Alternatives

?

Additional context

By PubMed we mean the database of article abstracts + metadata. See https://pubmed.ncbi.nlm.nih.gov/

PubMed® comprises more than 32 million citations for biomedical literature...

pafonta commented 3 years ago

Licence

CHECKED Several conditions should be fulfilled to use the data. They are described [here](https://www.nlm.nih.gov/databases/download/terms_and_conditions_pubmed.html).

> _Please note some PubMed/MEDLINE abstracts may be protected by copyright._ ([source](https://www.nlm.nih.gov/databases/download/terms_and_conditions_pubmed.html))

> _When using NLM Web sites, you may encounter documents, illustrations, photographs, or other content contributed by or licensed from private individuals, companies, or organizations that may be protected by U.S. and international copyright laws. You can sometimes tell if content is copyrighted if it has the copyright symbol, the name of the copyright holder, or the statement "All rights reserved." However, a copyright notice is not required by law and therefore not all copyrighted content is necessarily marked in this way. Transmission or reproduction of copyrighted items (beyond that allowed by fair use as defined in the U.S. copyright laws) requires the written permission of the copyright holders._ ([source](https://www.nlm.nih.gov/web_policies.html#copyright))

Download

DONE The bulk download of PubMed data is available through an [FTP server](https://ftp.ncbi.nlm.nih.gov/pubmed/baseline/). To download everything:

```
wget -m --no-parent --show-progress -o download.log ftp://ftp.ncbi.nlm.nih.gov/pubmed/baseline/
```

➡️ Command executed on 05.10.2021. Downloaded 31 GB in around 2h45.
➡️ Number of articles downloaded: 31,850,052.
➡️ MD5 sums checked (method described below).

To check the MD5 sums of all downloaded files:

```python
import hashlib
import re
from pathlib import Path

from tqdm import tqdm

regex = re.compile("[a-z0-9]{32}$")
dirpath = Path("baseline/")

for filepath in tqdm(dirpath.glob("*.gz")):
    # Compute the MD5 sum of the downloaded archive.
    with filepath.open("rb") as f:
        data = f.read()
    md5 = hashlib.md5(data).hexdigest()

    # Read the expected MD5 sum from the accompanying .md5 file.
    md5path = f"{filepath}.md5"
    with open(md5path) as f:
        data = f.readline()
    md5_expected = regex.search(data).group(0)

    # Report any mismatch.
    if md5 != md5_expected:
        print(filepath.stem)
```

Parsing

MERGED PubMed data conform to a common [XML schema](http://dtd.nlm.nih.gov/ncbi/pubmed/out/pubmed_190101.dtd). The parsing logic is implemented in #465.
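The parsing itself lives in #465. Purely as an illustration of what extracting fields from that XML schema involves, here is a minimal sketch using the standard library; the element names (`PMID`, `ArticleTitle`, `AbstractText`) come from the schema, but the helper and the sample snippet are hypothetical, not the code from #465:

```python
from xml.etree import ElementTree

# Minimal PubmedArticleSet snippet mimicking the baseline XML layout.
xml = """
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>12345</PMID>
      <Article>
        <ArticleTitle>A study of neurons</ArticleTitle>
        <Abstract>
          <AbstractText>Neurons are interesting.</AbstractText>
        </Abstract>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>
"""


def parse_articles(root):
    """Yield (pmid, title, abstract) tuples from a PubmedArticleSet element."""
    for article in root.iter("PubmedArticle"):
        pmid = article.findtext(".//PMID")
        title = article.findtext(".//ArticleTitle")
        # An abstract can have several AbstractText sections; join their text.
        abstract = " ".join(
            "".join(elem.itertext()) for elem in article.iter("AbstractText")
        )
        yield pmid, title, abstract


root = ElementTree.fromstring(xml)
records = list(parse_articles(root))
```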

Loading

MERGED The loading logic is implemented in #468.
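The actual loading is in #468. As a rough sketch of the general idea only — the table name, columns, and in-memory SQLite backend below are invented for illustration and are not the schema used in #468:

```python
import sqlite3

# Hypothetical records as produced by a parser: (pmid, title, abstract).
records = [
    ("12345", "A study of neurons", "Neurons are interesting."),
    ("67890", "Glial cells revisited", "Glia matter too."),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE articles (pmid TEXT PRIMARY KEY, title TEXT, abstract TEXT)"
)
conn.executemany("INSERT INTO articles VALUES (?, ?, ?)", records)
conn.commit()

n_rows = conn.execute("SELECT COUNT(*) FROM articles").fetchone()[0]
```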

EmilieDel commented 3 years ago

Findings about PubMed

ESearch

From https://dataguide.nlm.nih.gov/eutilities/utilities.html#esearch:

> ESearch (esearch.fcgi) searches a database and returns a list of unique identifiers (UIDs) for records in that database which meet the search criteria. You can specify the search query, sort results, filter results by date, or combine multiple searches with Boolean AND/OR/NOT by adjusting the parameters. Remember, ESearch only returns UIDs, not full records. To retrieve the full records for each of the UIDs in your result set, consider using the EFetch utility.

```python
import json

import requests

term = "neuroscience"
url = f"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term={term}&retmax=100000&retmode=json"
response = requests.get(url)
rep = json.loads(response.content.decode())

n_results = int(rep["esearchresult"]["count"])  # retrieve the total number of results
first_ids = rep["esearchresult"]["idlist"]  # retrieve some/all IDs
```

Notes:
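One caveat worth noting: the NCBI E-utilities documentation caps `retmax` at 10,000 per ESearch request, so larger result sets have to be paged with `retstart`. A sketch of building the paged URLs (no request is sent here; the helper is hypothetical):

```python
BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def paged_urls(term, n_results, page_size=10_000):
    """Build one ESearch URL per page of results, stepping retstart."""
    for retstart in range(0, n_results, page_size):
        yield (
            f"{BASE}?db=pubmed&term={term}"
            f"&retstart={retstart}&retmax={page_size}&retmode=json"
        )


urls = list(paged_urls("neuroscience", 25_000))
```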

EFetch

From https://dataguide.nlm.nih.gov/eutilities/utilities.html#efetch:

> EFetch (efetch.fcgi) returns full data records for a list of unique identifiers (UIDs) in a format specified in the parameters. The list of UIDs is either provided in the parameters, or is retrieved from the History server.

```python
import requests
from defusedxml import ElementTree

article_id = first_ids[0]  # first PMID from the ESearch call above
url = f"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id={article_id}&retmode=xml"
response = requests.get(url)
rep = response.content.decode()

article_set = ElementTree.fromstring(rep)
```

Notes:
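EFetch also accepts a comma-separated list of IDs, so abstracts can be fetched in batches rather than one request per PMID. A hedged sketch of building batched URLs (the batch size is an arbitrary choice and no request is actually sent):

```python
def batched_efetch_urls(ids, batch_size=200):
    """Build one EFetch URL per batch of PMIDs, comma-joining the IDs."""
    base = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
    for i in range(0, len(ids), batch_size):
        batch = ids[i : i + batch_size]
        yield f"{base}?db=pubmed&id={','.join(batch)}&retmode=xml"


urls = list(batched_efetch_urls([str(n) for n in range(500)]))
```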

pafonta commented 3 years ago

#465 and #468 implement what this issue requires. These two pull requests have been merged, so closing this issue.