Closed pnrobinson closed 3 years ago
So far, I have written a short program to download the gzipped XML files from the NCBI FTP site. I have only downloaded a few files as a test. This is checked in on the develop branch. I will run the program on Sumner to download all the files to the Robinson Lab's shared disk space.
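A minimal sketch of what such a download step could look like, not the program that is checked in. The base URL, the `pubmedYYnNNNN.xml.gz` file-name pattern, the year suffix "20", and the function names are all assumptions made for illustration.

```python
import urllib.request
from pathlib import Path

# Assumed location of the PubMed baseline files; adjust if the layout differs.
BASELINE_URL = "https://ftp.ncbi.nlm.nih.gov/pubmed/baseline/"


def baseline_file_names(year_suffix, n_files):
    """Build file names such as pubmed20n0001.xml.gz (assumed naming pattern)."""
    return [f"pubmed{year_suffix}n{i:04d}.xml.gz" for i in range(1, n_files + 1)]


def download_files(file_names, out_dir, base_url=BASELINE_URL):
    """Download each file into out_dir, skipping files that already exist."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    for name in file_names:
        target = out_path / name
        if target.exists():
            continue  # allows an interrupted run to be resumed
        urllib.request.urlretrieve(base_url + name, str(target))


if __name__ == "__main__":
    # Small test run: fetch only the first three files.
    download_files(baseline_file_names("20", 3), "pubmed_test")
```

Skipping files that already exist keeps a long run on the cluster restartable without downloading everything again.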
Need to figure out how to exploit the tree structure of MeSH to recognize sub-concepts of Protein Kinase (D011494) without having to include as a search term every protein kinase code that is a sub-concept of D011494.
I have downloaded the 1015 gzipped files of PubMed abstracts to the /projects/robinson-lab/PMP/pubmed directory on Sumner.
Implemented a function that takes as input a MeSH descriptor id such as 'D009369' (Neoplasms) and returns a dictionary containing all the descendants of 'D009369'. The dictionary maps the MeSH id to the label for each descriptor. It relies on a CONSTRUCT query to the NLM SPARQL endpoint. I chose a CONSTRUCT query because there is a limit of 1000 on the size of the return set for a SELECT query. In practice I think it's unlikely that we would exceed 1000 descendants for a single MeSH descriptor: Neoplasms has 693 descendants, while Pathological Conditions, Signs and Symptoms (D013568) has 859.
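For reference, a hedged sketch of how a descendant query of this kind might be built. It is not the function described above: the `meshv:broaderDescriptor` property path, the prefixes, and both helper names are assumptions about the MeSH RDF vocabulary and should be checked against the NLM documentation. Sending the query and parsing the response are left out because they depend on the endpoint's parameters and output format.

```python
import re


def build_descendant_query(descriptor_id):
    """Return a CONSTRUCT query for all descendants of a MeSH descriptor.

    Assumes descendants are reachable by following meshv:broaderDescriptor
    one or more times (an assumption about the MeSH RDF schema).
    """
    if not re.fullmatch(r"D\d+", descriptor_id):
        raise ValueError(f"not a MeSH descriptor id: {descriptor_id!r}")
    return f"""PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX meshv: <http://id.nlm.nih.gov/mesh/vocab#>
PREFIX mesh: <http://id.nlm.nih.gov/mesh/>
CONSTRUCT {{ ?descendant rdfs:label ?label }}
WHERE {{
  ?descendant meshv:broaderDescriptor+ mesh:{descriptor_id} .
  ?descendant rdfs:label ?label .
}}
"""


def labels_by_id(uri_label_pairs):
    """Map MeSH ids to labels, given (descriptor URI, label) pairs.

    The id is taken to be the last path segment of the URI.
    """
    return {uri.rstrip("/").rsplit("/", 1)[-1]: label
            for uri, label in uri_label_pairs}


if __name__ == "__main__":
    print(build_descendant_query("D009369"))
```

Validating the id before interpolating it keeps malformed input out of the query text.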
Filtering abstracts by MeSH descriptor ids is now implemented in scripts/filter_abstracts.py. I'm going to add publication date to the info extracted from the .xml.gz files we download from NCBI.
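A rough sketch of how the extraction and descriptor filter could work, assuming the standard PubMed XML layout (`PubmedArticle`, `MeshHeading/DescriptorName` with a `UI` attribute, `KeywordList/Keyword`, `PubDate/Year`). The function names are hypothetical and this is not the code in scripts/filter_abstracts.py.

```python
import gzip
import xml.etree.ElementTree as ET


def parse_article(article):
    """Extract PMID, publication year, MeSH ids and keywords from one
    <PubmedArticle> element."""
    return {
        "pmid": article.findtext("MedlineCitation/PMID"),
        # Some records carry a MedlineDate instead of a Year; then this is None.
        "year": article.findtext(".//PubDate/Year"),
        "mesh_ids": {d.get("UI")
                     for d in article.iterfind(".//MeshHeading/DescriptorName")},
        "keywords": [k.text.strip()
                     for k in article.iterfind(".//KeywordList/Keyword")
                     if k.text],
    }


def iter_articles(gz_path):
    """Stream records from a gzipped PubMed file without loading it whole."""
    with gzip.open(gz_path, "rb") as handle:
        for _, elem in ET.iterparse(handle, events=("end",)):
            if elem.tag == "PubmedArticle":
                yield parse_article(elem)
                elem.clear()  # free memory held by the processed article


def has_search_descriptor(record, search_ids):
    """True if the article carries at least one descriptor in the search set."""
    return bool(record["mesh_ids"] & set(search_ids))
```

Streaming with `iterparse` matters here because each file holds many thousands of records.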
Naive filtering is complete for MeSH descriptors (PR #20), but articles that have only keywords and no MeSH descriptors are labeled irrelevant regardless of what the keywords are.
For the articles that have only keywords, I would suggest that we match the keywords against the MeSH descriptors and their synonyms. It will not matter for the purposes of downstream analysis if we miss a few articles, and it is not clear to me that we can do much better at this time, but if there are any ideas for other approaches let's discuss!
I have figured out where to find synonyms for MeSH descriptors. It was not obvious: they are the labels/altLabels of the term(s) associated with the descriptor. I'll see what I can do with comparing MeSH synonyms to article keywords. I also explored the biopython library, which provides an easy-to-use interface to the Entrez API. This would be a different way to bypass the keyword issue: rely on PubMed's search function to identify relevant articles, which would probably be more effective than comparing keywords to synonyms.
@pnrobinson @vidarmehr I have implemented a straightforward but rather rudimentary filtering of PubMed abstracts by keyword. I first retrieve the preferred label and any synonyms for each MeSH descriptor in the search set (= the descriptors specified by the user and all their descendants in the classification tree). If an abstract does not match the search on its MeSH descriptors (because it has the wrong descriptors, or none at all) I compare the set of its keywords to the set of labels and synonyms. If they have any strings in common, the abstract is relevant.
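The two-stage check described above could be sketched roughly as follows. The function name and arguments are hypothetical, and the case-insensitive comparison is an assumption rather than a description of the actual implementation.

```python
def is_relevant(mesh_ids, keywords, search_ids, search_labels):
    """Two-stage relevance check.

    1. Relevant if the article has any MeSH descriptor in the search set.
    2. Otherwise relevant if any keyword equals a label or synonym of a
       descriptor in the search set (compared case-insensitively here,
       which is an assumption).
    """
    if set(mesh_ids) & set(search_ids):
        return True
    lowered_keywords = {k.lower() for k in keywords}
    lowered_labels = {label.lower() for label in search_labels}
    return bool(lowered_keywords & lowered_labels)


if __name__ == "__main__":
    labels = {"Basal Cell Carcinomas", "Basal Cell Carcinoma",
              "Carcinoma, Basal Cell", "Carcinomas, Basal Cell"}
    # No descriptors on the article, so the keyword fallback decides.
    print(is_relevant(set(), ["basal cell carcinoma"], {"D002280"}, labels))
```

Because the fallback only compares whole strings, its recall depends entirely on how complete each descriptor's synonym set is.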
The problem is that some MeSH descriptors have a fairly complete set of synonyms and others do not. For example, the keyword 'basal cell carcinoma' matches the synonym set D002280 {'Basal Cell Carcinomas', 'Basal Cell Carcinoma', 'Carcinoma, Basal Cell', 'Carcinomas, Basal Cell'} but 'squamous cell carcinoma' does not match D002294 {'Carcinoma, Squamous Cell', 'Squamous Cell Carcinomas', 'Carcinomas, Squamous Cell'} and 'tumor suppressor genes' does not match D016147 {'Genes, Tumor Suppressor', 'Tumor Suppressor Gene', 'Gene, Tumor Suppressor'} although the singular would.
These discrepancies could be remedied by stemming both the keywords and the synonym sets (removing all plurals), but other mismatches will not be fixed by stemming. For example, the keyword 'breast cancer' occurs frequently but does not match D001943 {'Breast Neoplasms', 'Neoplasm, Breast', 'Breast Neoplasm'}, and the same is true for every other major category of cancer; for example, 'liver cancer' does not match D008113 {'Liver Neoplasms'}. We are ruling out a lot of abstracts that would be relevant for our search if we could only recognize that 'neoplasm' is a manifestation of 'cancer'. Missing the keywords is more significant for recent articles because these are less likely to have MeSH descriptors. The percentage of abstracts that have keywords but no descriptors hit 40% in 2019, and continued to rise from there.
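To illustrate what plural stripping would and would not fix, here is a deliberately naive sketch. The plural rule and the reordering of comma-inverted MeSH labels are simplifying assumptions, the function names are hypothetical, and a real stemmer would behave differently.

```python
def singularize(word):
    """Very naive plural stripping: drop a trailing 's' unless the word
    ends in 'ss', 'us' or 'is' (e.g. 'squamous' is left alone)."""
    if len(word) > 3 and word.endswith("s") and not word.endswith(("ss", "us", "is")):
        return word[:-1]
    return word


def normalize_term(term):
    """Lowercase, undo MeSH comma inversion, and strip plurals.

    'Carcinoma, Squamous Cell' is reordered to 'Squamous Cell Carcinoma'
    by reversing the comma-separated parts, which is only a heuristic.
    """
    parts = [p.strip() for p in term.split(",")]
    reordered = " ".join(reversed(parts))
    return " ".join(singularize(w) for w in reordered.lower().split())


def keyword_matches(keyword, synonyms):
    """True if the normalized keyword equals any normalized synonym."""
    return normalize_term(keyword) in {normalize_term(s) for s in synonyms}


if __name__ == "__main__":
    scc = {"Carcinoma, Squamous Cell", "Squamous Cell Carcinomas",
           "Carcinomas, Squamous Cell"}
    breast = {"Breast Neoplasms", "Neoplasm, Breast", "Breast Neoplasm"}
    print(keyword_matches("squamous cell carcinoma", scc))  # plural variants now align
    print(keyword_matches("breast cancer", breast))         # vocabulary gap remains
```

Normalization of this kind closes the singular/plural gaps, but 'cancer' versus 'neoplasm' is a vocabulary difference that no amount of stemming addresses.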
Filtering of PubMed abstracts is implemented in filter_abstracts.py.
The PubMed XML format has fields for MeSH headers (see the XML format here: https://www.ncbi.nlm.nih.gov/pubmed/12471242?report=xml&format=text).
In this case, one of the relevant headings is:
Looking at this MeSH term in the MeSH browser (https://meshb.nlm.nih.gov/record/ui?ui=D020930), we can see it is a child of Protein kinases (https://meshb.nlm.nih.gov/record/ui?ui=D011494).
So, one way of implementing a naive relevancy filter would be to get all abstracts that mention Protein kinases (or children thereof) together with (set union) abstracts that mention cancer (MeSH term Neoplasms: https://meshb.nlm.nih.gov/record/ui?ui=D009369).
One complication is that the very newest abstracts (i.e., those that have just appeared in PubMed) do not have MeSH headers yet. I think we can just use simple text matching of the NotNLM field. Here is an example from (https://www.ncbi.nlm.nih.gov/pubmed/32092437?report=xml&format=text):