Closed. FrancescoCasalegno closed this issue 2 years ago.
First, I downloaded the .xml file containing the MeSH Descriptor list:

```shell
wget https://nlmpubs.nlm.nih.gov/projects/mesh/MESH_FILES/xmlmesh/desc2021.xml
```
Code snippet to find a term and all of its children:

```python
from collections import defaultdict

from defusedxml import ElementTree

all_meshs = ElementTree.parse("desc2021.xml")
all_meshs_of_interest = defaultdict(list)

for d in all_meshs.findall("./DescriptorRecord"):
    name = d.findtext("./DescriptorName/String")
    # A descriptor can appear in several places of the tree,
    # so check every tree number, not just the last one.
    tree_numbers = [el.text for el in d.findall("./TreeNumberList/TreeNumber")]
    if name is not None and any(t.startswith("C10") for t in tree_numbers):
        # Nervous System Diseases [C10]
        all_meshs_of_interest["C10"].append(name)
```
Results (running the same snippet with the prefix `H01.158.610`):

```python
print(all_meshs_of_interest["H01.158.610"])
# ['Cognitive Neuroscience', 'Neuroanatomy', 'Neurobiology', 'Neurosciences']
```
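The prefix-matching logic above can be sketched in a self-contained way on a tiny inline sample. The records below are hypothetical and only mimic the structure of `desc2021.xml`; the stdlib `xml.etree` parser is used here for self-containment, although `defusedxml` is preferable for untrusted input.

```python
import xml.etree.ElementTree as ElementTree  # defusedxml preferred for untrusted XML

# Tiny inline sample mimicking the structure of desc2021.xml
# (hypothetical records, for illustration only).
SAMPLE = """<DescriptorRecordSet>
  <DescriptorRecord>
    <DescriptorName><String>Neurosciences</String></DescriptorName>
    <TreeNumberList><TreeNumber>H01.158.610</TreeNumber></TreeNumberList>
  </DescriptorRecord>
  <DescriptorRecord>
    <DescriptorName><String>Neuroanatomy</String></DescriptorName>
    <TreeNumberList><TreeNumber>H01.158.610.500</TreeNumber></TreeNumberList>
  </DescriptorRecord>
  <DescriptorRecord>
    <DescriptorName><String>Migraine Disorders</String></DescriptorName>
    <TreeNumberList><TreeNumber>C10.228.140.546.399.750</TreeNumber></TreeNumberList>
  </DescriptorRecord>
</DescriptorRecordSet>"""


def names_under(root, prefix):
    """Collect descriptor names having at least one tree number under `prefix`."""
    names = []
    for record in root.findall("./DescriptorRecord"):
        name = record.findtext("./DescriptorName/String")
        trees = [el.text for el in record.findall("./TreeNumberList/TreeNumber")]
        if name and any(t.startswith(prefix) for t in trees):
            names.append(name)
    return names


root = ElementTree.fromstring(SAMPLE)
print(names_under(root, "H01.158.610"))  # ['Neurosciences', 'Neuroanatomy']
```

Note that a plain `startswith` match also returns the term itself (as in the `H01.158.610` result above); to exclude accidental matches such as a hypothetical `C100`, one could test `t == prefix or t.startswith(prefix + ".")` instead.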
Some numbers:
Tree ID | Name | n. children |
---|---|---|
A08 | Nervous System | 271 |
C10 | Nervous System Diseases | 241 |
F03 | Mental Disorders | 212 |
D27.505.954.427 | Central Nervous System Agents | 44 |
E04.525 | Neurosurgical Procedures | 35 |
C16.320.400 | Heredodegenerative Disorders, Nervous System | 27 |
E05.629 | Neuroimaging | 13 |
H01.158.610 | Neurosciences | 4 |
G03.185 | Brain Chemistry | 1 |
E04.190 | Deep Brain Stimulation | 1 |
E07.305.076 | Brain-Computer Interfaces | 1 |
Here is the number of articles for which it is possible to identify topics:

Meshes | N articles | %
---|---|---
Both journal and article | 165,782 | 12.36 %
Only journal | 1,082,983 | 80.77 %
Only article | 6,922 | 0.52 %
Neither | 85,179 | 6.35 %
Total | 1,340,866 | 100 %
80,571 of those articles (around 6 %) contain at least one of the MeSH terms identified above (through journal or article info).
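As a quick sanity check, the percentages can be recomputed from the raw counts in the table (they should agree with the table up to rounding):

```python
# Counts taken from the coverage table above; shares recomputed here.
counts = {
    "both": 165_782,
    "only_journal": 1_082_983,
    "only_article": 6_922,
    "nothing": 85_179,
}
total = sum(counts.values())
shares = {k: round(100 * v / total, 2) for k, v in counts.items()}
print(total, shares)  # total is 1340866
```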
Hello,

The following MeSH terms could also be useful to identify articles of interest:

Methodology

The output has been curated from the following:

Statistics

Interesting papers:

- considering only MeSH terms from articles: 22 % of the elements with MeSH terms (37,858 of 172,656), i.e. 3 % of the articles from PMC OAS Non-Commercial Use Only (37,858 of 1,340,866).
- considering both MeSH terms from articles and journals: 67 % of the elements with MeSH terms (131,057 of 195,165), i.e. 10 % of the articles from PMC OAS Non-Commercial Use Only (131,057 of 1,340,866).

```python
import html
import json
import pathlib

from tqdm import tqdm

from bluesearch.database.topics import extract_pubmed_id_from_pmc_file

# Load data
pmc_dir = pathlib.Path("/raid/projects/bbs/pmc/non_comm_use/")

# Load journal dictionary (keys: journal title - values: meshes)
with open("/raid/projects/bbs/meshes/all_journals.json") as f:
    all_journals = json.load(f)

# Load article dictionary (keys: article pubmed id - values: meshes)
with open("/raid/projects/bbs/meshes/all_articles.json") as f:
    all_articles = json.load(f)

# Load interesting meshes
with open("/raid/projects/bbs/meshes/interesting_meshes.json") as f:
    int_meshes = json.load(f)
int_meshes = int_meshes["meshes"]

both = []
only_journal = []
only_article = []
nothing = []
interesting_articles = set()

for journal_dir in tqdm(pmc_dir.iterdir()):
    if not journal_dir.is_dir():
        continue
    nlm_ta = html.unescape(journal_dir.stem.replace("_", " "))
    for article_file in journal_dir.iterdir():
        if not article_file.is_file():
            continue
        pubmed_id = extract_pubmed_id_from_pmc_file(article_file)

        # Record whether the given paper has journal meshes, article meshes, or both
        if nlm_ta in all_journals and pubmed_id in all_articles:
            both.append(article_file)
        elif nlm_ta in all_journals:
            only_journal.append(article_file)
        elif pubmed_id in all_articles:
            only_article.append(article_file)
        else:
            nothing.append(article_file)

        # Check if (at least) one of the journal meshes is in the list of interesting meshes
        if nlm_ta in all_journals:
            for mesh in all_journals[nlm_ta]:
                if mesh in int_meshes:
                    interesting_articles.add(article_file)
                    break

        # Check if (at least) one of the article meshes is in the list of interesting meshes
        if pubmed_id in all_articles:
            for mesh in all_articles[pubmed_id]:
                if mesh in int_meshes:
                    interesting_articles.add(article_file)
                    break
```
Awesome results, thank you so much guys!
I think the approach proposed by @pafonta to find new topics of interest based on the name of a given author is really cool!
I think we can close this issue, as you have already covered everything in my opinion. I am just leaving a small point open for a quick discussion, as I am curious about your opinion.
I tried to see which topics come out with a different author (`Schürmann F`) and the following new topics were found:
Number | Name |
---|---|
A11.284.149 | Cell Membrane |
D12.776.543 | Membrane Proteins |
G02.111.150 | Brain Chemistry |
There's no need to re-run the code to check the coverage; I am just saying that with more authors we could potentially find even more topics and further increase our recall...

...so this idea made me think of the Snowball algorithm for relation extraction: could we, in principle, apply an iterative approach as follows?

1. Start from a seed list of authors `a = [a_0]` (e.g. `Markram H`).
2. Retrieve the topics `t = [t_1, ..., t_n]` associated to each `a_i` in `a`.
3. For each `t_i` in `t`, retrieve all authors `a` associated to `t_i` and go to 2.

Note: there's a risk of semantic drift at each iteration when reaching point 2, especially without manual topic curation; but one could also do that afterwards.
Using an iterative approach is a good idea.

One way to deal with semantic drift could be to just consider all the authors of all the articles published by Blue Brain. That would give a good baseline. The point to check is how many of these articles have MeSH terms. If the ratio is good, the method would give a good baseline without the need to expand to other authors.

By the way, there is a function in the Gist to rank MeSH terms according to their usage in the articles. It wasn't demonstrated above for simplicity, but it could definitely be used in the previous paragraph to make sense of all the returned MeSH terms.
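The ranking function from the Gist is not reproduced here; a minimal stand-in, assuming an article-id-to-MeSH-terms mapping like `all_articles` above, could look like this (`rank_meshes` is a hypothetical name):

```python
from collections import Counter


def rank_meshes(article_meshes):
    """Rank MeSH terms by the number of articles that use them.

    `article_meshes` maps article id -> list of MeSH terms
    (hypothetical stand-in for the ranking function in the Gist).
    """
    usage = Counter(mesh for meshes in article_meshes.values() for mesh in set(meshes))
    return usage.most_common()


ranked = rank_meshes({"pmid1": ["A08", "C10"], "pmid2": ["A08"]})
print(ranked)  # [('A08', 2), ('C10', 1)]
```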
Closing as the task is done.
Thanks to #432 we now have access to info on the topics of various articles and journals using MeSH.
The goals of this Issue are:
- [x] Collect stats on these topics of interest:
The results of this issue could be used in #439 as well, as pointed out by @EmilieDel.