BlueBrain / Search

Blue Brain text mining toolbox for semantic search and structured information extraction
https://blue-brain-search.readthedocs.io
GNU Lesser General Public License v3.0

Identify + Collect stats on topics of interest #464

Closed FrancescoCasalegno closed 2 years ago

FrancescoCasalegno commented 2 years ago

Thanks to #432, we now have access to information on the topics of various articles and journals via MeSH terms.

The goals of this Issue are:

EmilieDel commented 2 years ago

Preliminary results: Analysis of the MeSH tree

First, I downloaded an .xml file containing the descriptor list:

wget https://nlmpubs.nlm.nih.gov/projects/mesh/MESH_FILES/xmlmesh/desc2021.xml

Code snippet to find a term and all its children:

from collections import defaultdict

from defusedxml import ElementTree

# Tree number prefixes of interest, e.g. Nervous System Diseases [C10]
# and Neurosciences [H01.158.610].
PREFIXES = ("C10", "H01.158.610")

all_meshs = ElementTree.parse("desc2021.xml")
all_meshs_of_interest = defaultdict(list)

for d in all_meshs.findall("./DescriptorRecord"):
    name = None
    tree_numbers = []
    for element in d:
        if element.tag == "DescriptorName":
            # The descriptor name is stored in a single <String> child.
            name = element.findtext("String")
        elif element.tag == "TreeNumberList":
            # A descriptor can have several tree numbers.
            tree_numbers = [el.text for el in element]

    if name is not None:
        for number in tree_numbers:
            for prefix in PREFIXES:
                if number.startswith(prefix):
                    all_meshs_of_interest[prefix].append(name)

Results:

print(all_meshs_of_interest["H01.158.610"])
['Cognitive Neuroscience', 'Neuroanatomy', 'Neurobiology', 'Neurosciences']

Some numbers:

| Tree ID | Name | N. children |
| --- | --- | --- |
| A08 | Nervous System | 271 |
| C10 | Nervous System Diseases | 241 |
| F03 | Mental Disorders | 212 |
| D27.505.954.427 | Central Nervous System Agents | 44 |
| E04.525 | Neurosurgical Procedures | 35 |
| C16.320.400 | Heredodegenerative Disorders, Nervous System | 27 |
| E05.629 | Neuroimaging | 13 |
| H01.158.610 | Neurosciences | 4 |
| G03.185 | Brain Chemistry | 1 |
| E04.190 | Deep Brain Stimulation | 1 |
| E07.305.076 | Brain-Computer Interfaces | 1 |
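The child counts above can be derived from the dictionary built by the snippet in the previous comment; a minimal sketch, with `all_meshs_of_interest` stubbed using the `H01.158.610` result shown earlier:

```python
# Stub of the dictionary built above, using the H01.158.610 result.
all_meshs_of_interest = {
    "H01.158.610": ["Cognitive Neuroscience", "Neuroanatomy", "Neurobiology", "Neurosciences"],
}

# Number of descriptors collected per tree ID.
counts = {prefix: len(names) for prefix, names in all_meshs_of_interest.items()}
print(counts)  # {'H01.158.610': 4}
```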
EmilieDel commented 2 years ago

Here are the numbers of articles for which it is possible to identify topics:

| MeSH source | N. articles | % |
| --- | --- | --- |
| Both journal and article | 165'782 | 12.36 % |
| Only journal | 1'082'983 | 80.78 % |
| Only article | 6'922 | 0.6 % |
| Neither | 85'179 | 6.35 % |
| Total | 1'340'866 | 100 % |
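As a quick sanity check (a sketch with the counts hard-coded from the table above, not part of the pipeline), the four categories do sum to the stated total:

```python
# Counts copied from the table above.
both, only_journal, only_article, neither = 165_782, 1_082_983, 6_922, 85_179
total = both + only_journal + only_article + neither
print(total)  # 1340866
print(f"{100 * both / total:.2f} %")  # 12.36 %
```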

80'571 of those articles (around 6%) contain at least one of the MeSH terms identified above (through journal or article information).

FrancescoCasalegno commented 2 years ago

Reviews

pafonta commented 2 years ago

Hello,

The following MeSH terms could also be useful to identify articles of interest:

| Number | Name |
| --- | --- |
| A08 | Nervous System |
| A11.650 | Neuroglia |
| A11.671 | Neurons |
| E01.370.376 | Diagnostic Techniques, Neurological |
| E05.393.332 | Gene Expression Profiling |
| E05.599.395.642 | Models, Neurological |
| E05.629 | Neuroimaging |
| G01.358.500.249.277 | Electric Conductivity |
| G02.111.820.850 | Synaptic Transmission |
| G03.493 | Metabolic Networks and Pathways |
| G04.580 | Membrane Potentials |
| G04.835.850 | Synaptic Transmission |
| G05.308 | Gene Expression Regulation |
| G07.265 | Electrophysiological Phenomena |
| G11.561 | Nervous System Physiological Phenomena |
| H01.158.273.180 | Computational Biology |
| L01.313.124 | Computational Biology |

Methodology

The output has been curated from the following:

Import utility functions:

```python
# https://gist.github.com/pafonta/162c1b9ec0380e95a017297a707a4d66
from nlm_mesh import *
```

Parse the `MeSH` tree:

```python
# wget https://nlmpubs.nlm.nih.gov/projects/mesh/MESH_FILES/xmlmesh/desc2021.xml
mesh_tree: ElementTree = ElementTree.parse("desc2021.xml")
```

Find the articles from Prof. Markram:

```python
identifiers = articles_search("Markram H")  # 167 articles found
```

Retrieve the metadata of the articles:

```python
articles = articles_fetch(identifiers)
```

Find the `MeSH` names for the articles:

```python
mesh = articles_mesh(articles, only_major=True)
```

Find the corresponding `MeSH` tree numbers:

```python
names = set(mesh)
numbers = set(mesh_numbers(mesh_tree, names))
```

Consider a specific level of the `MeSH` tree. Find the corresponding `MeSH` names:

```python
level = 2  # In the MeSH tree, starting at 0.
limit = 3 + 4 * level
truncated = {x[:limit] for x in numbers}
mapping = dict(mesh_names(mesh_tree, truncated))
```

Display the selected `MeSH` tree numbers and names:

```python
for number in sorted(truncated):
    print(f"{number} | {mapping[number]}")
```
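The `limit = 3 + 4 * level` arithmetic works because a MeSH tree number is a 3-character root followed by one 4-character `.xxx` element per extra level. A small illustration (the helper name is mine, not from the Gist):

```python
def truncate_to_level(number: str, level: int) -> str:
    """Keep only the first `level + 1` elements of a MeSH tree number."""
    limit = 3 + 4 * level  # 3-char root + 4 chars (".xxx") per level
    return number[:limit]

print(truncate_to_level("E05.599.395.642", 0))  # E05
print(truncate_to_level("E05.599.395.642", 1))  # E05.599
print(truncate_to_level("E05.599.395.642", 2))  # E05.599.395
```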

Statistics

Interesting papers:

With their children, the above table amounts to `629` `MeSH` terms. That's `949` in the case of [here](https://github.com/BlueBrain/Search/issues/464#issuecomment-949373873). Put together, that's `1,194` unique `MeSH` terms. All these `MeSH` terms are considered. The `629` `MeSH` terms have been collected as described [here](https://gist.github.com/pafonta/d33a0d5d849932f8ceab8b711d995497#gistcomment-3965575).
EmilieDel commented 2 years ago

Code to obtain the number of interesting papers:

import html
import json
import pathlib

from tqdm import tqdm

from bluesearch.database.topics import extract_pubmed_id_from_pmc_file

# Load data
pmc_dir = pathlib.Path("/raid/projects/bbs/pmc/non_comm_use/")

# Load journal dictionary (keys: journal title - values: meshes)
with open("/raid/projects/bbs/meshes/all_journals.json") as f:
    all_journals = json.load(f)
# Load article dictionary (keys: article pubmed id - values: meshes)
with open("/raid/projects/bbs/meshes/all_articles.json") as f:
    all_articles = json.load(f)
# Load interesting meshes
with open("/raid/projects/bbs/meshes/interesting_meshes.json") as f:
    int_meshes = json.load(f)
int_meshes = int_meshes["meshes"]

both = []
only_journal = []
only_article = []
nothing = []
interesting_articles = set()

for journal_dir in tqdm(pmc_dir.iterdir()):
    if journal_dir.is_dir():
        nlm_ta = html.unescape(journal_dir.stem.replace("_", " "))
        for article_file in journal_dir.iterdir():
            if article_file.is_file():
                pubmed_id = extract_pubmed_id_from_pmc_file(article_file)

                # Save if given paper has journal meshes or article meshes
                if nlm_ta in all_journals and pubmed_id in all_articles:
                    both.append(article_file)
                elif nlm_ta in all_journals:
                    only_journal.append(article_file)
                elif pubmed_id in all_articles:
                    only_article.append(article_file)
                else:
                    nothing.append(article_file)

                # Check if (at least) one of the journal meshes is in the list of interesting meshes
                if nlm_ta in all_journals:
                    journal_meshes = all_journals[nlm_ta]
                    for mesh in journal_meshes:
                        if mesh in int_meshes:
                            interesting_articles.add(article_file)
                            break  # One match is enough.

                # Check if (at least) one of the article meshes is in the list of interesting meshes
                if pubmed_id in all_articles:
                    article_meshes = all_articles[pubmed_id]
                    for mesh in article_meshes:
                        if mesh in int_meshes:
                            interesting_articles.add(article_file)
                            break  # One match is enough.
FrancescoCasalegno commented 2 years ago

Awesome results, thank you so much guys!

I think the approach proposed by @pafonta to find new topics of interest based on the name of a given author is really cool!

I think we can close this issue, as you have already covered everything in my opinion. I am just leaving a few small points open for quick discussion, as I am curious about your opinions.


I tried to see which topics come out with a different author (Schürmann F) and the following new topics were found:

| Number | Name |
| --- | --- |
| A11.284.149 | Cell Membrane |
| D12.776.543 | Membrane Proteins |
| G02.111.150 | Brain Chemistry |

There's no need to re-run the code to check the coverage; I am just saying that with more authors we could potentially find even more topics and further increase our recall...

...so this idea made me think of the Snowball algorithm for relation extraction: could we, in principle, apply an iterative approach as follows?

  1. Start with a seed author list a = [a_0] (e.g. Markram H).
  2. Retrieve (and, if needed, filter) the topics t = [t_1, ..., t_n] associated with each a_i in a.
  3. For each t_i in t, retrieve all authors associated with t_i, add them to a, and go to 2.

Note: there's a risk of semantic drift at each iteration when reaching step 2, especially without manual topic curation; but one could also curate afterwards.
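A minimal sketch of this loop, assuming hypothetical `topics_for_author` / `authors_for_topic` lookups (backed here by a toy in-memory dataset rather than real PubMed queries):

```python
# Toy stand-in for PubMed: article -> (authors, topics).
ARTICLES = {
    "paper1": ({"Author A"}, {"Topic X"}),
    "paper2": ({"Author A", "Author B"}, {"Topic Y"}),
    "paper3": ({"Author B"}, {"Topic Z"}),
}

def topics_for_author(author):
    return {t for authors, topics in ARTICLES.values() if author in authors for t in topics}

def authors_for_topic(topic):
    return {a for authors, topics in ARTICLES.values() if topic in topics for a in authors}

def snowball(seed_authors, iterations=2):
    authors, topics = set(seed_authors), set()
    for _ in range(iterations):
        # Step 2: collect the topics of all current authors.
        for author in list(authors):
            topics |= topics_for_author(author)
        # Step 3: expand to all authors of those topics.
        for topic in list(topics):
            authors |= authors_for_topic(topic)
    return authors, topics

authors, topics = snowball(["Author A"])
print(sorted(topics))  # ['Topic X', 'Topic Y', 'Topic Z']
```

In practice, step 2 would also filter the retrieved topics against a curated whitelist at each iteration to limit semantic drift.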

pafonta commented 2 years ago

Using an iterative approach is a good idea.

One way to deal with semantic drift could be to consider all the authors of all the articles published by Blue Brain. That would give a good baseline. The point to check is how many of these articles have MeSH terms. If the ratio is good, the method would give a good baseline without the need to expand to other authors.

By the way, there is a function in the Gist to rank MeSH terms according to their usage in the articles. It wasn't demonstrated above for simplicity, but it could definitely be used for the approach in the previous paragraph to make sense of all the returned MeSH terms.
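For illustration only (the function name below is mine, not the Gist's), such a ranking boils down to counting term frequencies across the per-article MeSH lists:

```python
from collections import Counter

def rank_mesh(articles_mesh_lists):
    """Rank MeSH terms by the number of articles that use them."""
    counts = Counter(mesh for meshes in articles_mesh_lists for mesh in meshes)
    return counts.most_common()

example = [["Neurons", "Neuroimaging"], ["Neurons"], ["Neurons", "Neuroglia"]]
print(rank_mesh(example))  # [('Neurons', 3), ('Neuroimaging', 1), ('Neuroglia', 1)]
```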

pafonta commented 2 years ago

Closing as the task is done.