BlueBrain / Search

Blue Brain text mining toolbox for semantic search and structured information extraction
https://blue-brain-search.readthedocs.io
GNU Lesser General Public License v3.0

Identify + Collect stats on topics of interest #464

Closed FrancescoCasalegno closed 2 years ago

FrancescoCasalegno commented 2 years ago

Thanks to #432, we now have access to information on the topics of various articles and journals via MeSH terms.

The goals of this Issue are:

EmilieDel commented 2 years ago

Preliminary results: Analysis of the MeSH tree

First, I downloaded an .xml file containing the descriptor list:

wget https://nlmpubs.nlm.nih.gov/projects/mesh/MESH_FILES/xmlmesh/desc2021.xml

Code snippet to find a term and all its children:

from collections import defaultdict

from defusedxml import ElementTree

# Tree number prefixes of interest, e.g. Nervous System Diseases [C10]
# and Neurosciences [H01.158.610].
PREFIXES = ("C10", "H01.158.610")

all_meshs = ElementTree.parse("desc2021.xml")
all_meshs_of_interest = defaultdict(list)

for d in all_meshs.findall("./DescriptorRecord"):
    name = None
    tree_numbers = []
    for element in d:
        if element.tag == "DescriptorName":
            # The descriptor name is stored in a single <String> child.
            name = element.findtext("String")
        elif element.tag == "TreeNumberList":
            # A descriptor can have several tree numbers.
            tree_numbers = [el.text for el in element]

    if name is not None:
        for number in tree_numbers:
            for prefix in PREFIXES:
                if number.startswith(prefix):
                    all_meshs_of_interest[prefix].append(name)

Results:

print(all_meshs_of_interest["H01.158.610"])
['Cognitive Neuroscience', 'Neuroanatomy', 'Neurobiology', 'Neurosciences']

Some numbers:

| Tree ID | Name | N. children |
| --- | --- | --- |
| A08 | Nervous System | 271 |
| C10 | Nervous System Diseases | 241 |
| F03 | Mental Disorders | 212 |
| D27.505.954.427 | Central Nervous System Agents | 44 |
| E04.525 | Neurosurgical Procedures | 35 |
| C16.320.400 | Heredodegenerative Disorders, Nervous System | 27 |
| E05.629 | Neuroimaging | 13 |
| H01.158.610 | Neurosciences | 4 |
| G03.185 | Brain Chemistry | 1 |
| E04.190 | Deep Brain Stimulation | 1 |
| E07.305.076 | Brain-Computer Interfaces | 1 |
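The child counts above can be derived from the dictionary built by the snippet in the previous comment; a minimal sketch, with `all_meshs_of_interest` stubbed using the `H01.158.610` result shown earlier:

```python
# Stub of the dictionary built above, using the H01.158.610 result.
all_meshs_of_interest = {
    "H01.158.610": ["Cognitive Neuroscience", "Neuroanatomy", "Neurobiology", "Neurosciences"],
}

# Number of descriptors collected per tree ID.
counts = {prefix: len(names) for prefix, names in all_meshs_of_interest.items()}
print(counts)  # {'H01.158.610': 4}
```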
EmilieDel commented 2 years ago

Here are the numbers of articles for which it is possible to identify topics:

| MeSH source | N. articles | % |
| --- | --- | --- |
| Both journal and article | 165'782 | 12.36 % |
| Only journal | 1'082'983 | 80.78 % |
| Only article | 6'922 | 0.6 % |
| Neither | 85'179 | 6.35 % |
| Total | 1'340'866 | 100 % |
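As a quick sanity check (a sketch with the counts hard-coded from the table above, not part of the pipeline), the four categories do sum to the stated total:

```python
# Counts copied from the table above.
both, only_journal, only_article, neither = 165_782, 1_082_983, 6_922, 85_179
total = both + only_journal + only_article + neither
print(total)  # 1340866
print(f"{100 * both / total:.2f} %")  # 12.36 %
```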

80'571 of those articles (around 6%) contain at least one of the MeSH terms identified above (through journal or article information).

FrancescoCasalegno commented 2 years ago

Reviews

pafonta commented 2 years ago

Hello,

The following MeSH terms could also be useful to identify articles of interest:

| Number | Name |
| --- | --- |
| A08 | Nervous System |
| A11.650 | Neuroglia |
| A11.671 | Neurons |
| E01.370.376 | Diagnostic Techniques, Neurological |
| E05.393.332 | Gene Expression Profiling |
| E05.599.395.642 | Models, Neurological |
| E05.629 | Neuroimaging |
| G01.358.500.249.277 | Electric Conductivity |
| G02.111.820.850 | Synaptic Transmission |
| G03.493 | Metabolic Networks and Pathways |
| G04.580 | Membrane Potentials |
| G04.835.850 | Synaptic Transmission |
| G05.308 | Gene Expression Regulation |
| G07.265 | Electrophysiological Phenomena |
| G11.561 | Nervous System Physiological Phenomena |
| H01.158.273.180 | Computational Biology |
| L01.313.124 | Computational Biology |

Methodology

The output has been curated from the following:

Import utility functions:

```python
# https://gist.github.com/pafonta/162c1b9ec0380e95a017297a707a4d66
from nlm_mesh import *
```

Parse the `MeSH` tree:

```python
# wget https://nlmpubs.nlm.nih.gov/projects/mesh/MESH_FILES/xmlmesh/desc2021.xml
mesh_tree: ElementTree = ElementTree.parse("desc2021.xml")
```

Find the articles from Prof. Markram:

```python
identifiers = articles_search("Markram H")  # 167 articles found
```

Retrieve the metadata of the articles:

```python
articles = articles_fetch(identifiers)
```

Find the `MeSH` names for the articles:

```python
mesh = articles_mesh(articles, only_major=True)
```

Find the corresponding `MeSH` tree numbers:

```python
names = set(mesh)
numbers = set(mesh_numbers(mesh_tree, names))
```

Consider a specific level of the `MeSH` tree. Find the corresponding `MeSH` names:

```python
level = 2  # In the MeSH tree, starting at 0.
limit = 3 + 4 * level
truncated = {x[:limit] for x in numbers}
mapping = dict(mesh_names(mesh_tree, truncated))
```

Display the selected `MeSH` tree numbers and names:

```python
for number in sorted(truncated):
    print(f"{number} | {mapping[number]}")
```
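The `limit = 3 + 4 * level` arithmetic works because a MeSH tree number is a 3-character root followed by one 4-character `.xxx` element per extra level. A small illustration (the helper name is mine, not from the Gist):

```python
def truncate_to_level(number: str, level: int) -> str:
    """Keep only the first `level + 1` elements of a MeSH tree number."""
    limit = 3 + 4 * level  # 3-char root + 4 chars (".xxx") per level
    return number[:limit]

print(truncate_to_level("E05.599.395.642", 0))  # E05
print(truncate_to_level("E05.599.395.642", 1))  # E05.599
print(truncate_to_level("E05.599.395.642", 2))  # E05.599.395
```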

Statistics

Interesting papers:

With their children, the above table amounts to `629` `MeSH` terms. That's `949` in the case of [here](https://github.com/BlueBrain/Search/issues/464#issuecomment-949373873). Put together, that's `1,194` unique `MeSH` terms. All these `MeSH` terms are considered. The `629` `MeSH` terms have been collected as described [here](https://gist.github.com/pafonta/d33a0d5d849932f8ceab8b711d995497#gistcomment-3965575).
EmilieDel commented 2 years ago

Code to obtain the number of interesting papers:

import html
import json
import pathlib

from tqdm import tqdm

from bluesearch.database.topics import extract_pubmed_id_from_pmc_file

# Load data
pmc_dir = pathlib.Path("/raid/projects/bbs/pmc/non_comm_use/")

# Load journal dictionary (keys: journal title - values: meshes)
with open("/raid/projects/bbs/meshes/all_journals.json") as f:
    all_journals = json.load(f)
# Load article dictionary (keys: article pubmed id - values: meshes)
with open("/raid/projects/bbs/meshes/all_articles.json") as f:
    all_articles = json.load(f)
# Load interesting meshes
with open("/raid/projects/bbs/meshes/interesting_meshes.json") as f:
    int_meshes = json.load(f)
int_meshes = int_meshes["meshes"]

both = []
only_journal = []
only_article = []
nothing = []
interesting_articles = set()

for journal_dir in tqdm(pmc_dir.iterdir()):
    if journal_dir.is_dir():
        nlm_ta = html.unescape(journal_dir.stem.replace("_", " "))
        for article_file in journal_dir.iterdir():
            if article_file.is_file():
                pubmed_id = extract_pubmed_id_from_pmc_file(article_file)

                # Save if given paper has journal meshes or article meshes
                if nlm_ta in all_journals and pubmed_id in all_articles:
                    both.append(article_file)
                elif nlm_ta in all_journals:
                    only_journal.append(article_file)
                elif pubmed_id in all_articles:
                    only_article.append(article_file)
                else:
                    nothing.append(article_file)

                # Check if (at least) one of the journal meshes is in the list of interesting meshes
                if nlm_ta in all_journals:
                    journal_meshes = all_journals[nlm_ta]
                    for mesh in journal_meshes:
                        if mesh in int_meshes:
                            interesting_articles.add(article_file)
                            break  # One match is enough.

                # Check if (at least) one of the article meshes is in the list of interesting meshes
                if pubmed_id in all_articles:
                    article_meshes = all_articles[pubmed_id]
                    for mesh in article_meshes:
                        if mesh in int_meshes:
                            interesting_articles.add(article_file)
                            break  # One match is enough.
FrancescoCasalegno commented 2 years ago

Awesome results, thank you so much guys!

I think the approach proposed by @pafonta to find new topics of interest based on the name of a given author is really cool!

I think we can close this issue, as you have already covered everything in my opinion. I am just leaving a few small points open for quick discussion, as I am curious about your opinions.


I tried to see which topics come out with a different author (Schürmann F) and the following new topics were found:

| Number | Name |
| --- | --- |
| A11.284.149 | Cell Membrane |
| D12.776.543 | Membrane Proteins |
| G02.111.150 | Brain Chemistry |

There's no need to re-run the code to check the coverage; I am just saying that with more authors we could potentially find even more topics and further increase our recall...

...so this idea made me think of the Snowball algorithm for relation extraction: could we, in principle, apply an iterative approach as follows?

  1. Start with a seed author list a = [a_0] (e.g. Markram H).
  2. Retrieve (and, if needed, filter) the topics t = [t_1, ..., t_n] associated with each a_i in a.
  3. For each t_i in t, retrieve all authors associated with t_i, add them to a, and go to 2.

Note: there's a risk of semantic drift at each iteration when reaching step 2, especially without manual topic curation; but one could also curate afterwards.
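A minimal sketch of this loop, assuming hypothetical `topics_for_author` / `authors_for_topic` lookups (backed here by a toy in-memory dataset rather than real PubMed queries):

```python
# Toy stand-in for PubMed: article -> (authors, topics).
ARTICLES = {
    "paper1": ({"Author A"}, {"Topic X"}),
    "paper2": ({"Author A", "Author B"}, {"Topic Y"}),
    "paper3": ({"Author B"}, {"Topic Z"}),
}

def topics_for_author(author):
    return {t for authors, topics in ARTICLES.values() if author in authors for t in topics}

def authors_for_topic(topic):
    return {a for authors, topics in ARTICLES.values() if topic in topics for a in authors}

def snowball(seed_authors, iterations=2):
    authors, topics = set(seed_authors), set()
    for _ in range(iterations):
        # Step 2: collect the topics of all current authors.
        for author in list(authors):
            topics |= topics_for_author(author)
        # Step 3: expand to all authors of those topics.
        for topic in list(topics):
            authors |= authors_for_topic(topic)
    return authors, topics

authors, topics = snowball(["Author A"])
print(sorted(topics))  # ['Topic X', 'Topic Y', 'Topic Z']
```

In practice, step 2 would also filter the retrieved topics against a curated whitelist at each iteration to limit semantic drift.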

pafonta commented 2 years ago

Using an iterative approach is a good idea.

One way to deal with semantic drift could be to consider all the authors of all the articles published by Blue Brain. That would give a good baseline. The point to check is how many of these articles have MeSH terms. If the ratio is good, the method would give a good baseline without the need to expand to other authors.

By the way, there is a function in the Gist to rank MeSH terms according to their usage in the articles. It wasn't demonstrated above for simplicity, but it could definitely be used for the approach in the previous paragraph to make sense of all the returned MeSH terms.
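For illustration only (the function name below is mine, not the Gist's), such a ranking boils down to counting term frequencies across the per-article MeSH lists:

```python
from collections import Counter

def rank_mesh(articles_mesh_lists):
    """Rank MeSH terms by the number of articles that use them."""
    counts = Counter(mesh for meshes in articles_mesh_lists for mesh in meshes)
    return counts.most_common()

example = [["Neurons", "Neuroimaging"], ["Neurons"], ["Neurons", "Neuroglia"]]
print(rank_mesh(example))  # [('Neurons', 3), ('Neuroimaging', 1), ('Neuroglia', 1)]
```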

pafonta commented 2 years ago

Closing as the task is done.