BlueBrain / Search

Blue Brain text mining toolbox for semantic search and structured information extraction
https://blue-brain-search.readthedocs.io
GNU Lesser General Public License v3.0

`bbs_database topic-extract`: extract parent topics for MeSH #557

Closed Stannislav closed 2 years ago

Stannislav commented 2 years ago

Summary

MeSH topics are organised in a tree-like ontology. In order to facilitate topic filtering we should not only extract the MeSH topic, but also all its parents in the MeSH ontology.

Details

Consider the following part of the MeSH ontology (ref)

Natural Science Disciplines [H01]
    Biological Science Disciplines [H01.158]
        Anatomy [H01.158.100]
        Biochemistry [H01.158.201]
        Biology [H01.158.273]
        Biophysics [H01.158.344]
        Biotechnology [H01.158.550]
        Chronobiology Discipline [H01.158.580]
        Geroscience [H01.158.595]
        Neurosciences [H01.158.610]
            Cognitive Neuroscience [H01.158.610.030] 

Currently, if a journal is tagged with the MeSH topic "Cognitive Neuroscience", we extract only that topic:

journal_topics = {
    "MeSH": ["Cognitive Neuroscience"],
}

We suggest extracting all the parent topics as well:

journal_topics = {
    "MeSH": [
        "Cognitive Neuroscience",
        "Neurosciences",
        "Biological Science Disciplines",
        "Natural Science Disciplines",
    ],
}
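A minimal sketch of how the parents could be resolved, assuming we have a mapping from MeSH tree number to topic label. The `TREE` dict (a hand-copied excerpt of the ontology above) and the helper name `with_parents` are made up for illustration and are not part of the current code base. Every prefix of a tree number is the tree number of an ancestor, so the parents can be found by repeatedly dropping the last segment:

```python
# Hypothetical lookup table: tree number -> topic label.
# Hand-copied from the ontology excerpt above, not loaded from real MeSH data.
TREE = {
    "H01": "Natural Science Disciplines",
    "H01.158": "Biological Science Disciplines",
    "H01.158.610": "Neurosciences",
    "H01.158.610.030": "Cognitive Neuroscience",
}


def with_parents(tree_number, tree=TREE):
    """Return the label of a node followed by the labels of all its ancestors."""
    parts = tree_number.split(".")
    # Walk the prefixes from longest to shortest:
    # H01.158.610.030, H01.158.610, H01.158, H01
    return [tree[".".join(parts[:i])] for i in range(len(parts), 0, -1)]


journal_topics = {"MeSH": with_parents("H01.158.610.030")}
print(journal_topics)
```

This reproduces the topic list suggested above, ordered from the most specific topic to the root of the tree.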

Benefits

Besides providing more comprehensive topic information, this makes topic filtering trivial. For example, the filtering rule TopicRule(pattern="Natural Science") (see #550) matches the journal topic list that contains the parents, but does not match the list without parent topics.
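To illustrate the difference, here is a rough stand-in for the filtering step. The function `matches_any` is hypothetical and only assumes that a rule boils down to a regular expression search over the topic labels; the actual `TopicRule` from #550 may behave differently.

```python
import re


def matches_any(pattern, topics):
    """Hypothetical stand-in for a topic rule: True if any label matches."""
    return any(re.search(pattern, topic) for topic in topics)


without_parents = ["Cognitive Neuroscience"]
with_parents = [
    "Cognitive Neuroscience",
    "Neurosciences",
    "Biological Science Disciplines",
    "Natural Science Disciplines",
]

print(matches_any("Natural Science", without_parents))
print(matches_any("Natural Science", with_parents))
```

Only the list that includes the parent topics contains a label matching the pattern.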

Stannislav commented 2 years ago

Potential issue: topic labels are not always unique nodes in the MeSH tree. Consider "Cognitive Neuroscience". Its MeSH record is here: https://meshb.nlm.nih.gov/record/ui?name=Cognitive%20Neuroscience

We see that this label corresponds to two nodes, with tree numbers F04.096.628.255.500 and H01.158.610.030.

Each of these nodes is part of a separate topic tower:

    Behavioral Disciplines and Activities [F04]
        Behavioral Sciences [F04.096]
            Psychology [F04.096.628]
                Cognitive Science [F04.096.628.255]
                    Cognitive Neuroscience [F04.096.628.255.500]

    Natural Science Disciplines [H01]
        Biological Science Disciplines [H01.158]
            Neurosciences [H01.158.610]
                Cognitive Neuroscience [H01.158.610.030]

Right now the only choice we have is to keep all of these topics because we can't distinguish between the two different "Cognitive Neuroscience" topics.

It seems to me that if a journal is tagged with "Cognitive Neuroscience" then probably only one of the two is meant. This could be uniquely identified by the corresponding tree number, in the current case F04.096.628.255.500 and H01.158.610.030.

Question: at topic extraction time from PMC/PubMed, do we have access to these tree numbers? If so, we could narrow down the parent topics to only one of the several possible topic towers.
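A sketch of both options, under the assumption that a tree number may or may not be available at extraction time. The names `TREE`, `tower` and `resolve` are hypothetical, and the table is hand-copied from the two towers above. Without a tree number, all towers whose leaf carries the label are merged; with one, only the matching tower is kept.

```python
# Hypothetical lookup table: tree number -> topic label (copied from the towers above).
TREE = {
    "F04": "Behavioral Disciplines and Activities",
    "F04.096": "Behavioral Sciences",
    "F04.096.628": "Psychology",
    "F04.096.628.255": "Cognitive Science",
    "F04.096.628.255.500": "Cognitive Neuroscience",
    "H01": "Natural Science Disciplines",
    "H01.158": "Biological Science Disciplines",
    "H01.158.610": "Neurosciences",
    "H01.158.610.030": "Cognitive Neuroscience",
}


def tower(tree_number):
    """Labels of a node and all its ancestors, most specific first."""
    parts = tree_number.split(".")
    return [TREE[".".join(parts[:i])] for i in range(len(parts), 0, -1)]


def resolve(label, tree_number=None):
    """Resolve a label to its topics, optionally narrowed by a tree number."""
    if tree_number is not None:
        candidates = [tree_number]
    else:
        # The label alone is ambiguous: collect every node that carries it.
        candidates = sorted(n for n, name in TREE.items() if name == label)
    topics = []
    for number in candidates:
        for topic in tower(number):
            if topic not in topics:  # drop duplicates, keep order
                topics.append(topic)
    return topics


print(resolve("Cognitive Neuroscience"))
print(resolve("Cognitive Neuroscience", tree_number="H01.158.610.030"))
```

The first call returns the union of both towers, the second only the topics from the H01 tower.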

Stannislav commented 2 years ago

Some online resources:

Stannislav commented 2 years ago

The MeSH RDF website lists different ways of accessing data including different APIs and local downloads.

Stannislav commented 2 years ago

Since we have to resolve parent topics repeatedly, it seems more appropriate to download the data and do the resolution locally.

This will automatically ensure we're always working against the same MeSH version, which can be upgraded in a controlled manner if necessary.
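A possible starting point for the local resolution, assuming we use the MeSH descriptor XML download, where each `DescriptorRecord` carries its name under `DescriptorName/String` and its positions in the tree under `TreeNumberList/TreeNumber`. The `SAMPLE` string below is hand-written in that layout rather than copied from a real file, and `load_tree` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the assumed descriptor XML layout (not real MeSH data).
SAMPLE = """<DescriptorRecordSet>
  <DescriptorRecord>
    <DescriptorName><String>Cognitive Neuroscience</String></DescriptorName>
    <TreeNumberList>
      <TreeNumber>F04.096.628.255.500</TreeNumber>
      <TreeNumber>H01.158.610.030</TreeNumber>
    </TreeNumberList>
  </DescriptorRecord>
  <DescriptorRecord>
    <DescriptorName><String>Neurosciences</String></DescriptorName>
    <TreeNumberList>
      <TreeNumber>H01.158.610</TreeNumber>
    </TreeNumberList>
  </DescriptorRecord>
</DescriptorRecordSet>"""


def load_tree(xml_text):
    """Build a tree number -> label mapping from descriptor XML text."""
    root = ET.fromstring(xml_text)
    tree = {}
    for record in root.iter("DescriptorRecord"):
        label = record.findtext("DescriptorName/String")
        # One descriptor can sit at several places in the tree.
        for node in record.iter("TreeNumber"):
            tree[node.text] = label
    return tree


tree = load_tree(SAMPLE)
print(tree)
```

The real download is large, so for the full file a streaming parser such as `ET.iterparse` would probably be preferable to loading everything with `ET.fromstring`. The resulting mapping could be built once and cached alongside a pinned MeSH version.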