Closed Stannislav closed 2 years ago
Potential issue: topic labels are not always unique nodes in the MeSH tree. Consider "Cognitive Neuroscience". Its MeSH record is here: https://meshb.nlm.nih.gov/record/ui?name=Cognitive%20Neuroscience
We see that this label corresponds to two nodes:
Cognitive Neuroscience [F04.096.628.255.500]
Cognitive Neuroscience [H01.158.610.030]
Each of these nodes is part of a separate topic tower:
Behavioral Disciplines and Activities [F04]
  Behavioral Sciences [F04.096]
    Psychology [F04.096.628]
      Cognitive Science [F04.096.628.255]
        Cognitive Neuroscience [F04.096.628.255.500]

Natural Science Disciplines [H01]
  Biological Science Disciplines [H01.158]
    Neurosciences [H01.158.610]
      Cognitive Neuroscience [H01.158.610.030]
Right now the only choice we have is to keep all of these topics because we can't distinguish between the two different "Cognitive Neuroscience" topics.
It seems to me that if a journal is tagged with "Cognitive Neuroscience" then probably only one of the two is meant. This could be uniquely identified by the corresponding tree number, in the current case F04.096.628.255.500 and H01.158.610.030.
Question: at topic extraction time from PMC/PubMed, do we have access to these tree numbers? If yes, then we could narrow down the parent topics to only one of the several possible topic towers.
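To illustrate why the tree number would be enough: each tree number already encodes its whole tower, since every dot-separated prefix is the tree number of a parent. A minimal sketch, assuming a lookup table from tree number to label is available (the `LABELS` dict below is copied by hand from the two towers listed above, and the function names are made up for this example):

```python
from __future__ import annotations

# Tree numbers and labels copied from the two towers in this issue.
LABELS = {
    "F04": "Behavioral Disciplines and Activities",
    "F04.096": "Behavioral Sciences",
    "F04.096.628": "Psychology",
    "F04.096.628.255": "Cognitive Science",
    "F04.096.628.255.500": "Cognitive Neuroscience",
    "H01": "Natural Science Disciplines",
    "H01.158": "Biological Science Disciplines",
    "H01.158.610": "Neurosciences",
    "H01.158.610.030": "Cognitive Neuroscience",
}


def tower(tree_number: str) -> list[str]:
    """Return the tree numbers from the root down to `tree_number`."""
    parts = tree_number.split(".")
    return [".".join(parts[: i + 1]) for i in range(len(parts))]


def tower_labels(tree_number: str) -> list[str]:
    """Return the labels of the whole tower, root first."""
    return [LABELS[number] for number in tower(tree_number)]


if __name__ == "__main__":
    print(tower_labels("H01.158.610.030"))
```

With only the label available, both towers would have to be kept. With the tree number, a single call yields exactly one tower.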
Some online resources:
The MeSH RDF website lists different ways of accessing data including different APIs and local downloads.
Since we have to resolve parent topics repeatedly, it seems more appropriate to download the data and do the resolution locally.
This will automatically ensure we're always working against the same MeSH version, which can be upgraded in a controlled manner if necessary.
The downloaded file can be read with the gzip module. There are RDF parsing libraries such as RDFLib and lightrdf, but given that we're only interested in a small subset of data that is easy to parse, it might be more appropriate to parse by hand.
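A rough sketch of what parsing by hand could look like, assuming the download is a gzipped N-Triples file with one triple per line. The two predicate IRIs, the file layout, and the function names are assumptions made for this example and should be checked against the actual MeSH RDF download:

```python
import gzip
import re
from collections import defaultdict

# Assumed predicate IRIs; verify against the downloaded file.
TREE_NUMBER = "<http://id.nlm.nih.gov/mesh/vocab#treeNumber>"
LABEL = "<http://www.w3.org/2000/01/rdf-schema#label>"

# One triple per line: <subject> <predicate> object .
TRIPLE = re.compile(r"^(<[^>]*>)\s+(<[^>]*>)\s+(.*?)\s*\.\s*$")


def parse_mesh(lines):
    """Collect English labels and tree numbers, keyed by subject IRI."""
    labels = {}
    tree_numbers = defaultdict(set)
    for line in lines:
        match = TRIPLE.match(line)
        if match is None:
            continue  # blank line, comment, or a form we don't handle
        subject, predicate, obj = match.groups()
        if predicate == LABEL and obj.endswith("@en"):
            # Object looks like "Cognitive Neuroscience"@en.
            # Escape sequences inside the literal are left untouched.
            labels[subject] = obj[: -len("@en")].strip('"')
        elif predicate == TREE_NUMBER:
            # Object looks like <http://id.nlm.nih.gov/mesh/H01.158.610.030>
            tree_numbers[subject].add(obj.strip("<>").rsplit("/", 1)[-1])
    return labels, tree_numbers


def load_mesh(path):
    """Read a gzipped N-Triples file and parse it line by line."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return parse_mesh(handle)
```

Since `parse_mesh` accepts any iterable of lines, it can be tried on a few hand-written triples without downloading the full file. A label that has more than one entry in `tree_numbers` is exactly the ambiguous case described above.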
Summary
MeSH topics are organised in a tree-like ontology. In order to facilitate topic filtering we should not only extract the MeSH topic, but also all its parents in the MeSH ontology.
Details
Consider the following part of the MeSH ontology (ref)
Currently, if a journal is tagged with the MeSH topic Cognitive Neuroscience, we extract only that topic. In this issue we'd like to suggest that instead one should extract all the parent topics as well.
Benefits
Apart from more comprehensive topic information, topic filtering becomes trivial. E.g., given the filtering rule TopicRule(pattern="Natural Science") (see #550), we will match the journal topic list that contains the parents, while we won't match the list without parent topics.
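The difference can be sketched with a small stand-in for the rule. The real TopicRule from #550 is not reproduced here. The hypothetical `matches_any` helper below simply assumes the pattern is a regular expression searched against each topic, and the parent list is taken from the H01 tower:

```python
from __future__ import annotations

import re


def matches_any(pattern: str, topics: list[str]) -> bool:
    """Hypothetical stand-in for a topic rule: True if any topic matches."""
    return any(re.search(pattern, topic) for topic in topics)


# What is extracted today, and what would be extracted with parents.
without_parents = ["Cognitive Neuroscience"]
with_parents = [
    "Natural Science Disciplines",
    "Biological Science Disciplines",
    "Neurosciences",
    "Cognitive Neuroscience",
]

if __name__ == "__main__":
    print(matches_any("Natural Science", without_parents))
    print(matches_any("Natural Science", with_parents))
```

Only the list that includes the parents contains a topic with "Natural Science" in it, so only that list is matched.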