Open readkev opened 8 years ago
When you search "mouse", the terminology server generates synonyms of the search query. In this case, it returns 26 synonyms of "mouse", and "MICE" is one of them.
In our search process, we search for the original query OR any of its synonyms. The "Search Detailed" box in the bottom-right corner of the search results page shows how the query is constructed.
"Minimally Invasive Control of Epistaxis (MICE)" was retrieved because the synonym "MICE" matches a word in the title.
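To make the behavior concrete, here is a minimal sketch of OR-style query expansion. It is not the DataMed implementation: `expand_query`, `matches`, and the two-item synonym list are hypothetical stand-ins for the terminology server, and only "MICE" is taken from the 26 synonyms mentioned above.

```python
import re

def expand_query(query, synonyms):
    """Return the original query followed by its synonyms, without case-insensitive duplicates."""
    seen = set()
    terms = []
    for term in [query, *synonyms]:
        key = term.lower()
        if key not in seen:
            seen.add(key)
            terms.append(term)
    return terms

def matches(title, terms):
    """True if ANY term occurs in the title as a whole word or phrase (case-insensitive OR)."""
    return any(
        re.search(r"\b" + re.escape(term) + r"\b", title, re.IGNORECASE)
        for term in terms
    )

# Hypothetical subset of the synonyms the terminology server might return for "mouse".
terms = expand_query("mouse", ["MICE", "Mus musculus"])

title = "Minimally Invasive Control of Epistaxis (MICE)"
print(matches(title, ["mouse"]))  # original query alone
print(matches(title, terms))      # original query OR synonyms
```

With the original query alone the title does not match; once the synonyms are OR-ed in, the acronym "MICE" in the title is enough to retrieve the record.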
@jgrethe Can you answer the second part of the question?
@naturalbeau thanks for the first part of the answer. The problem with that strategy is that if I'm searching for datasets about "mice", I shouldn't be retrieving articles about the cauterization of nosebleeds. This potentially leaves an opening for a large number of false positives, i.e., any use of the acronym MICE that does not pertain to "mouse" the animal. I assume the same would happen for other terms in the database that map to irrelevant results. Has there been any discussion of mapping lexical variants like this using something like the Unified Medical Language System, for example?
The organism information is captured as part of the metadata if it is available (e.g. PDB) or is set if it is constant across a source (e.g. ClinicalTrials.gov). In addition, if a taxonomy ID is included, it should also be pulled into the metadata record. We will have the curators double-check BioProject in this case.
We have used UMLS in the terminology server for query expansion, that is, to get the synonyms of the search query. We are planning to integrate the terminology server into the metadata ingestion pipeline.
Without much richer and more unified metadata (or considerable NLP efforts), it may only be possible to solve this by using advanced search to search on specific fields. I did the same search for "mouse" at ClinicalTrials.gov and got the same Minimally Invasive Control of Epistaxis (MICE) as the 7th search result.
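As a rough illustration of why a field-specific search helps, the sketch below compares matching the expanded terms against the title with matching them against an organism field. The record layout, the field names `title` and `organism`, the helper `field_matches`, and the sample values are all assumptions made for this example; they are not the actual DataMed or ClinicalTrials.gov schema.

```python
import re

def field_matches(record, field, terms):
    """True if any term occurs as a whole word or phrase in the given field of the record."""
    value = record.get(field, "")
    return any(
        re.search(r"\b" + re.escape(term) + r"\b", value, re.IGNORECASE)
        for term in terms
    )

# Made-up records; only the first title comes from the thread.
records = [
    {"title": "Minimally Invasive Control of Epistaxis (MICE)",
     "organism": "Homo sapiens"},
    {"title": "Liver transcriptome time course",
     "organism": "Mus musculus; mouse"},
]

terms = ["mouse", "MICE"]  # original query plus one synonym

by_title = [r["title"] for r in records if field_matches(r, "title", terms)]
by_organism = [r["title"] for r in records if field_matches(r, "organism", terms)]

print(by_title)
print(by_organism)
```

Searching the title returns only the epistaxis trial, while restricting the same terms to the organism field returns only the mouse dataset. This works only as far as the organism metadata is actually populated, which is the limitation described above.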
I'm hoping for some clarification of how metadata is being indexed. For example, when I search for "mouse", I receive over 800 results from ClinicalTrials.gov. However, several of the results are actually acronyms that have nothing to do with a mouse whatsoever. One result I received was "Minimally Invasive Control of Epistaxis (MICE)". This result has nothing to do with mouse species, so how are the lexical variants of "mouse" being mapped to one another?
I'm also curious how the DATS model that was created is being used on the back end to pull in this metadata from different repositories. For example, while DataMed pulls "Organism" metadata from BioProject, it excludes the taxonomy ID and all of the name variations. Is it possible to fix this to include all of the information?