How is the metadata being indexed?

biocaddie / prototype_issues

Used to report and track bioCADDIE prototype issues


How is the metadata being indexed? #108

Open readkev opened 8 years ago

readkev commented 8 years ago

I'm hoping for some clarification of how metadata is being indexed. For example, when I search for "mouse", I receive over 800 results from ClinicalTrials.gov. However, several of the results are actually acronyms that have nothing to do with a mouse whatsoever. One result I received was "Minimally Invasive Control of Epistaxis (MICE)". This result has nothing to do with mouse species, so how are the lexical variants of "mouse" being mapped to one another?

I'm also curious how the DATS model that was created is being used on the back end to pull in this metadata from different repositories. For example, while DataMed pulls "Organism" metadata from BioProject, it excludes the taxonomy ID and all of the name variations. Is it possible to fix this to include all of the information?

naturalbeau commented 8 years ago

When you search for "mouse", the terminology server generates synonyms for the search query. In this case, it returns 26 synonyms of "mouse", and "MICE" is one of them. [Screenshot: DataMed expanded query page]

In our search process, we search for the original query OR any of its synonyms. The "Search Detailed" box in the bottom right corner of the search results page shows how the query is constructed.

The reason "Minimally Invasive Control of Epistaxis (MICE)" was retrieved is that "MICE" matches a word in the title.
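To make the behavior described above concrete, here is a minimal sketch of OR-style query expansion with case-insensitive word matching. It is not the DataMed implementation: the synonym table, the function names, and the tokenization are all assumptions made for illustration, and only the "mouse" to "mice" mapping comes from this thread.

```python
import re

# Hypothetical synonym table. The thread reports 26 synonyms for "mouse";
# only "mice" is confirmed there, the other entry is illustrative.
SYNONYMS = {"mouse": ["mice", "mus musculus"]}


def expand_query(query):
    """Return the lowercased query followed by its synonyms."""
    q = query.lower()
    return [q] + SYNONYMS.get(q, [])


def build_or_query(terms):
    """Join the terms into a single OR expression."""
    return " OR ".join(f'"{t}"' for t in terms)


def matches(title, terms):
    """True if any term occurs as a whole word or phrase in the title.

    Matching is case-insensitive, which is why an acronym such as
    "MICE" is indistinguishable from the plural of "mouse".
    """
    normalized = " ".join(re.findall(r"[a-z0-9]+", title.lower()))
    return any(re.search(rf"\b{re.escape(t)}\b", normalized) for t in terms)


terms = expand_query("mouse")
print(build_or_query(terms))
print(matches("Minimally Invasive Control of Epistaxis (MICE)", terms))
```

Under these assumptions the trial title matches because lowercasing turns the acronym into the same token as the synonym "mice".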

naturalbeau commented 8 years ago

@jgrethe Can you answer the second part of the question?

readkev commented 8 years ago

@naturalbeau thanks for the first part of the answer. The problem with that strategy is that if I'm searching for datasets about "mice", I shouldn't be retrieving articles about the cauterization of nosebleeds -- this potentially leaves an opening for a large number of false positives, i.e., any acronym spelled MICE that does not pertain to "mouse" the animal. I'm also assuming the same would happen for other terms in the database that map to irrelevant results. Has there been any discussion of mapping lexical variants like this using something like the Unified Medical Language System, for example?

jgrethe commented 8 years ago

The organism information is captured as part of the metadata if it is available (e.g. PDB) or is set if it is constant across a source (e.g. ClinicalTrials.gov). In addition, if a taxonomy ID is included, it should also be pulled into the metadata record. Will have the curators double check BioProject in this case.

naturalbeau commented 8 years ago

We have used UMLS in the terminology server to do the query expansion, which is to get the synonyms of the search query. We are planning to integrate the terminology server into the metadata ingestion pipeline.

tjohnson250 commented 8 years ago

Without much richer and more unified metadata (or considerable NLP efforts), it may only be possible to solve this by using advanced search to search on specific fields. I did the same search for "mouse" at ClinicalTrials.gov and got the same Minimally Invasive Control of Epistaxis (MICE) as the 7th search result.
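As a rough illustration of the field-restricted approach mentioned above, the sketch below compares searching every field with searching only an organism field. The records, field names, and helper functions are hypothetical and do not reflect the actual DataMed or ClinicalTrials.gov schemas; the organism value for the trial and the second record are assumptions made for the example.

```python
import re

# Hypothetical records. Only the first title comes from the thread;
# the organism values and the second record are assumed.
RECORDS = [
    {"title": "Minimally Invasive Control of Epistaxis (MICE)",
     "organism": "Homo sapiens"},
    {"title": "Liver transcriptome of aged animals",
     "organism": "Mus musculus"},
]

# Expanded query terms, assumed to come from a synonym lookup.
TERMS = ["mouse", "mice", "mus musculus"]


def field_matches(value, terms):
    """Case-insensitive whole-word match of any term against one value."""
    v = value.lower()
    return any(re.search(rf"\b{re.escape(t)}\b", v) for t in terms)


def search_all_fields(records, terms):
    """Return records where any field matches any term."""
    return [r for r in records
            if any(field_matches(v, terms) for v in r.values())]


def search_field(records, field, terms):
    """Return records where the named field matches any term."""
    return [r for r in records if field_matches(r.get(field, ""), terms)]


print([r["title"] for r in search_all_fields(RECORDS, TERMS)])
print([r["title"] for r in search_field(RECORDS, "organism", TERMS)])
```

In this toy data, the all-fields search returns both records, while restricting the search to the organism field drops the epistaxis trial. This only helps when the organism field is populated consistently, which is the metadata gap the comment points to.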