CBIIT / nci-ctd2-dashboard

NCI CTD^2 Dashboard
http://ctd2-dashboard.nci.nih.gov/

Search issues for disease synonyms. #188

Closed hermidalc closed 4 years ago

hermidalc commented 8 years ago

Originally reported by: donmon

A little view of my experience as a user, which doesn't seem optimal. I'm sorry that I don't know how to frame this more constructively, and it is possible that this function is still being updated.

Background: I am working on a story from FHCRC-M in which data comes from ovarian serous adenocarcinoma, which is often referred to imprecisely as ovarian cancer. Some of these results are posted as observations (not enough, but that's a separate issue), where the disease is listed as ovarian serous adenocarcinoma.

When I search for "ovarian cancer" on the dashboard, the top section, Search, shows 32 possible diseases (broken by default into four pages). NONE of them is ovarian serous adenocarcinoma. In fact none of them has any associated observations. So for this search the top section is not useful and is in fact misleading because it misses a synonym that has associated observations.

Also, the first two search results are for centers, which doesn't seem particularly helpful, and they do not include FHCRC.

If I instead look at the lower section, there are ten observations listed. However, as far as I can tell only three of them really deal specifically with ovary cancers:

The last one was the one I was looking for, but I would probably have given up before finding it if I did not know it was there.

I do not know how this is all implemented, but I would guess that, in contrast to the genes, the indexing of the disease/tissues/context is so inconsistent that search results are very haphazard.

hermidalc commented 8 years ago

Original comment by: @kcs3

This issue is now just background information for #265.

hermidalc commented 8 years ago

Original comment by: @vdancik

The hierarchy of diseases (actually all tissue samples from the thesaurus) is already present in the background-data load file. Each term has a parent, or few parents in some cases. I will check if that information is loaded in the database and create a related ticket (#265) to display it on the tissue_sample page.
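To make the parent structure concrete, here is a minimal sketch of collecting all ancestors of a term when some terms have more than one parent. The `PARENTS` mapping and the term names in it are invented for illustration; they are not taken from the background-data load file or the Dashboard schema.

```python
from collections import deque

# Hypothetical child -> parents mapping (assumption, not the real thesaurus data).
PARENTS = {
    "ovarian serous adenocarcinoma": ["ovarian adenocarcinoma", "serous adenocarcinoma"],
    "ovarian adenocarcinoma": ["ovarian carcinoma"],
    "serous adenocarcinoma": ["adenocarcinoma"],
    "ovarian carcinoma": ["ovary"],
    "adenocarcinoma": [],
    "ovary": [],
}

def ancestors(term, parents=PARENTS):
    """Return every ancestor of `term`, nearest first, each listed once."""
    seen = []
    queue = deque(parents.get(term, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # a term reachable through two parents is only reported once
        seen.append(current)
        queue.extend(parents.get(current, []))
    return seen

print(ancestors("ovarian serous adenocarcinoma"))
```

A list like this is roughly what a tissue_sample page would need in order to show where a term sits in the hierarchy.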

hermidalc commented 8 years ago

Original comment by: donmon

I agree that a truly useful disease search function would have to recognize the hierarchical structure. As it stands there is a lot of opportunity for both false negatives and false positives relative to the searcher's intent. Paul makes sense in saying that it is both important and not necessarily easy.

First question I guess is whether there is a curated list that has the appropriate hierarchical structure.

Incidentally I'm just wrapping up a story that frames the diseases in terms of the TCGA categories, and I've found it frustrating trying to line those up with the NCI disease names on the site. I think I'm going to have to link to both.

hermidalc commented 8 years ago

Original comment by: @paulclemons

The issue here is about ontology semantics.

It is an important one, but possibly not super easy to fix.

Right now, we have the 'diseases' syntactically as a flat list of exact-match terms. However, semantically, they are really a hierarchy.

For example, all 'ovarian serous adenocarcinoma' are 'ovary', but not all 'ovary' are 'ovarian serous adenocarcinoma'.
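A small sketch of the difference between the two readings, assuming a hypothetical single-parent `PARENT` table (the intermediate term and the function names are made up for this example and do not reflect the Dashboard code):

```python
# Hypothetical child -> parent links (assumption for illustration only).
PARENT = {
    "ovarian serous adenocarcinoma": "ovarian carcinoma",
    "ovarian carcinoma": "ovary",
    "ovary": None,
    "lung": None,
}

def is_a(term, ancestor, parent=PARENT):
    """True if `term` equals `ancestor` or sits anywhere below it."""
    current = term
    while current is not None:
        if current == ancestor:
            return True
        current = parent.get(current)
    return False

def flat_search(query, terms=PARENT):
    """Current behavior as described: exact match against a flat list."""
    return [t for t in terms if t == query]

def hierarchical_search(query, terms=PARENT):
    """Hierarchy-aware alternative: the query term plus everything under it."""
    return [t for t in terms if is_a(t, query)]

print(flat_search("ovary"))
print(hierarchical_search("ovary"))
```

The asymmetry in the example above shows up directly: `is_a("ovarian serous adenocarcinoma", "ovary")` holds, while `is_a("ovary", "ovarian serous adenocarcinoma")` does not.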

This is the type of modeling issue we should take quite seriously, but not rush to resolve imperfectly. As we move forward, we should try to bring this kind of more-sophisticated semantics into the Dashboard model in some way. Lineage will not be the only case where this comes up.

hermidalc commented 8 years ago

Original comment by: @vdancik

I am observing strange behavior with the search engine:

  1. When searching for "ovarian serous adenocarcinoma", I get 49 entries and "ovarian serous adenocarcinoma" is second with 51 observations.
  2. When searching only for "ovarian", I also get 49 entries, but "ovarian serous adenocarcinoma" is not among the search results!

I don't think this is correct or expected.

hermidalc commented 8 years ago

Original comment by: @armish

This is a good catch and a really helpful comment on usability. Here is our problem:

If we relax the search terms too much (meaning that we match any of the terms within the title), then we tend to get too many irrelevant results. This is partly why we disabled the search over synonyms at this phase.

We have also now started sorting by the importance of a subject, which should help bring results to the top if there are observations for them, e.g.: http://cbio.mskcc.org/ctd2-dashboard/#search/ovary
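The trade-off can be sketched as follows. This is only an illustration, not the actual search code: the `SUBJECTS` list is invented (apart from the 51 observations reported for ovarian serous adenocarcinoma in this thread), and "importance" is reduced here to the observation count.

```python
# Hypothetical (name, observation count) pairs; only the 51 comes from this thread.
SUBJECTS = [
    ("ovarian carcinoma", 0),
    ("ovarian serous adenocarcinoma", 51),
    ("breast carcinoma", 0),
    ("serous cystadenoma", 0),
]

def search(query, subjects=SUBJECTS):
    """Relaxed match: keep a subject if ANY query word appears in its name,
    then order the hits by observation count, highest first."""
    words = query.lower().split()
    hits = [
        (name, count)
        for name, count in subjects
        if any(word in name.lower().split() for word in words)
    ]
    # sorted() is stable, so subjects with equal counts keep their original order.
    return [name for name, _ in sorted(hits, key=lambda hit: -hit[1])]

print(search("ovarian serous"))
```

Matching on any word pulls in loosely related subjects such as "serous cystadenoma", which is the noise problem; sorting by count at least puts the subject that has observations first.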

This is something we should think more about, and it might involve focused development in the long term to make the search smarter than it used to be. I am keeping this open; let's keep the conversation going, as this is quite important.

Thanks.

kcs3 commented 4 years ago

We have completely overhauled search since this issue was entered. Closing because most of the problems have been dealt with.