cancerDHC / ccdh-terminology-service

CCDH Terminology and Mapping Service
3 stars 4 forks source link

Remove duplicates in value_only mode #27

Closed gaurav closed 3 years ago

gaurav commented 3 years ago

Original Post

The Terminology Service currently returns duplicate permissible values in value-only mode. For example, looking up the enumerations for CRDC-H.Treatment.treatment_effect returns two values for "Unknown", one of which has a description. This duplicate doesn't appear if value_only is set to true. So this problem might go away once we move over to concept identifiers, but I wanted to report it now so we remember to check it later.

Edit from Joe, 2021/09/22

Example yaml

Here is the result from the link that Gaurav posted:

name: CRDC-H.Treatment.treatment_effect
description: Autogenerated Enumeration for CRDC-H Treatment treatment_effect
permissible_values:
- text: 'Yes'
- text: Not Reported  # <---1
- text: Unknown  # <---2
- text: 'No'
- text: Unknown  # <---2
  description: Unknown
- text: Incomplete Necrosis (Viable Tumor Present)
- text: No Necrosis
- text: No Known Treatment Effect
- text: Complete Necrosis (No Viable Tumor)
- text: Not Reported  # <---1
  description: Not Reported

Related issues

https://github.com/linkml/linkml/issues/1068

Tasks

joeflack4 commented 3 years ago

@jiaola I imagine this is an easy change if this is in fact a bug as Gaurav describes. Let me know if you want me to fix this, or we can wait until he concept identifiers change.

gaurav commented 3 years ago

Any updates on this? We'll probably be generating a new CRDC-H model this week or next, so it'd be good to have this fixed if possible before then.

joeflack4 commented 3 years ago

@gaurav Just letting you know I'm still working on this, but this is a little hairier than expected. Under the hood, de-duping the permissible values are more complicated and contain more information than I realized (that doesn't show in the API results). Requires some logic that I did not anticipate.

Dazhi, let me know if you have any insight as to how best to dedupe these. I had originally thought that these were duplicates due to whitespace differences, but that's not necessarily the case. What we have in the case that Gaurav posted is that there are permissible values that are different entities in the DB, but they have varying levels of complexity. In each of the 2 examples above, 1 of the duplicates has a "description" field, and the other of the duplicates has no "description" field.

But there are many more fields that are there:

Screen Shot 2021-09-22 at 5 32 26 PM

But when YAMLDumper.dumps(enum) happens, it looks like it ignores most of these fields; even when some of them are not empty. I looked into the LinkML code and I'm starting to understand how it serializes Enumerations and PermissibleValues, but I wonder what the best way to dedupe this is. I suppose that I should do the deduping after YAMLDumper.dumps(enum), since the Enumeration and PermissibleValues classes know what should and shouldn't be dumped. I suppose then my algorithm will be something like:

  1. enum_dump: str = YAMLDumper.dumps(enum)
  2. dct: Dict = yaml.safe_load(enum_dump) pprint(dct)
    {'description': 'Autogenerated Enumeration for CRDC-H Treatment treatment_effect',
    'name': 'CRDC-H.Treatment.treatment_effect',
    'permissible_values': [
          {'text': 'Not Reported'},
          {'text': 'Yes'},
          {'text': 'Unknown'},
          {'text': 'No'},
          {'text': 'Complete Necrosis (No Viable Tumor)'},
          {'text': 'Incomplete Necrosis (Viable Tumor Present)'},
          {'description': 'Unknown', 'text': 'Unknown'},
          {'description': 'Not Reported', 'text': 'Not Reported'},
          {'text': 'No Known Treatment Effect'},
          {'text': 'No Necrosis'}
    ]}

    3.a. Iterate over permissible_values and remove any entry which is fully contained within another entry, e.g. {'text': 'Unknown'} is fully contained within {'description': 'Unknown', 'text': 'Unknown'}, so {'text': 'Unknown'} does not contain any extra information and we can remove it. 3.b (easier, but more naive): Find all objects w/ more than 1 key in them (e.g. something with 2 keys might have text and description). Then, iterate over objs with only 1 key (text) in them. If there are matches, remove the matched 1-key object and keep the object with more than 1 key. However, while this would be sufficient to cover Gaurav's example, with this method, I think there could be a lot of edge cases duplicates that could slip in.

  3. Get back yaml string of deduped enumeration: enum_yml_str: str = yaml.dump(dct)
  4. Return : return Response(content=enum_yml_str, media_type="application/x-yaml")

@jiaola Mostly I'm curious if you agree with 3.a over 3.b, or some other alternative. If you don't have time to review this, no pressure, I'll just stick with 3.a.

joeflack4 commented 3 years ago

@jiaola @gaurav fix for this is done, but one side effect is that the order of the fields has changed for some reason. Before, 'text' came first because I think it was supposed to. But now it looks like the fields are alphabetically ordered. I'm not sure yet why that happens. I assume the order matters, or does it not?

description: Autogenerated Enumeration for CRDC-H Treatment treatment_effect
name: CRDC-H.Treatment.treatment_effect
permissible_values:
- text: 'Yes'
- text: 'No'
- text: Complete Necrosis (No Viable Tumor)
- text: Incomplete Necrosis (Viable Tumor Present)
- description: Unknown
  text: Unknown
- description: Not Reported
  text: Not Reported
- text: No Known Treatment Effect
- text: No Necrosis
joeflack4 commented 3 years ago

Dazhi mentioned that checking just on text field would probably be fine, even if description is a little different. But if a lot different, how to detect that (string similarity algorithm?), but that is getting complex.

Fields out of order is not really an issue.