Closed gaurav closed 3 years ago
@jiaola I imagine this is an easy change if this is in fact a bug as Gaurav describes. Let me know if you want me to fix this, or we can wait until he concept identifiers change.
Any updates on this? We'll probably be generating a new CRDC-H model this week or next, so it'd be good to have this fixed if possible before then.
@gaurav Just letting you know I'm still working on this, but this is a little hairier than expected. Under the hood, de-duping the permissible values are more complicated and contain more information than I realized (that doesn't show in the API results). Requires some logic that I did not anticipate.
Dazhi, let me know if you have any insight as to how best to dedupe these. I had originally thought that these were duplicates due to whitespace differences, but that's not necessarily the case. What we have in the case that Gaurav posted is that there are permissible values that are different entities in the DB, but they have varying levels of complexity. In each of the 2 examples above, 1 of the duplicates has a "description" field, and the other of the duplicates has no "description" field.
But there are many more fields that are there:
But when YAMLDumper.dumps(enum)
happens, it looks like it ignores most of these fields; even when some of them are not empty. I looked into the LinkML code and I'm starting to understand how it serializes Enumerations and PermissibleValues, but I wonder what the best way to dedupe this is. I suppose that I should do the deduping after YAMLDumper.dumps(enum)
, since the Enumeration and PermissibleValues classes know what should and shouldn't be dumped. I suppose then my algorithm will be something like:
enum_dump: str = YAMLDumper.dumps(enum)
dct: Dict = yaml.safe_load(enum_dump)
pprint(dct)
{'description': 'Autogenerated Enumeration for CRDC-H Treatment treatment_effect',
'name': 'CRDC-H.Treatment.treatment_effect',
'permissible_values': [
{'text': 'Not Reported'},
{'text': 'Yes'},
{'text': 'Unknown'},
{'text': 'No'},
{'text': 'Complete Necrosis (No Viable Tumor)'},
{'text': 'Incomplete Necrosis (Viable Tumor Present)'},
{'description': 'Unknown', 'text': 'Unknown'},
{'description': 'Not Reported', 'text': 'Not Reported'},
{'text': 'No Known Treatment Effect'},
{'text': 'No Necrosis'}
]}
3.a. Iterate over permissible_values
and remove any entry which is fully contained within another entry, e.g. {'text': 'Unknown'}
is fully contained within {'description': 'Unknown', 'text': 'Unknown'}
, so {'text': 'Unknown'}
does not contain any extra information and we can remove it.
3.b (easier, but more naive): Find all objects w/ more than 1 key in them (e.g. something with 2 keys might have text
and description
). Then, iterate over objs with only 1 key (text
) in them. If there are matches, remove the matched 1-key object and keep the object with more than 1 key. However, while this would be sufficient to cover Gaurav's example, with this method, I think there could be a lot of edge cases duplicates that could slip in.
enum_yml_str: str = yaml.dump(dct)
return Response(content=enum_yml_str, media_type="application/x-yaml")
@jiaola Mostly I'm curious if you agree with 3.a over 3.b, or some other alternative. If you don't have time to review this, no pressure, I'll just stick with 3.a.
@jiaola @gaurav fix for this is done, but one side effect is that the order of the fields has changed for some reason. Before, 'text' came first because I think it was supposed to. But now it looks like the fields are alphabetically ordered. I'm not sure yet why that happens. I assume the order matters, or does it not?
description: Autogenerated Enumeration for CRDC-H Treatment treatment_effect
name: CRDC-H.Treatment.treatment_effect
permissible_values:
- text: 'Yes'
- text: 'No'
- text: Complete Necrosis (No Viable Tumor)
- text: Incomplete Necrosis (Viable Tumor Present)
- description: Unknown
text: Unknown
- description: Not Reported
text: Not Reported
- text: No Known Treatment Effect
- text: No Necrosis
Dazhi mentioned that checking just on text
field would probably be fine, even if description is a little different. But if a lot different, how to detect that (string similarity algorithm?), but that is getting complex.
Fields out of order is not really an issue.
Original Post
The Terminology Service currently returns duplicate permissible values in value-only mode. For example, looking up the enumerations for
CRDC-H.Treatment.treatment_effect
returns two values for "Unknown", one of which has a description. This duplicate doesn't appear ifvalue_only
is set totrue
. So this problem might go away once we move over to concept identifiers, but I wanted to report it now so we remember to check it later.Edit from Joe, 2021/09/22
Example yaml
Here is the result from the link that Gaurav posted:
Related issues
https://github.com/linkml/linkml/issues/1068
Tasks
linkml-runtime
(I'm thinking)