Open mingyangligithub opened 1 year ago
I think match.py is only used to generate the synonyms file embedding/icd_mimic3_random_sort.json, which I have already provided.
You do not need to rerun it unless you want to train on another dataset that needs a different set of ICD codes.
There appears to be another source for synonyms besides UMLS's MRCONSO.RRF file (version 2024AA). For example, running preprocess/match.py generates 5 synonyms for E870.9 (Accidental cut, puncture, perforation or hemorrhage during unspecified medical care):
['accidental cut, puncture, perforation or hemorrhage during medical care', 'accidental cut, puncture, perforation, or hemorrhage during medical care', 'accidental cut, puncture, perforation or hemorrhage during unspecified medical care', 'accidental cut, puncture, perforation or hemorrhage during medical care (navigational concept)', 'accidental cut, puncture, perforation or haemorrhage during medical care']
But the embedding/icd_mimic3_random_sort.json file has 26 for the same code!
['accidental cut, puncture, perforation or haemorrhage during medical care, nos (disorder)', 'accidental cut, puncture, perforation or hemorrhage during medical care (navigational concept)', 'acc cut in med care', 'accidental cut, puncture, perforation, or hemorrhage during medical care', 'surg.accid.-medical care nos', 'accidental cut, puncture, perforation or hemorrhage during medical care, (finding)', 'accidental cut, puncture, perforation or hemorrhage during medical care (finding)', 'accidental cut, puncture, perforation or haemorrhage during medical care', 'accidental cut, puncture, perforation or hemorrhage during medical care, nos (finding)', 'accidental cut, puncture, perforation or hemorrhage during medical care,', "accidental cut, puncture, perforation ,h'ge medical care", 'surg.accid. medical care', 'accidental cut, puncture, perforation or hemorrhage during medical care, nos (navigational concept)', 'accidental cut, puncture, perforation or hemorrhage during medical care, (navigational concept)', 'accidental cut, puncture, perforation or haemorrhage during medical care,', 'accidental cut, puncture, perforation or hemorrhage during medical care', 'accidental cut, puncture, perforation or hemorrhage during medical care, nos', 'acc cut in med care nos', "accid.cut/punct/perf/h'ge-med.", "accid cut,puncture,perf,h'ge medical care", "accid cut,puncture,perf,h'ge - medical care nos", "accidental cut, puncture, perforation ,h'ge - medical care", 'accidental cut, puncture, perforation or haemorrhage during medical care, nos', "accid.cut/punct/perf/h'ge med.", 'accidental cut, puncture, perforation or hemorrhage during unspecified medical care', 'accidental cut, puncture, perforation or haemorrhage during medical care, (disorder)']
One clearly corresponds to the short title, but where do the other ones come from? For example, "accid.cut/punct/perf/h'ge med"?
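To see exactly which strings appear only in the shipped JSON, a small set difference helps. This is a minimal sketch: `diff_synonyms` is a hypothetical helper, not something from the repository, and the two lists are shortened excerpts of the ones quoted above.

```python
from typing import Iterable, List, Tuple


def diff_synonyms(
    generated: Iterable[str], shipped: Iterable[str]
) -> Tuple[List[str], List[str]]:
    """Return (only in generated, only in shipped), compared case-insensitively."""
    gen = {s.strip().lower() for s in generated}
    ship = {s.strip().lower() for s in shipped}
    return sorted(gen - ship), sorted(ship - gen)


# Shortened excerpts of the two lists quoted in this thread.
generated = [
    "accidental cut, puncture, perforation or hemorrhage during medical care",
    "accidental cut, puncture, perforation or haemorrhage during medical care",
]
shipped = [
    "accidental cut, puncture, perforation or hemorrhage during medical care",
    "acc cut in med care",
    "surg.accid. medical care",
]

only_generated, only_shipped = diff_synonyms(generated, shipped)
print(only_shipped)
```

Running the same comparison on the full lists would show which abbreviations need to be traced back to a source.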
We use the UMLS 2020AA release.
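For anyone who wants to check a given release by hand, here is a rough sketch of pulling every English string that shares a concept (CUI) with an ICD-9-CM code out of MRCONSO.RRF. It is not the logic of match.py, which may filter differently. The function name `synonyms_for_code`, the lowercasing, and the choice to match on the `ICD9CM` source abbreviation are my assumptions; the column positions follow the standard pipe-delimited MRCONSO layout.

```python
from typing import Iterator, List, Set

# Column positions in the pipe-delimited MRCONSO.RRF layout.
CUI, LAT, SAB, CODE, STR = 0, 1, 11, 13, 14


def _english_rows(path: str) -> Iterator[List[str]]:
    """Yield the split fields of every English row in the file."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            fields = line.rstrip("\n").split("|")
            if len(fields) > STR and fields[LAT] == "ENG":
                yield fields


def synonyms_for_code(mrconso_path: str, code: str, sab: str = "ICD9CM") -> Set[str]:
    """Collect lowercased English strings of all concepts linked to `code`.

    Hypothetical helper: two passes over the file, first to find the CUIs
    attached to the code in the given source, then to gather their strings.
    """
    cuis = {
        f[CUI]
        for f in _english_rows(mrconso_path)
        if f[SAB] == sab and f[CODE] == code
    }
    return {f[STR].lower() for f in _english_rows(mrconso_path) if f[CUI] in cuis}
```

Calling `synonyms_for_code("MRCONSO.RRF", "E870.9")` against the 2020AA and 2024AA files and comparing the two results would show whether the release difference accounts for the extra strings.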
I studied the code in the preprocess folder, and I understand how the code descriptions and synonyms are combined. However, I could not find where the result is used in the actual training process, because the icd_dict{} generated in generate_data_new.ipynb is not the one built in match.py. Where is the output of match.py used, and which file imports match.py? Do I need to run the code in preprocess myself and then generate new data?
Thanks and best regards.