ethansiegl closed this issue 3 years ago
Hi Ethan,
That's a great idea! Thank you for the links.
The parser may not work well on genomic and protein eligibility criteria, either because the training data do not contain enough relevant samples or because the MeSH vocabulary does not capture genomics-related concepts.
I ran the IE parser on NCT04318938. NEL finds no matches for TP53. NER was able to extract [0.5225585501932858, "tp53"] as a clinical variable from the third inclusion criterion. The NER output is written to data/output/ie_ner_clinical_trials.tsv, which you can inspect if ie_parse.sh is terminated after NER or if the line rm "$NER_FILE" is deleted.
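If you keep the NER file around, a quick way to check whether a given term was picked up is to scan it for that term. This is only a minimal sketch: the helper name rows_mentioning is made up for illustration, the column layout of the real file is not described in this thread, and the demo rows below are invented rather than real parser output. The function therefore makes no assumption about columns and simply searches every field.

```python
import csv
import tempfile
from pathlib import Path


def rows_mentioning(tsv_path, term):
    """Return every row of a TSV file in which any field contains `term` (case-insensitive)."""
    needle = term.lower()
    matches = []
    with open(tsv_path, newline="", encoding="utf-8") as handle:
        # QUOTE_NONE keeps embedded double quotes as ordinary characters.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            if any(needle in field.lower() for field in row):
                matches.append(row)
    return matches


# Demo on a throwaway file with made-up rows (NOT the real output format).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo_ner.tsv"
    demo_path.write_text(
        'trial-A\tinclusion\t[0.52, "tp53"]\n'
        'trial-A\texclusion\t[0.91, "pregnancy"]\n',
        encoding="utf-8",
    )
    demo_hits = rows_mentioning(demo_path, "TP53")

print(demo_hits)
```

On a real run you would call rows_mentioning("data/output/ie_ner_clinical_trials.tsv", "tp53") after stopping ie_parse.sh before the file is removed.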
I also tried a few other trials, with less success. We may indeed need to augment the training data and tune the NEL thresholds. Note that the MeSH vocabulary can be explored with ./script/search.sh, a CLI tool that matches individual terms to concepts (try entering mutation or protein).
The word embedding vectors are aware of similarly used genomic words. Using TP53 as an example, here are its nearest neighbors (word, similarity score, frequency):
tp53 1.000 588
mutated 0.744 1099
mutation 0.741 13165
idh1r132 0.721 5
deletion 0.712 1279
srsf2 0.708 30
mutations 0.700 10603
bcor 0.689 5
runx1 0.680 56
asxl1 0.680 38
p53 0.680 1264
brca 0.676 1032
mutational 0.669 783
sf3b1 0.666 23
germline 0.663 1297
mutant 0.661 1246
flt3-itd 0.661 243
germ-line 0.654 46
zrsr2 0.646 12
ptch1 0.639 22
igv_h 0.636 10
idh2 0.634 219
igvh 0.633 45
non-synonymous 0.632 40
crebbp 0.631 7
brca1 0.629 1306
etv6 0.629 21
lkb1 0.627 27
fbxw7 0.626 12
dnmt3a 0.624 37
tet2 0.622 58
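For anyone who wants to reproduce a neighbor list like the one above on their own vectors, here is a minimal sketch. It assumes the similarity score is cosine similarity, which is common for word embeddings but is not stated in this thread, and it uses tiny made-up 3-dimensional vectors instead of the real embeddings. The function names are illustrative only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_neighbors(query, vectors, top_n=5):
    """Rank all words (including the query itself) by similarity to `query`."""
    target = vectors[query]
    scored = [(word, cosine(target, vec)) for word, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


# Toy vectors invented for illustration; real embeddings have many more dimensions.
toy_vectors = {
    "tp53": [0.9, 0.1, 0.0],
    "p53": [0.8, 0.2, 0.1],
    "mutation": [0.6, 0.5, 0.0],
    "aspirin": [0.0, 0.1, 0.9],
}

for word, score in nearest_neighbors("tp53", toy_vectors, top_n=3):
    print(f"{word}\t{score:.3f}")
```

As in the list above, the query word comes back first with a score of about 1.0.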
ok I see. thanks for the quick reply!
Thanks Ethan for raising this question - I have a similar problem in my project. Is there any detailed guidance on how to augment the training data?
One way to augment the training data is to collect criteria that the parser gets wrong or does not recognize. Group identical or similar criteria together, label the most frequent ones, and add them to the training data. Repeat.
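The grouping step can be sketched roughly as follows. This is an illustration, not part of the parser: the normalization rule (lowercasing, replacing standalone numbers with "#", trimming trailing punctuation) is just one assumed way to decide that two criteria are "similar", and the function names and sample criteria are made up.

```python
import re
from collections import Counter


def normalize(criterion):
    """Map near-identical criteria to one key: lowercase, mask standalone numbers, trim punctuation."""
    text = criterion.lower().strip()
    # \b keeps digits inside tokens such as "tp53" untouched.
    text = re.sub(r"\b\d+(?:\.\d+)?\b", "#", text)
    text = re.sub(r"\s+", " ", text)
    return text.strip(" .;:,")


def most_frequent_groups(criteria, top_n=10):
    """Count criteria per normalized key and return the most frequent groups."""
    counts = Counter(normalize(c) for c in criteria)
    return counts.most_common(top_n)


# Hypothetical criteria that the parser failed on.
failed = [
    "Age >= 18 years.",
    "age >= 21 years",
    "Documented TP53 mutation",
    "documented TP53 mutation;",
    "ECOG 0-1",
]

for key, count in most_frequent_groups(failed):
    print(count, key)
```

The groups at the top of this list are the ones worth labeling first in each round.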
Another way is to keep a list of terms that are important for your project and to check the parsing quality of criteria that contain these terms. Problematic criteria, ranked for example by the number of important terms they contain or by the frequency of similar criteria, are then labeled and added to the training data.
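A rough sketch of the term-based ranking, under the assumption that "occurrence of important terms" means a simple token count per criterion. The term list, the sample criteria, and the function names are all hypothetical.

```python
import re


def count_term_hits(criterion, important_terms):
    """Count tokens in `criterion` that appear in the set of (lowercase) important terms."""
    tokens = re.findall(r"[a-z0-9][a-z0-9_-]*", criterion.lower())
    return sum(1 for token in tokens if token in important_terms)


def rank_for_labeling(failed_criteria, important_terms):
    """Return (hits, criterion) pairs with at least one hit, highest hit count first."""
    terms = {term.lower() for term in important_terms}
    scored = [(count_term_hits(c, terms), c) for c in failed_criteria]
    scored = [(hits, c) for hits, c in scored if hits > 0]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Hypothetical project term list and badly parsed criteria.
terms = ["tp53", "brca1", "idh2", "mutation", "germline"]
failed = [
    "Known IDH2 mutation",
    "TP53 mutation or BRCA1 germline mutation",
    "Able to swallow tablets",
]

for hits, criterion in rank_for_labeling(failed, terms):
    print(hits, criterion)
```

Criteria with no important terms are dropped, so labeling effort goes to the cases that matter for the project.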
Focusing on a few treatment areas or specialties will make the problem more manageable. General quality improvements are suggested here.
Thanks for the speedy reply. Very helpful.
@ethansiegl, if you have made progress on generating new custom concepts for genomics, we can add them to the system. You can either open a pull request or I can do it for you. The column format is <concept name> <synonym> <code>.
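To prepare rows in that three-column shape, something like the sketch below could work. Several things here are assumptions rather than facts from this thread: that the columns are tab-separated, that each synonym goes on its own row, and the code value CUSTOM-0001, which is a made-up placeholder and not a real concept code. The function names are illustrative.

```python
import csv
import io


def concept_rows(concept_name, synonyms, code):
    """Build one (concept name, synonym, code) row per synonym."""
    return [(concept_name, synonym, code) for synonym in synonyms]


def to_tsv(rows):
    """Serialize rows as tab-separated text (the delimiter is an assumption)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


# Placeholder concept; the code is hypothetical.
rows = concept_rows("TP53 gene", ["tp53", "p53"], "CUSTOM-0001")
print(to_tsv(rows), end="")
```

Check the existing concept files in the repository for the actual delimiter and code scheme before submitting a pull request.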
Is this parser compatible with clinical trials which have genomic eligibility criteria? I tried to run a very simple test trial with the following eligibility criteria but the tool was not able to generate any output.
Asking because it would be really great if this tool could be used to automatically generate Clinical Trial Markup Language and/or used as part of the MatchMiner platform.