facebookresearch / Clinical-Trial-Parser

Library for converting clinical trial eligibility criteria to a machine-readable format.
Apache License 2.0

parsing genomic criteria #7

Closed ethansiegl closed 3 years ago

ethansiegl commented 4 years ago

Is this parser compatible with clinical trials which have genomic eligibility criteria? I tried to run a very simple test trial with the following eligibility criteria but the tool was not able to generate any output.

Inclusion Criteria:
    -  EGFR Mutation

Exclusion Criteria:
    -  TP53 Mutation

Asking because it would be really great if this tool could be used to automatically generate Clinical Trial Markup Language and/or used as part of the MatchMiner platform.

salkola commented 4 years ago

Hi Ethan,

That's a great idea! Thank you for the links.

The parser may not work well on genomic and protein eligibility criteria, either because the training data lack enough relevant samples or because the MeSH vocabulary does not capture genomics-related concepts.

I ran the IE parser on NCT04318938. NEL (named entity linking) found no matches for TP53. NER (named entity recognition) extracted [0.5225585501932858, "tp53"] as a clinical variable from the third inclusion criterion. The NER output is written to data/output/ie_ner_clinical_trials.tsv, which you can inspect if you terminate ie_parse.sh after the NER step or delete the line rm "$NER_FILE".

I also tried a few other trials with less success. We may indeed need to augment the training data and tune the NEL thresholds. Note that the MeSH vocabulary can be explored with ./script/search.sh, which is a CLI tool to match individual terms to concepts (try entering mutation or protein).

The word embedding vectors are aware of similarly used genomic words. Using TP53 as an example, here are its nearest neighbors (word, similarity score, frequency):

tp53                  1.000      588
mutated               0.744     1099
mutation              0.741    13165
idh1r132              0.721        5
deletion              0.712     1279
srsf2                 0.708       30
mutations             0.700    10603
bcor                  0.689        5
runx1                 0.680       56
asxl1                 0.680       38
p53                   0.680     1264
brca                  0.676     1032
mutational            0.669      783
sf3b1                 0.666       23
germline              0.663     1297
mutant                0.661     1246
flt3-itd              0.661      243
germ-line             0.654       46
zrsr2                 0.646       12
ptch1                 0.639       22
igv_h                 0.636       10
idh2                  0.634      219
igvh                  0.633       45
non-synonymous        0.632       40
crebbp                0.631        7
brca1                 0.629     1306
etv6                  0.629       21
lkb1                  0.627       27
fbxw7                 0.626       12
dnmt3a                0.624       37
tet2                  0.622       58
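
For readers who want to reproduce this kind of lookup, here is a minimal sketch of a nearest-neighbor search over word vectors. It assumes the similarity score is cosine similarity, which the list above does not state explicitly. The vectors are made-up 3-dimensional toy values, and the names cosine, nearest_neighbors and toy_vectors are hypothetical, not part of the parser.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_neighbors(word, vectors, k=3):
    """Return the k words most similar to `word`, best first.

    The query word itself is included, which is why it would appear
    at the top of the list with a similarity of 1.
    """
    query = vectors[word]
    scored = [(other, cosine(query, vec)) for other, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors invented for illustration; real embeddings have many more dimensions.
toy_vectors = {
    "tp53": [0.9, 0.1, 0.0],
    "p53": [0.8, 0.2, 0.1],
    "mutation": [0.7, 0.6, 0.0],
    "aspirin": [0.0, 0.1, 0.9],
}

if __name__ == "__main__":
    for neighbor, similarity in nearest_neighbors("tp53", toy_vectors, k=4):
        print(f"{neighbor:<12}{similarity:.3f}")
```

With real embeddings the same loop would be run over the full vocabulary; a vectorized library would be faster, but the logic stays the same.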

ethansiegl commented 4 years ago

ok I see. thanks for the quick reply!

samulezj commented 3 years ago

Thanks, Ethan, for raising this question. I have a similar problem in my project. Is there any detailed guidance on how to augment the training data?

salkola commented 3 years ago

One way to augment the training data is to collect criteria that the parser gets wrong or does not recognize. Group identical or similar criteria together, then label the most frequent ones and add them to the training data. Repeat.
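
As a rough sketch of that loop, the snippet below groups failed criteria by a normalized form and ranks the groups by frequency, so the most common failures can be labeled first. The normalization rules (lowercasing, replacing standalone numbers, dropping punctuation) and the names normalize and rank_failures are assumptions for illustration, not something the parser provides.

```python
import re
from collections import Counter


def normalize(criterion):
    """Map near-identical criteria to the same key."""
    text = criterion.lower()
    # Replace standalone numbers only, so gene names such as "tp53" survive.
    text = re.sub(r"\b\d+(?:\.\d+)?\b", "num", text)
    # Drop punctuation and collapse whitespace.
    text = re.sub(r"[^a-z0-9 ]+", " ", text)
    return " ".join(text.split())


def rank_failures(failed_criteria, top_n=10):
    """Return (normalized criterion, count) pairs, most frequent first."""
    counts = Counter(normalize(c) for c in failed_criteria)
    return counts.most_common(top_n)


if __name__ == "__main__":
    # Hypothetical criteria that the parser failed on.
    failed = [
        "EGFR mutation",
        "egfr mutation.",
        "TP53 Mutation",
        "EGFR  Mutation",
        "TP53 mutation",
    ]
    for key, count in rank_failures(failed):
        print(count, key)
```

Exact matching after normalization is the simplest possible notion of "similar"; a fuzzier grouping could be swapped in without changing the ranking step.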

Another way is to keep a list of terms that are important for your project and to assess the parsing quality of criteria that contain these terms. Problematic criteria, ranked, say, by the occurrence of important terms or by the frequency of similar criteria, are then labeled and added to the training data.
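
The term-based ranking could look roughly like the sketch below. It counts how many tokens of each problematic criterion appear in a project-specific term list and sorts by that count. The tokenizer, the example terms and the names term_hits and rank_by_terms are all hypothetical.

```python
import re


def term_hits(criterion, important_terms):
    """Count the tokens in a criterion that belong to the important-term set."""
    tokens = re.findall(r"[a-z0-9-]+", criterion.lower())
    return sum(1 for token in tokens if token in important_terms)


def rank_by_terms(criteria, important_terms):
    """Return (hits, criterion) pairs with at least one hit, highest first."""
    scored = [(term_hits(c, important_terms), c) for c in criteria]
    scored = [pair for pair in scored if pair[0] > 0]
    # sorted() is stable, so ties keep their original order.
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    # Hypothetical term list and criteria for a genomics-focused project.
    terms = {"egfr", "tp53", "mutation", "flt3-itd"}
    problematic = [
        "ECOG performance status 0-1",
        "Documented EGFR mutation",
        "TP53 mutation or FLT3-ITD mutation",
    ]
    for hits, criterion in rank_by_terms(problematic, terms):
        print(hits, criterion)
```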

Focusing on a few treatment areas or specialties will make the problem more manageable. General quality improvements are suggested here.

samulezj commented 3 years ago

Thanks for the speedy reply. Very helpful.

salkola commented 3 years ago

@ethansiegl, if you have made progress on generating new custom concepts for genomics, we can add them to the system. You can either open a pull request or I can do it for you. The column format is <concept name> <synonym> <code>.
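
If it helps when preparing such a list, here is a small sketch that serializes rows in the <concept name> <synonym> <code> column order. The tab delimiter is an assumption, since the delimiter is not specified above, and the codes shown are placeholders rather than real vocabulary codes. The name format_concepts is hypothetical.

```python
import csv
import io


def format_concepts(rows, delimiter="\t"):
    """Serialize (concept name, synonym, code) rows, one concept per line.

    The tab delimiter is an assumption; adjust it to whatever the
    vocabulary files in the repository actually use.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    for name, synonym, code in rows:
        writer.writerow([name.strip(), synonym.strip(), code.strip()])
    return buffer.getvalue()


if __name__ == "__main__":
    # Placeholder rows; the codes are invented and carry no meaning.
    custom_concepts = [
        ("TP53 mutation", "tp53 mutation", "CUSTOM:0001"),
        ("TP53 mutation", "p53 mutation", "CUSTOM:0001"),
        ("EGFR mutation", "egfr mutation", "CUSTOM:0002"),
    ]
    print(format_concepts(custom_concepts), end="")
```

Listing several synonyms under the same concept name and code, as in the placeholder rows, is one plausible way to use the three columns.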