Closed iamyihwa closed 4 years ago
The parser tool assumes AACT formatting of the eligibility criteria. In particular, criteria are assumed to be separated by two newlines. This can be changed by changing reCriteriaSplitter.
The inclusion and exclusion criteria sections are assumed to begin with headings like "eligibility criteria:" and "exclusion criteria:". Some variability is tolerated.
You can tailor the criteria extraction to your needs by changing parse_criteria.go.
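To illustrate the assumptions above, here is a minimal Python sketch of this kind of extraction. It is not the tool's Go code: the regular expressions and the function name split_criteria are hypothetical, written only to show how blank-line separation and section headings interact.

```python
import re

# Hypothetical patterns for illustration; NOT copied from parse_criteria.go.
SECTION_HEADING = re.compile(
    r"(inclusion|exclusion|eligibility)\s+criteria\s*:?", re.IGNORECASE
)
CRITERIA_SPLITTER = re.compile(r"\n\s*\n")  # a blank line between criteria


def split_criteria(text):
    """Group blank-line-separated criteria under inclusion/exclusion headings."""
    sections = {"inclusion": [], "exclusion": []}
    current = None
    for block in CRITERIA_SPLITTER.split(text):
        block = block.strip()
        if not block:
            continue
        heading = SECTION_HEADING.fullmatch(block)
        if heading:
            # "eligibility criteria:" is treated as the inclusion section here.
            kind = heading.group(1).lower()
            current = "exclusion" if kind == "exclusion" else "inclusion"
            continue
        if current is not None:
            # Drop a leading bullet and collapse internal whitespace.
            sections[current].append(re.sub(r"\s+", " ", block.lstrip("- ")))
    return sections


sample = (
    "Inclusion Criteria:\n\n- Age 18 or older\n\n- Confirmed diagnosis\n\n"
    "Exclusion Criteria:\n\n- Pregnancy"
)
result = split_criteria(sample)
```

In this sketch, text that appears before any recognized heading is ignored, so input without the expected headings or blank lines yields no criteria.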
Thanks for your reply. However, the problem doesn't seem to be caused by the formatting you mentioned (the "inclusion criteria" heading, \n \n, etc.).
As a test, I copied one row from clinical_trials.csv into a new dataframe, saved it to CSV, and used that as input (I changed the input and output file names in cfg_parse.sh and ie_parse.sh).
However, the parser still doesn't detect any entities or relationships. The content of the new dataframe.
I am attaching the screenshots of the outputs of running the script.
Criteria: 0 in your log output line "main.go:151] Studies: 2, Criteria: 0, Parsed criteria: 0, Relations: 0, Relations per criteria: NaN%" tells us that the CFG parser extracted no criteria from the eligibility_criteria input fields. The same applies to the IE parser because they share code for extracting inclusion and exclusion criteria from eligibility criteria text.
Either df.to_csv would need to be changed to match the parsers' criteria extraction, or the criteria extraction would need to be changed to match the output of df.to_csv.
As an example, I ingested NCT04346355 to clinical_trials.csv, which I used as an input to the CFG and IE parsers. The CFG parser gave: Studies: 1, Criteria: 22, Parsed criteria: 8, Relations: 8, Relations per criteria: 36.4%. The modified ingest script and the outputs are attached below.
Thanks for the input. It turns out the command I used to create the CSV file (pandas .to_csv) was adding an extra leading comma because I had not set index=False (the default is index=True). After setting it, the issue was solved. Thank you!! Since the issue is solved, I will close the case.
Example:
,#nct_id,title,has_us_facility,conditions,eligibility_criteria : result of df.to_csv()
#nct_id,title,has_us_facility,conditions,eligibility_criteria : result of df.to_csv(index=False)
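A quick way to catch this before running the parsers is to inspect the header row of the CSV file. The sketch below uses only the Python standard library. The function name check_header is hypothetical, the column names come from the header quoted in this thread, and whether the parsers require exactly these columns is an assumption.

```python
import csv
import io

# Column names as quoted in this thread; treating them as required is an assumption.
EXPECTED_COLUMNS = [
    "#nct_id",
    "title",
    "has_us_facility",
    "conditions",
    "eligibility_criteria",
]


def check_header(csv_text):
    """Return a list of problems found in the CSV header row (empty if none)."""
    header = next(csv.reader(io.StringIO(csv_text)), [])
    problems = []
    if header and header[0] == "":
        # pandas writes the index as an unnamed first column unless
        # the file is saved with df.to_csv(path, index=False).
        problems.append("unnamed leading column (likely a pandas index)")
    missing = [name for name in EXPECTED_COLUMNS if name not in header]
    if missing:
        problems.append("missing columns: " + ", ".join(missing))
    return problems


with_index = ",#nct_id,title,has_us_facility,conditions,eligibility_criteria\n"
without_index = "#nct_id,title,has_us_facility,conditions,eligibility_criteria\n"

problems_with_index = check_header(with_index)
problems_without_index = check_header(without_index)
```

Here the first header is flagged for its unnamed leading column, and the second passes with no problems.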
Thank you for submitting this issue. It helped to improve the logging.
Sure! Thank you for your help in solving the issue, @salkola! In retrospect it does not look like a big issue, but I was close to giving up without your help. Thanks! Was there a reason you used MeSH over UMLS? I saw in the code that there is also an option to use UMLS. I was just curious.
Good question! MeSH is a convenient vocabulary for linking extracted terms to medical concepts. ClinicalTrials.gov also uses it to index their trials. Because MeSH does not cover all terms that appear in the eligibility criteria, we have looked for other vocabularies. UMLS contains multiple vocabularies and standards, which could increase the NEL quality and matching rate.
Hello, I am trying to use a custom input for the clinical trial parser.
For this, I have created a csv file with 6 columns ('#nct_id', 'title', 'has_us_facility', 'conditions', 'eligibility_criteria') and filled eligibility_criteria with the input I wanted (and the rest with other random input of the right type). The eligibility criteria had the format: Inclusion Criteria: - (CRITERION1) - (CRITERION2) .. I have also changed the format of the input to include Then changed the input, output file names at script>ie_parse.sh and cfg_parse.sh.
However, it doesn't seem to be able to detect anything (e.g. entities, relationships, etc.). I have also tried replacing the eligibility criteria by simply copying an existing one from clinical_trials.csv, but it still didn't work.
I have been looking at the code, but didn't see any place where I could change things for the custom input.