facebookresearch / Clinical-Trial-Parser

Library for converting clinical trial eligibility criteria to a machine-readable format.
Apache License 2.0
163 stars 58 forks

No output when using custom csv file as input #10

Closed iamyihwa closed 4 years ago

iamyihwa commented 4 years ago

Hello, I am trying to use a custom input for the clinical trial parser.

For this, I created a csv file with 6 columns ('#nct_id', 'title', 'has_us_facility', 'conditions', 'eligibility_criteria') and filled eligibility_criteria with the input I wanted (and the rest with arbitrary input of the right type). The eligibility criteria had the format: Inclusion Criteria: - (CRITERION1) - (CRITERION2) .. I have also changed the format of the input to include [...]. Then I changed the input and output file names in script>ie_parse.sh and cfg_parse.sh.

However, it doesn't seem to be able to detect anything (e.g. entities, relationships, etc.). I have also tried replacing the eligibility criteria by simply copying an existing one from clinical_trials.csv, but it still didn't work.

I have been looking at the code, but didn't see any place where I could change things for the custom input.

salkola commented 4 years ago

The parser tool assumes AACT formatting of the eligibility criteria. In particular, criteria are assumed to be separated by two newlines. This can be changed by changing reCriteriaSplitter.

The inclusion and exclusion criteria sections are assumed to begin with headings like "eligibility criteria:" and "exclusion criteria:". Some variability is tolerated.

You can tailor the criteria extraction to your needs by changing parse_criteria.go.
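As a rough illustration of the extraction described above, here is a minimal Python sketch. It is not the parser's Go code: the patterns HEADING and SPLITTER and the function extract_criteria are hypothetical stand-ins, assuming headings on their own line and criteria separated by a blank line.

```python
import re

# Hypothetical patterns for illustration only; the parser's real ones
# (e.g. reCriteriaSplitter) live in the Go source and may differ.
HEADING = re.compile(r"(?im)^[ \t]*(inclusion|exclusion) criteria:?[ \t]*$")
SPLITTER = re.compile(r"\n\s*\n")  # criteria separated by two newlines


def extract_criteria(text):
    """Return {'inclusion': [...], 'exclusion': [...]} from AACT-style text."""
    sections = {"inclusion": [], "exclusion": []}
    parts = HEADING.split(text)
    # With one capture group, split yields [preamble, name1, body1, name2, body2, ...]
    for name, body in zip(parts[1::2], parts[2::2]):
        for chunk in SPLITTER.split(body):
            # Collapse whitespace and drop the leading list dash.
            criterion = " ".join(chunk.split()).lstrip("- ").strip()
            if criterion:
                sections[name.lower()].append(criterion)
    return sections


sample = (
    "Inclusion Criteria:\n\n"
    "  -  Age 18 years or older\n\n"
    "  -  Confirmed diagnosis\n\n"
    "Exclusion Criteria:\n\n"
    "  -  Pregnancy\n"
)
print(extract_criteria(sample))
```

Under these assumed patterns, a single-line string such as "Inclusion Criteria: - A - B" yields no criteria at all, because the heading is not on its own line and there are no blank lines to split on.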

iamyihwa commented 4 years ago

Thanks for your reply. However, it doesn't seem to be due to the formats you mentioned (inclusion criteria, \n \n, etc.).

As a test, I copied one row from clinical_trials.csv, created a new dataframe, saved it to csv, and gave this as input (changing the names of the input and output files in cfg_parse.sh and ie_parse.sh).

However, it still doesn't detect any entities or relationships. The content of the new dataframe: [image]

I am attaching screenshots of the output of running the script: [image] [image]

salkola commented 4 years ago

"Criteria: 0" in your log output line "main.go:151] Studies: 2, Criteria: 0, Parsed criteria: 0, Relations: 0, Relations per criteria: NaN%" tells us that the CFG parser extracted no criteria from the eligibility_criteria input fields. The same applies to the IE parser because they share code for extracting inclusion and exclusion criteria from eligibility criteria text.

Either df.to_csv would need to be changed to match the parsers' criteria extraction or the criteria extraction would need to be changed to match the output of df.to_csv.

As an example, I ingested NCT04346355 to clinical_trials.csv, which I used as an input to the CFG and IE parsers. The CFG parser gave: Studies: 1, Criteria: 22, Parsed criteria: 8, Relations: 8, Relations per criteria: 36.4%. The modified ingest script and the outputs are attached below.

NCT04346355.zip
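For reference, the percentages in the two log lines can be reproduced with a short Python sketch. It assumes that "Relations per criteria" is the number of relations divided by the number of criteria, which the label suggests but the thread does not confirm; relations_per_criteria is a hypothetical helper, not a function from the parser.

```python
def relations_per_criteria(relations, criteria):
    """Percentage of relations per criterion; NaN when there are no criteria."""
    if criteria == 0:
        # Mirrors floating-point 0/0, which is NaN (Python itself would raise).
        return float("nan")
    return 100.0 * relations / criteria


print(f"{relations_per_criteria(8, 22):.1f}%")  # 36.4%
print(f"{relations_per_criteria(0, 0):.1f}%")   # nan% (logged as NaN% by the parser)
```

A NaN in that field is therefore a sign that zero criteria were extracted, rather than that criteria were found but failed to parse.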

iamyihwa commented 4 years ago

Thanks for the input. It turns out the command I used to create the csv file (pandas .to_csv) was adding an extra leading column, because I had not set index=False (by default it is set to True). After setting it, the issue was solved. Thank you!! Since the issue is solved, I will close the case.

example:
,#nct_id,title,has_us_facility,conditions,eligibility_criteria : result of df.to_csv()
#nct_id,title,has_us_facility,conditions,eligibility_criteria : result of df.to_csv(index=False)
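The difference between the two headers can be reproduced with a small pandas sketch. The column names come from this thread; the row values are placeholders made up for illustration.

```python
import pandas as pd

# Placeholder row; only the column names are taken from the thread.
df = pd.DataFrame({
    "#nct_id": ["NCT00000000"],
    "title": ["Example title"],
    "has_us_facility": [False],
    "conditions": ["Example condition"],
    "eligibility_criteria": ["Inclusion Criteria:\n\n  -  Example criterion"],
})

# Without a path argument, to_csv returns the CSV text as a string.
with_index = df.to_csv()               # default index=True adds a leading unnamed column
without_index = df.to_csv(index=False)

print(with_index.splitlines()[0])
print(without_index.splitlines()[0])
```

With the default index=True, every row gains a leading index value, so a reader that expects '#nct_id' in the first column sees shifted fields instead.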

salkola commented 4 years ago

Thank you for submitting this issue. It helped to improve the logging.

iamyihwa commented 4 years ago

Sure! Thank you for your help in solving the issue @salkola! In retrospect it does not look like a big issue, but I was close to giving up without your help! Thanks! Was there a reason you used MeSH over UMLS? I saw in the code that there is also an option to use UMLS. I was just curious.

salkola commented 4 years ago

Good question! MeSH is a convenient vocabulary for linking extracted terms to medical concepts. ClinicalTrials.gov also uses it to index their trials. Because MeSH does not cover all terms that appear in the eligibility criteria, we have looked for other vocabularies. UMLS contains multiple vocabularies and standards, which could increase the NEL quality and matching rate.