CCranney commented 1 year ago

Hi Ryan,

I'm working on a project to streamline matching cancer patients who are interested in clinical trials to suitable trials that are currently recruiting (generally through ClinicalTrials.gov). It looks like this is the intent of your trial_patient_match models here, and I wanted to clarify a few things. In short, I'm looking to make an app where a patient enters their personal information (including age, gender, cancer type, stage, etc.) and receives a list of potential clinical trials that are recruiting. This is made rather difficult because the Inclusion Criteria field in the ClinicalTrials.gov API output, which contains most of the finer details of patient-trial matching, is open text and therefore inconsistent. I'm trying to find a way around that barrier.

I played around with the Google Colab notebooks where you train trial_patient_match models on EHR and ClinicalTrials.gov data. It looks like the patient data primarily focuses on previous patient visits and demographic data. I'm not seeing a field for specific cancer type, or time since first diagnosis, cancer stage etc. Is it assumed that those fields are best indicated by the trial data as opposed to the patient data?

It looks to me like the trial_patient_match model is trained by matching patients to the trial they have already been accepted into. Is that correct? That makes sense to me, as it would provide a fairly robust list of patient-to-trial matches, but I am toying with the idea of adapting the model to accept survey data.

RyanWangZf commented 1 year ago

Hi, thanks for your interest in pytrial!

The setup in this package follows the patient-trial matching literature, which uses EHR visit sequences and demographics to encode patient records into dense embeddings and then matches them to trials.

However, that does not mean this information must always be used for matching. The implementation of PatientData contains a field x, which can hold tabular data represented as a pd.DataFrame, so it offers the flexibility to include all kinds of patient properties as long as they can be organized as a table.
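As a rough illustration of the tabular route, the sketch below turns survey-style answers into a pd.DataFrame with numeric columns first and integer-coded categorical columns after. The column names and values are made up for the example, and how the resulting table is handed to PatientData is not shown here; check the class signature before relying on this layout.

```python
import pandas as pd

# Hypothetical survey answers; column names and values are illustrative only.
survey = pd.DataFrame(
    {
        "age": [61, 47, 70],
        "gender": ["F", "M", "F"],
        "cancer_type": ["breast", "lung", "breast"],
        "stage": ["II", "IV", "I"],
    }
)

numeric_cols = ["age"]
categorical_cols = ["gender", "cancer_type", "stage"]

# Put numeric columns first so a single integer marks where they end.
features = survey[numeric_cols + categorical_cols].copy()

# Replace each categorical value with an integer code.
for col in categorical_cols:
    features[col] = features[col].astype("category").cat.codes

# Number of leading numeric columns, and number of categories per categorical column.
n_num_feature = len(numeric_cols)
cat_cardinalities = [int(survey[col].nunique()) for col in categorical_cols]
```

Keeping numeric columns in front mirrors the idea of a single count that says where the numeric features end, with one cardinality per remaining categorical column.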

It is also possible to remove the sequence inputs from PatientData, but that will require adapting the existing implementations. Unfortunately, we don't have an implementation that uses only the tabular data for matching. Nonetheless, it should be rather easy to implement, because all we need to do is disable the sequence inputs in the existing data class. I suggest you clone the package to your local environment and try subclassing trial_patient_match.data.PatientData and the models in trial_patient_match.models to adjust the data format and the model inputs.

Zifeng

CCranney commented 1 year ago

Hi Zifeng,

Thank you so much! By any chance are there specific papers you are looking at in the patient trial matching literature that you could share here? It would help a great deal reading through the code to know what it is specifically based on.

Thank you for the subclass suggestion! I'll probably make a forked repository for that - it would be easier for me to develop, and ultimately can be pulled back here if you like the additions I make.

Caleb

RyanWangZf commented 1 year ago

We have two models implemented in PyTrial: https://pytrial.readthedocs.io/en/latest/pytrial.tasks.trial_patient_match.html

DeepEnroll: Zhang, Xingyao, et al. “DeepEnroll: patient-trial matching with deep embedding and entailment prediction.” Proceedings of The Web Conference 2020.
COMPOSE: Gao, J., et al. (2020, August). COMPOSE: cross-modal pseudo-siamese network for patient trial matching. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (pp. 803-812).

Some other relevant papers:

Dense embeddings

Isabel Segura-Bedmar and Pablo Raez. Cohort selection for clinical trials using deep learning models. Journal of the American Medical Informatics Association, 26(11):1181–1188, 2019.
Houssein Dhayne, Rima Kilany, Rafiqul Haque, and Yehia Taher. EMR2Vec: Bridging the gap between patient data and clinical trial. Computers & Industrial Engineering, 156:107236, 2021.
Xiong Liu, Cheng Shi, Uday Deore, Yingbo Wang, Myah Tran, Iya Khalil, and Murthy Devarakonda. A scalable ai approach for clinical trial cohort optimization. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 479–489. Springer, 2021.

Information extraction

Chunhua Weng, Xiaoying Wu, Zhihui Luo, Mary Regina Boland, Dimitri Theodoratos, and Stephen B Johnson. EliXR: an approach to eligibility criteria extraction and representation. Journal of the American Medical Informatics Association, 18(Supplement_1):i116–i124, 2011.
Chi Yuan, Patrick B Ryan, Casey Ta, Yixuan Guo, Ziran Li, Jill Hardin, Rupa Makadia, Peng Jin, Ning Shang, Tian Kang, et al. Criteria2Query: a natural language interface to clinical databases for cohort definition. Journal of the American Medical Informatics Association, 26(4):294–305, 2019.
Tian Kang, Shaodian Zhang, Youlan Tang, Gregory W Hruby, Alexander Rusanov, Noémie Elhadad, and Chunhua Weng. EliIE: An open-source information extraction system for clinical trial eligibility criteria. Journal of the American Medical Informatics Association, 24(6):1062–1071, 2017.
Ying Xiong, Weihua Peng, Qingcai Chen, Zhengxing Huang, and Buzhou Tang. A unified machine reading comprehension framework for cohort selection. IEEE Journal of Biomedical and Health Informatics, 26(1):379–387, 2021.

Zifeng

CCranney commented 1 year ago

Perfect, thank you! I'll close this issue to get it out of your issues feed, but may comment on it in the future with additional questions.

CCranney commented 1 year ago

Hi Zifeng,

I have a few follow-up questions. I looked over the papers you provided as well as your demo code for the patient trial matching portion of your program. I don't really have questions about the format for your clinical data - correct me if I'm wrong, but it looks like it was taken straight from ClinicalTrials.gov, where you then applied clinicalBERT to essentially index the inclusion and exclusion criteria to prepare it for matching with patient information.
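Before any encoder sees the criteria, the free-text eligibility block usually has to be split into individual inclusion and exclusion statements. The sketch below is one simple way to do that, assuming the text uses "Inclusion Criteria" and "Exclusion Criteria" header lines followed by bulleted or numbered items; the function name and the sample text are hypothetical, and this is not necessarily how PyTrial preprocesses its trial data.

```python
import re

# Matches a leading bullet ("-", "*", "•") or a list number such as "1." or "2)".
_BULLET = re.compile(r"^(?:[-*•]\s*|\d+[.)]\s+)")


def split_criteria(text):
    """Split an eligibility text block into inclusion and exclusion statements."""
    sections = {"inclusion": [], "exclusion": []}
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        lowered = line.lower()
        if lowered.startswith("inclusion criteria"):
            current = "inclusion"
            continue
        if lowered.startswith("exclusion criteria"):
            current = "exclusion"
            continue
        # Lines before the first header are ignored.
        if current is not None:
            sections[current].append(_BULLET.sub("", line))
    return sections


# Made-up sample text for illustration only.
sample = """Inclusion Criteria:

- Age 18 years or older
- Histologically confirmed breast cancer

Exclusion Criteria:

1. Pregnant or breastfeeding
"""

criteria = split_criteria(sample)
```

Real eligibility text is far less regular than this sample (nested bullets, missing headers, criteria wrapped across lines), so a rule-based split like this is only a starting point.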

As for the patient data, I looked at the load_mimic_ehr_sequence function in pytrial.data.demo_data, which you use to create your demo PatientData class in the Colab notebooks. It looks like you return the following values from that function (I describe them for personal note taking here):

load_mimic_ehr_sequence:
    *visit: A list of entries, one for each patient
            > Next Level: List of Visits for that patient
                    > Next Level: Each visit consists of a list of 3 integer lists, likely corresponding to 'order' below. What do these integers mean?
    *voc: unknown. Based on 'order' below.
    *order: A hard-coded list ['diag', 'prod', 'med'], likely diagnosis, procedures and medications.
    *mortality: Boolean for each patient. Whether or not they're alive?
    *feature: age, gender, and ethnicity of each patient.
    *n_num_feature: When numeric features end in the features (currently 1 - age alone)
    *cat_cardinalities: Has total number of categories for each categorical feature. 2 and 15 for gender and ethnicity, respectively.

In summary, my questions on this data are the following:

1. The demo data appears to have been taken from the MIMIC-III dataset, but I think it went through a preprocessing step that is not found in this repository (referring to the data found in 'PyTrial/demo_data/demo_patient_sequence/ehr/'). What was the preprocessing step?
2. You frequently use the three categories listed in the order list, such as to categorize values in a visit (order list being ['diag', 'prod', 'med']). Are these categories Diagnoses, Procedures and Medications, respectively?
3. What do the integers found in each visit signify? (Answering question 1 may answer this question. See "What do these integers mean?" in the above breakdown of the load_mimic_ehr_sequence function output.) Are they relatable to the eligibility criteria of a clinical trial?
4. Can there be no visits recorded for the program to run, or is that a hard requirement?
5. What is the voc category?
6. How are you testing this program?
7. Possibly in relation to question 5, but do you have the ability to obtain patient data for patients that participated in completed or terminated clinical trials (i.e., non-demo training data)? I'm reaching out to a few others to possibly obtain this data, but thought I'd ask you too.

As a final point, one of the primary changes I needed to make to run your demo code was to specify an mps device for BERT, as I have an M1 Mac and cannot seem to get cuda running appropriately. I forked the repository to investigate and implement this change, but haven't generalized the code to check for cuda first, then mps, then cpu. That may be an appropriate change in the future.
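The cuda, then mps, then cpu fallback described above could look roughly like the sketch below. The helper names are hypothetical and not part of PyTrial; the only PyTorch calls used are torch.cuda.is_available() and torch.backends.mps.is_available(), and the mps lookup is guarded because older PyTorch builds do not ship that backend.

```python
def pick_device_name(cuda_ok: bool, mps_ok: bool) -> str:
    """Return the preferred device name given which backends are usable."""
    if cuda_ok:
        return "cuda"
    if mps_ok:
        return "mps"
    return "cpu"


def detect_device_name() -> str:
    """Probe PyTorch for available backends, falling back to CPU."""
    try:
        import torch
    except ImportError:
        # Without PyTorch installed there is nothing to probe.
        return "cpu"
    cuda_ok = torch.cuda.is_available()
    # torch.backends.mps is missing on older PyTorch versions.
    mps_backend = getattr(torch.backends, "mps", None)
    mps_ok = mps_backend is not None and mps_backend.is_available()
    return pick_device_name(cuda_ok, mps_ok)


device_name = detect_device_name()
```

The returned string can be passed to torch.device(...) or to a model's .to(...) call wherever the device is currently hard-coded.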

Best,

Caleb

RyanWangZf commented 1 year ago

Hi Caleb,

1. The preprocessing code is not provided in pytrial. I used the code from SafeDrug (https://github.com/ycq091044/SafeDrug) with some edits.
2. Yes.
3. Each integer is the index of an event within its category ('diag', 'prod', or 'med'). The mapping should be available in the vocabulary provided in the demo data.
4. I think visits are required to run the code. Otherwise, we need to modify the model to disable the inputs and processing for visit data with if/else logic.
5. Please see point 3.
6. We do not have structured unit tests yet. Testing is done by writing an example for each model, e.g., like the ones in https://colab.research.google.com/drive/12JK9DCyHvMZuylgWZ6cDDLx8JC6sMait?usp=sharing
7. Some data is available in Project Data Sphere (https://www.projectdatasphere.org/). But it is still challenging to get the patient properties that decide whether they are eligible or only partially eligible for the corresponding trials.
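To make point 3 concrete, here is a toy sketch of how the integers in a visit could be mapped back to event tokens. The vocabulary below is a plain dict of lists with made-up placeholder tokens; the layout of the real voc object in the demo data is not shown in this thread, so the lookup step is an assumption and may need adjusting.

```python
ORDER = ["diag", "prod", "med"]

# Toy vocabulary with placeholder tokens, not real codes from the demo data.
toy_voc = {
    "diag": ["DIAG_A", "DIAG_B", "DIAG_C"],
    "prod": ["PROC_A", "PROC_B"],
    "med": ["MED_A", "MED_B", "MED_C"],
}

# One patient with two visits; each visit holds three index lists in ORDER.
patient_visits = [
    [[0, 2], [1], [0, 1]],
    [[1], [], [2]],
]


def decode_visit(visit, voc, order=ORDER):
    """Map each category's integer indices back to its vocabulary tokens."""
    return {
        name: [voc[name][idx] for idx in indices]
        for name, indices in zip(order, visit)
    }


decoded = [decode_visit(visit, toy_voc) for visit in patient_visits]
```

An empty index list simply decodes to an empty list, which is different from a patient having no visits at all; the latter is the case that would need the if/else handling mentioned in point 4.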

Zifeng

RyanWangZf / PyTrial

trial_patient_match - Patient Data Clarification #2
