RyanWangZf / PyTrial

PyTrial: A Comprehensive Platform for Artificial Intelligence for Drug Development
https://pytrial.readthedocs.io/en/latest/
BSD 2-Clause "Simplified" License

trial_patient_match - Patient Data Clarification #2

Closed CCranney closed 1 year ago

CCranney commented 1 year ago

Hi Ryan,

I'm working on a project to streamline matching cancer patients who are interested in clinical trials to suitable trials that are recruiting (generally through ClinicalTrials.gov). It looks like this is the intent of your trial_patient_match models here, and I wanted to clarify a few things. In short, I want to build an app where a patient enters their personal information (age, gender, cancer type, stage, etc.) and gets back a list of potential clinical trials that are recruiting. This is difficult because the inclusion criteria in the ClinicalTrials.gov API output, which contain most of the finer details needed for patient-trial matching, are open text and therefore inconsistent. I'm trying to find a way around that barrier.

I played around with the Google Colab notebooks where you train trial_patient_match models on EHR and ClinicalTrials.gov data. It looks like the patient data primarily focuses on previous patient visits and demographic data. I'm not seeing a field for specific cancer type, or time since first diagnosis, cancer stage etc. Is it assumed that those fields are best indicated by the trial data as opposed to the patient data?

It looks to me like the trial_patient_match model is trained by matching patients to the trial they have already been accepted into. Is that correct? That makes sense to me, as it would provide a fairly robust list of patient-to-trial matches, but I am toying with the idea of adapting the model to accept survey data.

RyanWangZf commented 1 year ago

Hi, Thanks for your interest in pytrial!

The setup in this package follows the patient-trial matching literature, where EHR visit sequences and demographics are used to encode patient records into dense embeddings, which are then matched to trials.

However, it does not mean this information must always be used for matching. The implementation of PatientData contains a field x, which can be tabular data represented by a pd.DataFrame, so it offers the flexibility to include all kinds of patient properties if they can be organized as tables.
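A minimal sketch of what such a table could look like for survey-style input. The column names (`months_since_diagnosis`, `cancer_type`, `stage`) and all values are made up for illustration and are not fields that PyTrial defines. The rows are plain Python here; passing them to `pd.DataFrame(rows)` would give the tabular form mentioned above. Placing numeric columns first and integer-coding the categorical ones is an assumption about a convenient layout, with `n_num_feature` and `cat_cardinalities` named after the values discussed later in this thread.

```python
# Hypothetical survey records; none of these column names come from PyTrial.
rows = [
    {"age": 61, "months_since_diagnosis": 14, "gender": "F", "cancer_type": "breast", "stage": "II"},
    {"age": 47, "months_since_diagnosis": 3, "gender": "M", "cancer_type": "lung", "stage": "IV"},
    {"age": 70, "months_since_diagnosis": 30, "gender": "F", "cancer_type": "lung", "stage": "II"},
]

numeric_cols = ["age", "months_since_diagnosis"]
categorical_cols = ["gender", "cancer_type", "stage"]


def build_codebooks(records, columns):
    """Map each category value to an integer code, per column (sorted for stability)."""
    return {
        col: {value: code for code, value in enumerate(sorted({r[col] for r in records}))}
        for col in columns
    }


def encode(records, numeric, categorical, codebooks):
    """Return rows as flat lists: numeric values first, then categorical codes."""
    return [
        [r[c] for c in numeric] + [codebooks[c][r[c]] for c in categorical]
        for r in records
    ]


codebooks = build_codebooks(rows, categorical_cols)
table = encode(rows, numeric_cols, categorical_cols, codebooks)

n_num_feature = len(numeric_cols)  # index where the numeric columns end
cat_cardinalities = [len(codebooks[c]) for c in categorical_cols]
```

Whether the package expects exactly this column ordering is not confirmed in this thread, so it is worth checking against the PatientData source before relying on it.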

It is also doable to remove the sequence inputs for PatientData, but that will need an adaptation of the existing implementations. Unfortunately, we don't have an implementation that uses only the tabular data for matching. Nonetheless, that should be rather easy to implement, because all we need to do is disable the sequence inputs in the existing data class. I would suggest you clone the package to your local environment and try to subclass trial_patient_match.data.PatientData and the models in trial_patient_match.models to adjust the data format and the model inputs.
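As a rough sketch of the "disable the sequence inputs" idea, and not PyTrial code: a container where visits are optional, plus an input-building step that branches on whether they are present. `SurveyPatient` and `build_model_inputs` are hypothetical names; a real adaptation would subclass the package's own data and model classes as suggested above.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class SurveyPatient:
    """Hypothetical patient record: tabular features required, visits optional."""

    features: List[float]
    # visits -> categories -> integer event codes
    visits: Optional[List[List[List[int]]]] = None


def build_model_inputs(patient: SurveyPatient) -> Dict[str, object]:
    """Assemble inputs, skipping the sequence branch when no visits exist."""
    inputs: Dict[str, object] = {"x": patient.features}
    if patient.visits:  # None or an empty list both mean "no sequence data"
        inputs["visits"] = patient.visits
    return inputs


survey_only = SurveyPatient(features=[61.0, 0.0, 2.0])
with_history = SurveyPatient(features=[47.0, 1.0, 3.0], visits=[[[3, 7], [1], [12, 4]]])
```

The model side would need a matching branch, so that the sequence encoder is only called when the `visits` key is present.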

Zifeng

CCranney commented 1 year ago

Hi Zifeng,

Thank you so much! By any chance are there specific papers you are looking at in the patient trial matching literature that you could share here? It would help a great deal reading through the code to know what it is specifically based on.

Thank you for the subclass suggestion! I'll probably make a forked repository for that - it would be easier for me to develop, and ultimately can be pulled back here if you like the additions I make.

Caleb

RyanWangZf commented 1 year ago

We have two models implemented in PyTrial: https://pytrial.readthedocs.io/en/latest/pytrial.tasks.trial_patient_match.html

Some other relevant papers:

Dense embeddings

Information extraction

Zifeng

CCranney commented 1 year ago

Perfect, thank you! I'll close this issue to get it out of your issues feed, but may comment on it in the future with additional questions.

CCranney commented 1 year ago

Hi Zifeng,

I have a few follow-up questions. I looked over the papers you provided as well as your demo code for the patient trial matching portion of your program. I don't really have questions about the format for your clinical data - correct me if I'm wrong, but it looks like it was taken straight from ClinicalTrials.gov, where you then applied clinicalBERT to essentially index the inclusion and exclusion criteria to prepare it for matching with patient information.

As for the patient data, I looked at the load_mimic_ehr_sequence function in pytrial.data.demo_data, which you use to create your demo PatientData class in the Colab notebooks. It looks like you return the following values from that function (I describe them for personal note taking here):

load_mimic_ehr_sequence:
    *visit: A list of entries, one for each patient
            > Next Level: List of Visits for that patient
                    > Next Level: Each visit consists of a list of 3 integer lists, likely corresponding to 'order' below. What do these integers mean?
    *voc: unknown. Based on 'order' below.
    *order: A hard-coded list ['diag', 'prod', 'med'], likely diagnosis, procedures and medications.
    *mortality: Boolean for each patient. Whether or not they're alive?
    *feature: age, gender, and ethnicity of each patient.
    *n_num_feature: When numeric features end in the features (currently 1 - age alone)
    *cat_cardinalities: Has total number of categories for each categorical feature. 2 and 15 for gender and ethnicity, respectively.

In summary, my questions on this data are the following:

  1. The demo data appears to have been taken from the MIMIC-III dataset, but I think it went through a preprocessing step that is not found in this repository (referring to the data found in 'PyTrial/demo_data/demo_patient_sequence/ehr/'). What was the preprocessing step?
  2. You frequently use the three categories listed in the order list, such as to categorize values in a visit (order list being ['diag', 'prod', 'med']). Are these categories Diagnoses, Procedures and Medications, respectively?
  3. What do the integers found in each visit signify? (Answering question 1 may answer this question. See "What do these integers mean?" in the above breakdown of the load_mimic_ehr_sequence function output.) Are they relatable to the eligibility criteria of a clinical trial?
  4. Can there be no visits recorded for the program to run, or is that a hard requirement?
  5. What is the voc category?
  6. How are you testing this program?
  7. Possibly in relation to question 5, but do you have the ability to obtain patient data for patients who participated in completed or terminated clinical trials (i.e., non-demo training data)? I'm reaching out to a few others to possibly obtain this data, but thought I'd ask you too.

As a final point, one of the primary changes I needed to make to run your demo code was to specify an mps device for BERT, as I have an M1 Mac and cannot seem to get cuda running appropriately. I forked the repository to investigate and implement this change, but haven't generalized the code to check for cuda first, then mps, then cpu. That may be an appropriate change in the future.
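The cuda-then-mps-then-cpu fallback described here could be sketched as follows. The helper names `pick_device` and `detect_device` are hypothetical and not part of PyTrial; the only PyTorch calls used are `torch.cuda.is_available()` and `torch.backends.mps.is_available()`, and the `getattr` guard assumes that older PyTorch builds may lack the `mps` backend module.

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Return the preferred device name: cuda first, then mps, then cpu."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def detect_device() -> str:
    """Query PyTorch for available backends; fall back to cpu if torch is missing."""
    try:
        import torch
    except ImportError:
        return "cpu"
    # torch.backends.mps may not exist on older PyTorch versions.
    mps_backend = getattr(torch.backends, "mps", None)
    mps_ok = mps_backend is not None and mps_backend.is_available()
    return pick_device(torch.cuda.is_available(), mps_ok)


device = detect_device()
```

The resulting string can be passed to `model.to(device)` wherever the device is currently hard-coded.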

Best,

Caleb

RyanWangZf commented 1 year ago

Hi Caleb,

  1. the preprocessing code is not provided in pytrial. I used the code from SafeDrug https://github.com/ycq091044/SafeDrug with some edits.

  2. yes.

  3. each integer is the index of an event within 'diag', 'prod', or 'med'. The mapping should be available in the vocabulary provided with the demo data.

  4. I think visits are required to run the code. Otherwise, we need to modify the model to disable the inputs and processing for visit data, with if/else logic.

  5. Please see point 3.

  6. We do not have structured unit tests yet. Testing is done by writing an example for each model, like the ones in https://colab.research.google.com/drive/12JK9DCyHvMZuylgWZ6cDDLx8JC6sMait?usp=sharing

  7. some data is available in Project Data Sphere (https://www.projectdatasphere.org/). But it is still challenging to get the patient properties that determine whether they are eligible or only partially eligible for the corresponding trials.

Zifeng
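To make the answer to point 3 concrete, here is a small sketch of mapping visit integers back to vocabulary entries. The vocabularies and codes below are invented placeholders, and representing `voc` as a dict of lists is an assumption made for illustration; the actual vocabulary object shipped with the demo data may be structured differently.

```python
order = ["diag", "prod", "med"]  # category order quoted in this thread

# Made-up vocabularies: the position in each list is the integer index.
voc = {
    "diag": ["diag_code_a", "diag_code_b", "diag_code_c"],
    "prod": ["proc_code_a", "proc_code_b"],
    "med": ["med_code_a", "med_code_b", "med_code_c"],
}

# One patient with two visits; each visit holds one integer list per category.
patient_visits = [
    [[0, 2], [1], [0]],
    [[1], [], [1, 2]],
]


def decode_visit(visit, order, voc):
    """Translate one visit's integer lists back to vocabulary entries."""
    return {
        category: [voc[category][i] for i in indices]
        for category, indices in zip(order, visit)
    }


decoded = [decode_visit(v, order, voc) for v in patient_visits]
```

Under these assumptions, the first visit decodes to two diagnosis entries, one procedure entry, and one medication entry.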