data2health / DREAM-Challenge

EHR DREAM Challenge

Provide challenge data to Sage to enable development and testing of the submission infrastructure #8

Closed tschaffter closed 5 years ago

tschaffter commented 5 years ago

Background: Sage is developing the IT infrastructure responsible for:

  1. Pulling participant submissions from Synapse (Docker images)
  2. Running the submissions (training, inference) and pushing results to Synapse

Task: Provide Sage with the following components to enable the development and testing of the IT infrastructure for the EHR Challenge:

According to Tom, we could deploy and test an initial version of the IT infrastructure on Sage AWS instances in 1-2 days once we have received the above components.

tschaffter commented 5 years ago

@trberg @yy6linda Do we know when the above required elements will be ready for us to test the IT infrastructure?

trberg commented 5 years ago

@tschaffter So @yy6linda has submitted quite a few models into the evaluation pipeline that could be used. The "gold standard" data, I thought, was in the synthetic dataset. I'm confused about what this means. Do you need an "answer file"? Like the patient list with 0 and 1 for mortality status? By scoring script, do you just mean taking the predictions and comparing them to the gold standard answers?

yy6linda commented 5 years ago

Do I need to resubmit Docker images for the IT infrastructure test?

trberg commented 5 years ago

@tschaffter I've uploaded a newer version of the synpuf dataset. I've split this data into a training and a validation set and created the "gold standard" file with patient IDs and mortality status (death within 6 months after the end date of the validation set).
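For concreteness, the gold-standard construction described here (a 0/1 status per patient for death within 6 months of the validation end date) could be sketched as below. The column names, dates, and the 183-day approximation of 6 months are made up for illustration; this is not the actual synpuf processing.

```python
import csv
import io
from datetime import date, timedelta

# Hypothetical inputs: the validation-set end date and a death table
# (person_id -> death date), mirroring the idea of OMOP's death.csv.
validation_end = date(2010, 1, 1)
death_dates = {1: date(2010, 3, 15), 2: date(2011, 6, 1)}  # made-up data
all_persons = [1, 2, 3]

window_end = validation_end + timedelta(days=183)  # roughly 6 months

rows = []
for pid in all_persons:
    died = death_dates.get(pid)
    # status = 1 only if the death falls inside the 6-month window
    status = 1 if died is not None and validation_end <= died <= window_end else 0
    rows.append({"person_id": pid, "status": status})

# Write the two-column gold-standard file (in memory here).
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["person_id", "status"])
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue().strip())
```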

trberg commented 5 years ago

@tschaffter The scoring script is just going to be a comparison between the model output from @yy6linda, which is written to /data/predictions/ inside the Docker container, and the /evaluate/evaluation_patient_status.csv file. Do you need that written?

tschaffter commented 5 years ago

@trberg Where is the gold standard for the new synpuf validation set? Will this file have two columns: 1) person_id and 2) a 0/1 indicator of whether the person died within 6 months?

Here is the structure of the new Synpuf data that I see:

Thomass-MacBook-Pro:data tschaffter$ unzip -e synpuf_train_validate_evaluate.zip 
Archive:  synpuf_train_validate_evaluate.zip
   creating: synpuf_clean/
  inflating: synpuf_clean/.DS_Store  
   creating: __MACOSX/
   creating: __MACOSX/synpuf_clean/
  inflating: __MACOSX/synpuf_clean/._.DS_Store  
   creating: synpuf_clean/evaluate/
  inflating: synpuf_clean/evaluate/evaluation_patient_status.csv  
   creating: synpuf_clean/train/
  inflating: synpuf_clean/train/observation_period.csv  
  inflating: synpuf_clean/train/drug_exposure.csv  
  inflating: synpuf_clean/train/death.csv  
  inflating: synpuf_clean/train/measurement.csv  
  inflating: synpuf_clean/train/condition_occurrence.csv  
  inflating: synpuf_clean/train/visit_occurrence.csv  
  inflating: synpuf_clean/train/person.csv  
  inflating: synpuf_clean/train/observation.csv  
  inflating: synpuf_clean/train/procedure_occurrence.csv  
  inflating: synpuf_clean/visit_occurrence.csv  
   creating: synpuf_clean/validation/
  inflating: synpuf_clean/validation/observation_period.csv  
  inflating: synpuf_clean/validation/drug_exposure.csv  
  inflating: synpuf_clean/validation/death.csv  
  inflating: synpuf_clean/validation/measurement.csv  
  inflating: synpuf_clean/validation/condition_occurrence.csv  
  inflating: synpuf_clean/validation/visit_occurrence.csv  
  inflating: synpuf_clean/validation/person.csv  
  inflating: synpuf_clean/validation/observation.csv  
  inflating: synpuf_clean/validation/procedure_occurrence.csv

> @tschaffter The scoring script is just going to be a comparison between the model output from @yy6linda, which is written to /data/predictions/ inside the Docker container, and the /evaluate/evaluation_patient_status.csv file. Do you need that written?

Yes. Please also describe how to run this script and what the expected output is.

@yy6linda Can you give Tom and me:

If you already have such an image, please provide the docker command required to run the containers.

trberg commented 5 years ago

@tschaffter the gold standard is synpuf_clean/evaluate/evaluation_patient_status.csv

@yy6linda can you create a simple script to generate an AUC from that file and the file your model outputs? I'm not in a position to do that at the moment.

yy6linda commented 5 years ago

@trberg No worries! I will take care of it.
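A minimal sketch of such a scoring script could look like the following. It assumes a two-column gold standard (person_id, 0/1 status) and a prediction file with a risk score per person_id; both the column names and the sample data are assumptions. The AUC is computed with the rank-based Mann-Whitney formulation just to keep the example dependency-free; the real script could equally use sklearn.metrics.roc_auc_score.

```python
def auc(labels, scores):
    """Rank-based AUC: (rank sum of positives - n_pos*(n_pos+1)/2) / (n_pos*n_neg).
    Tied scores receive the average rank of their tie group."""
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    ranks = [0.0] * n
    i = 0
    while i < n:
        # find the extent of the tie group starting at i
        j = i
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        avg = (i + j) / 2 + 1  # 1-based average rank for the group
        for k in range(i, j + 1):
            ranks[k] = avg
        i = j + 1
    pos_rank_sum = sum(r for r, (_, lab) in zip(ranks, pairs) if lab == 1)
    n_pos = sum(1 for _, lab in pairs if lab == 1)
    n_neg = n - n_pos
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Made-up gold standard and predictions, keyed by person_id.
gold = {"p1": 1, "p2": 0, "p3": 0, "p4": 1}
pred = {"p1": 0.9, "p2": 0.2, "p3": 0.4, "p4": 0.8}

ids = sorted(gold)
score = auc([gold[i] for i in ids], [pred[i] for i in ids])
print(round(score, 3))  # prints 1.0 (perfect separation in this toy data)
```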

yy6linda commented 5 years ago

@tschaffter I just uploaded a new Docker image to the EHR staging platform. The name of the image is keras_0325:v0.1. This image contains two Python scripts:

train.py: extracts features from the omop train folder and trains a neural network model based on the selected features.

infer.py: applies the model to the validation set and outputs the 3-month mortality risk for patients in the validation set.

To run the image, first mount four folders into the container (omop, prediction, model, data), then run train.sh and infer.sh using the docker commands below:

Step 1.

    docker run \
      --mount type=bind,source="$(pwd)"/omop,target=/app/omop \
      --mount type=bind,source="$(pwd)"/data,target=/app/data \
      --mount type=bind,source="$(pwd)"/prediction,target=/app/prediction \
      --mount type=bind,source="$(pwd)"/model,target=/app/model \
      keras_0325:v0.1 bash "/app/train.sh"

Step 2.

    docker run \
      --mount type=bind,source="$(pwd)"/omop,target=/app/omop \
      --mount type=bind,source="$(pwd)"/data,target=/app/data \
      --mount type=bind,source="$(pwd)"/prediction,target=/app/prediction \
      --mount type=bind,source="$(pwd)"/model,target=/app/model \
      keras_0325:v0.1 bash "/app/infer.sh"

The model (.h5 file) can be found in the model folder, and the output CSV file is in the prediction folder.

Please let me know if you have any questions.

yy6linda commented 5 years ago

@tschaffter Please use the updated commands below.

Step 1.

    docker run \
      --mount type=bind,source="$(pwd)"/omop,target=/app/omop \
      --mount type=bind,source="$(pwd)"/data,target=/app/data \
      --mount type=bind,source="$(pwd)"/prediction,target=/app/prediction \
      --mount type=bind,source="$(pwd)"/model,target=/app/model \
      docker.synapse.org/syn18405992/keras_0325:v0.1 bash "/app/train.sh"

Step 2.

    docker run \
      --mount type=bind,source="$(pwd)"/omop,target=/app/omop \
      --mount type=bind,source="$(pwd)"/data,target=/app/data \
      --mount type=bind,source="$(pwd)"/prediction,target=/app/prediction \
      --mount type=bind,source="$(pwd)"/model,target=/app/model \
      docker.synapse.org/syn18405992/keras_0325:v0.1 bash "/app/infer.sh"

tschaffter commented 5 years ago

Note: Waiting for review from Yao

  1. Download synpuf_train_validate_evaluate.zip and extract

    synapse get syn18460049
  2. Run the train image

    docker run -v /synpuf_clean/train:/train:ro \
      -v /scratch:/scratch:rw \
      -v /model:/model:rw \
      docker.synapse.org/syn18405992/keras_0326:v0.1 bash "/app/train.sh"
  3. Run the inference image

    docker run -v /synpuf_clean/validation:/infer:ro \
      -v /scratch:/scratch:rw \
      -v /output:/output:rw \
      -v /model:/model:ro \
      docker.synapse.org/syn18405992/keras_0326:v0.1 bash "/app/infer.sh"
  4. Score the predictions

    • Download the scoring script
      synapse get syn18475613
    • Run the scoring script
      python ehr_scoring.py --goldstandard <file> --predictions <file>
  5. Upload score to Synapse

Good practice: processes in containers should not run as root.
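The scoring invocation in step 4 could follow a skeleton along these lines. This is hypothetical: the column names ("status", "score") and the in-memory sample CSVs are stand-ins, and the actual syn18475613 script may be organized quite differently. Only the flag names come from the documented command.

```python
import argparse
import csv
import io

def parse_args(argv=None):
    # Mirrors the documented invocation:
    #   python ehr_scoring.py --goldstandard <file> --predictions <file>
    p = argparse.ArgumentParser(description="Score EHR mortality predictions")
    p.add_argument("--goldstandard", required=True)
    p.add_argument("--predictions", required=True)
    return p.parse_args(argv)

def load_csv(fileobj, value_col):
    """Read a CSV keyed by person_id; value_col is 'status' or 'score' (assumed names)."""
    reader = csv.DictReader(fileobj)
    return {row["person_id"]: float(row[value_col]) for row in reader}

# Simulated command line; the real script would read sys.argv.
args = parse_args(["--goldstandard", "gold.csv", "--predictions", "pred.csv"])

# The real script would use open(args.goldstandard) etc.; in-memory stand-ins here.
gold = load_csv(io.StringIO("person_id,status\n1,1\n2,0\n"), "status")
pred = load_csv(io.StringIO("person_id,score\n1,0.8\n2,0.3\n"), "score")

# Every gold-standard patient should have a prediction before scoring.
missing = set(gold) - set(pred)
assert not missing, f"missing predictions for: {missing}"
print(f"scored {len(gold)} patients")  # the AUC computation would follow here
```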

yy6linda commented 5 years ago

@tschaffter I modified the Docker image according to the steps you mentioned above and just uploaded the modified image (docker.synapse.org/syn18405992/keras_0326:v0.1) to EHR Challenge - staging.

tschaffter commented 5 years ago

@thomasyu888 My previous post compiles the components required to put the challenge workflow hook in place. Do you have bandwidth to start on it? Thanks!

trberg commented 5 years ago

@tschaffter can we close this issue?