ZhiGroup / Med-BERT

Med-BERT, contextualized embedding model for structured EHR data
Apache License 2.0

Understanding the pretraining code #7

Closed Angeela03 closed 2 years ago

Angeela03 commented 2 years ago

Hi, I am trying to replicate this code on a private dataset. I ran this step: "python create_BERTpretrain_EHRfeatures.py --input_file= --output_file='output_file' --vocab_file= --max_predictions_per_seq=1 --max_seq_length=64" to get tf features for each of the training, test and validation samples, and obtained the desired outputs.

However, I am having trouble understanding the next step: "python run_EHRpretraining.py --input_file='output_file' --output_dir= --do_train=True --do_eval=True --bert_config_file=config.json --train_batch_size=32 --max_seq_length=512 --max_predictions_per_seq=1 --num_train_steps=4500000 --num_warmup_steps=10000 --learning_rate=5e-5". Once you have the patient features from the previous step, how do you pass all of the training, test and validation features in this step? What does the input_file look like?

lrasmy commented 2 years ago

Hi Angeela,

For the pretraining step, you just feed in the tf features of the training set that you got from the previous step you ran successfully.

In other words, the input_file for run_EHRpretraining.py is the output file from create_BERTpretrain_EHRfeatures.py for the training set only.
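For illustration, a minimal sketch of how the two steps connect; the preprocessed training-set path, the file names train_EHRfeatures.tf and vocab.txt, and the directory pretraining_output are hypothetical placeholders, not names from the repo:

python create_BERTpretrain_EHRfeatures.py --input_file=<preprocessed_training_set> --output_file=train_EHRfeatures.tf --vocab_file=vocab.txt --max_predictions_per_seq=1 --max_seq_length=64
python run_EHRpretraining.py --input_file=train_EHRfeatures.tf --output_dir=pretraining_output --do_train=True --do_eval=True --bert_config_file=config.json --train_batch_size=32 --max_seq_length=512 --max_predictions_per_seq=1 --num_train_steps=4500000 --num_warmup_steps=10000 --learning_rate=5e-5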

Thanks

Angeela03 commented 2 years ago

Hi lrasmy,

If you are only using the training set, what is the purpose of separating the data into validation and test sets in this step?

lrasmy commented 2 years ago

Hi Angeela.

Those sets are mostly held out for evaluation purposes.

More specifically, the test sets we reported results against in the downstream tasks are drawn from the patients held out as part of those sets.

Also, you can run the same run_EHRpretraining line, setting --input_file to the validation set, --do_train=False (to turn off training), and --do_eval=True, to evaluate the latest checkpoint on the validation set during the pretraining phase. But that runs separately and is not an integrated part of the pretraining loop.
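As a minimal sketch of such an evaluation-only run, assuming the validation tf features were written to valid_EHRfeatures.tf (a hypothetical placeholder) and --output_dir points at the same directory that holds the pretraining checkpoints:

python run_EHRpretraining.py --input_file=valid_EHRfeatures.tf --output_dir=pretraining_output --do_train=False --do_eval=True --bert_config_file=config.json --max_seq_length=512 --max_predictions_per_seq=1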

This code version mostly followed the same pre-training strategy as the original BERT code at https://github.com/google-research/bert/blob/master/run_pretraining.py, with a few tweaks.

Thanks, Laila

shamoons commented 2 years ago

Where do we get the vocab file?

Angeela03 commented 2 years ago

Hi Laila,

Thank you for the clarification.

-Angeela

Angeela03 commented 2 years ago

@shamoons If you don't have a vocab file, you can create one by specifying NA for the vocab argument in the following command: python preprocess_pretrain_data.py <data_File> <vocab/NA> <output_Prefix> <subset_size>

shamoons commented 2 years ago

@shamoons If you don't have a vocab file, you can create one by specifying NA for the vocab argument in the following command: python preprocess_pretrain_data.py <data_File> <vocab/NA> <output_Prefix> <subset_size>

Hmmmm, so for the data file, the tsv example will suffice? What about output_Prefix and subset_size?

lrasmy commented 2 years ago

@shamoons

True:
data_File: the path for the tsv file
vocab: as @Angeela03 said, you can use NA to create a new one
output_Prefix: just a naming prefix that you use to name your output files
subset_size: if you only need to preprocess a subset of the data, specify the subset size here (that should be the number of patients to be included); if you set it to 0, it will include all data

Also, please check the header in preprocess_pretrain_data.py, where you can find more details/examples.

Thanks
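For example, assuming the positional argument order data_File vocab output_Prefix subset_size described above, a hypothetical invocation that reads my_ehr_data.tsv (a placeholder name), creates a new vocab via NA, prefixes the output files with my_pretrain, and includes all patients would look like:

python preprocess_pretrain_data.py my_ehr_data.tsv NA my_pretrain 0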