Closed Angeela03 closed 2 years ago
Hi Angeela,
For the pretraining step, you just feed in the tf features of the training set that you got from the previous step, which you ran successfully.
In other words, the input_file for the run_EHRpretraining.py is the output file from create_BERTpretrain_EHRfeatures.py for the training set only.
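For example, a pretraining run pointed only at the training-set features might look like this (the flags mirror the command quoted later in this thread; train_features and pretrain_out are placeholders for your own file names, not the repo's defaults):

```shell
# Pretrain only on the training-set tf features from create_BERTpretrain_EHRfeatures.py.
# 'train_features' is whatever --output_file you used in that step.
python run_EHRpretraining.py \
  --input_file=train_features \
  --output_dir=pretrain_out \
  --do_train=True \
  --do_eval=True \
  --bert_config_file=config.json \
  --train_batch_size=32 \
  --max_seq_length=64 \
  --max_predictions_per_seq=1 \
  --num_train_steps=4500000 \
  --num_warmup_steps=10000 \
  --learning_rate=5e-5
```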
Thanks
Hi lrasmy,
If you are only using the training set, what is the purpose of separating the data into validation and test sets in this step?
Hi Angeela.
Those sets are mostly held out for evaluation purposes.
More specifically, the test sets we reported the results against (in the downstream tasks) are from the patients held out as a part of those sets.
Also, you can run the same run_EHRpretraining line, setting --input_file to the valid set, --do_train=False (to turn off training), and --do_eval=True, to evaluate the latest checkpoints on the validation set during the pretraining phase. However, that runs separately and is not an integrated part of the pretraining loop.
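Concretely, that evaluation-only run would look something like this (file names are placeholders; the output_dir should point at the checkpoints from your training run):

```shell
# Evaluate the latest checkpoint on the validation-set features; no training.
python run_EHRpretraining.py \
  --input_file=valid_features \
  --output_dir=pretrain_out \
  --do_train=False \
  --do_eval=True \
  --bert_config_file=config.json
```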
This code version mostly follows the same pre-training strategy as the original BERT code at https://github.com/google-research/bert/blob/master/run_pretraining.py, with a few tweaks.
Thanks, Laila
Where do we get the vocab file?
Hi Laila,
Thank you for the clarification.
-Angeela
@shamoons If you don't have a vocab file, you can create one by specifying NA for the vocab argument in the following command: python preprocess_pretrain_data.py <vocab/NA>
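In case it helps to see what a code-to-index vocabulary amounts to, here is a minimal sketch. This is purely illustrative: build_vocab and the nested visit layout are assumptions for the example, not the actual implementation inside preprocess_pretrain_data.py.

```python
# Minimal illustration of building a code vocabulary from visit data.
# NOTE: build_vocab and the data layout are hypothetical; the real
# preprocess_pretrain_data.py builds its own vocab when you pass NA.
def build_vocab(patient_visits):
    """Map each distinct medical code to an integer id, in order of first appearance."""
    vocab = {}
    for visits in patient_visits:
        for visit in visits:
            for code in visit:
                if code not in vocab:
                    vocab[code] = len(vocab)
    return vocab

patients = [
    [["D250", "D401"], ["D250"]],   # patient 1: two visits
    [["D401", "E785"]],             # patient 2: one visit
]
vocab = build_vocab(patients)
print(vocab)  # {'D250': 0, 'D401': 1, 'E785': 2}
```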
Hmmmm, so for the data file, the tsv example will suffice? What about output_Prefix and subset_size?
@shamoons
@shamoons
True, the tsv example will suffice.
data_File: the path for the tsv file.
vocab: as @Angeela03 said, you can use NA to create a new one.
output_Prefix: just a naming prefix that you use to name your output files.
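Putting those pieces together, an invocation might look like the line below. Note that the argument order and values here are assumptions based on the parameter names discussed above, not verified against the script itself:

```shell
# data file (tsv), vocab file (or NA to build one), output prefix, subset size
python preprocess_pretrain_data.py my_data.tsv NA my_prefix 100
```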
Hi, I am trying to replicate this code on a private dataset. I ran this step:

python create_BERTpretrain_EHRfeatures.py --input_file= --output_file='output_file' --vocab_file= --max_predictions_per_seq=1 --max_seq_length=64

to get tf features for each of the training, test and validation samples, and obtained the desired outputs. However, I am having problems understanding the next step:

python run_EHRpretraining.py --input_file='output_file' --output_dir= --do_train=True --do_eval=True --bert_config_file=config.json --train_batch_size=32 --max_seq_length=512 --max_predictions_per_seq=1 --num_train_steps=4500000 --num_warmup_steps=10000 --learning_rate=5e-5

Once you get the patient features from the previous step, how do you pass all of the training, test and validation features in this step? What does the input_file look like?