ZhiGroup / Med-BERT

Med-BERT, contextualized embedding model for structured EHR data
Apache License 2.0

run_EHRpretraining.py error #13

Closed sceliay closed 1 year ago

sceliay commented 1 year ago

When I run 'run_EHRpretraining.py' on the example data 'data_file.tsv', the following error occurs: 'Invalid argument: {{function_node __inference_tf_data_experimental_map_andbatch_83}} Key: input_ids. Can't parse serialized Example'

I worked around this by adding a default_value, like "input_ids": tf.FixedLenFeature([max_seq_length], tf.int64, default_value=[0]*max_seq_length),

but then this error occurs: TypeError: Failed to convert object of type <class 'tensorflow.python.framework.sparse_tensor.SparseTensor'> to Tensor. Contents: SparseTensor(indices=Tensor("DeserializeSparse:0", shape=(?, 2), dtype=int64), values=Tensor("DeserializeSparse:1", shape=(?,), dtype=int32), dense_shape=Tensor("DeserializeSparse:2", shape=(2,), dtype=int64)). Consider casting elements to a supported type.

So, how can I solve this?
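(For context, a minimal sketch of the kind of parsing spec involved, assuming run_EHRpretraining.py follows the decoding logic of the original BERT run_pretraining.py; only "input_ids" and tf.FixedLenFeature come from this thread, the rest is illustrative. With a fixed-length spec, "Can't parse serialized Example" typically means the max_seq_length passed to the script does not match the sequence length the TFRecords were written with.)

```python
import tensorflow as tf  # TF 1.x API, as used by run_EHRpretraining.py

# Illustrative value: it must equal the sequence length used when the
# TFRecords were written by create_BERTpretrain_EHRfeatures.py.
max_seq_length = 64

# A fixed-length spec fails with "Can't parse serialized Example" whenever
# the stored list for a key is shorter or longer than the declared length.
name_to_features = {
    "input_ids": tf.FixedLenFeature([max_seq_length], tf.int64),
}

def decode_record(record):
    example = tf.parse_single_example(record, name_to_features)
    # If a feature were declared with tf.VarLenFeature instead, parsing would
    # return a SparseTensor, which must be densified before use, e.g.
    #   dense = tf.sparse.to_dense(sparse_tensor, default_value=0)
    # otherwise downstream ops raise the "Failed to convert ... SparseTensor"
    # TypeError shown above.
    return example
```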

hikf3 commented 1 year ago

Hi there! Were you able to solve this issue?

lrasmy commented 1 year ago

I'm not sure if I get the issue correctly.

The example data file is just an illustration of how you need to organize your data; it needs to be further preprocessed using the steps described in the pretraining code readme: https://github.com/ZhiGroup/Med-BERT/tree/master/Pretraining%20Code

The final output from create_BERTpretrain_EHRfeatures.py is what can be used by run_EHRpretraining.py.
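(If it helps with debugging mismatches like the one in this issue, here is a hedged sketch, not taken from the repo, that prints the feature keys and stored lengths in the TFRecord output of create_BERTpretrain_EHRfeatures.py, so they can be compared against the max_seq_length passed to run_EHRpretraining.py. The file name is a placeholder, and the features are assumed to be int64 lists.)

```python
import tensorflow as tf  # TF 1.x API

# Placeholder path: point it at the output of create_BERTpretrain_EHRfeatures.py.
record_path = "pretrain_features.tfrecord"

# Read the first serialized Example and print each feature key with its length,
# so the stored length of input_ids can be compared against --max_seq_length.
for serialized in tf.python_io.tf_record_iterator(record_path):
    example = tf.train.Example.FromString(serialized)
    for key, feature in example.features.feature.items():
        values = feature.int64_list.value
        print(key, "length:", len(values))
    break
```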

Please let me know if I have misunderstood your issue or if you have any further questions.

Thanks, Laila

hikf3 commented 1 year ago

Hi Laila,

Since I was trying to run the code on the example dataset you provided in the repo, I had to change max_seq_length to 16. After that, no errors appeared.

However, I had one question regarding the splitting of the pretraining data. Since you split the data into train, validation, and test sets for pretraining as well, when did you use the validation and test sets? Do you change to do_train=False and do_eval=True when running run_EHRpretraining.py on them?

Sorry if I have not understood it correctly.

Thanks for replying.

lrasmy commented 1 year ago

@hikf3

Those sets are mostly held out for evaluation purposes.

More specifically, the test sets we reported results against (in the downstream tasks) are drawn from the patients held out as part of those sets.

Also, you can run the same run_EHRpretraining line, setting --input_file to the validation set, --do_train=False (to turn off training), and --do_eval=True, to evaluate the latest checkpoint on the validation set during the pretraining phase. But that runs separately and is not an integrated part of the pretraining loop.

This code version mostly followed the same pretraining strategy as the original BERT code at https://github.com/google-research/bert/blob/master/run_pretraining.py, with a few tweaks.
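(For reference, in that original BERT run_pretraining.py the two flags simply gate separate Estimator calls, roughly as in the simplified excerpt below; the flag and builder names follow the Google BERT script and have not been verified against run_EHRpretraining.py.)

```python
# Simplified excerpt in the style of google-research/bert run_pretraining.py
# (TF 1.x Estimator API); FLAGS, estimator, input_files, and input_fn_builder
# are defined earlier in that script.
if FLAGS.do_train:
    train_input_fn = input_fn_builder(
        input_files=input_files,
        max_seq_length=FLAGS.max_seq_length,
        max_predictions_per_seq=FLAGS.max_predictions_per_seq,
        is_training=True)
    estimator.train(input_fn=train_input_fn, max_steps=FLAGS.num_train_steps)

if FLAGS.do_eval:
    # With --do_train=False --do_eval=True and --input_file set to the
    # validation TFRecords, only this branch runs: it restores the latest
    # checkpoint from the output directory and reports evaluation metrics.
    eval_input_fn = input_fn_builder(
        input_files=input_files,
        max_seq_length=FLAGS.max_seq_length,
        max_predictions_per_seq=FLAGS.max_predictions_per_seq,
        is_training=False)
    result = estimator.evaluate(input_fn=eval_input_fn, steps=FLAGS.max_eval_steps)
```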