Closed adiv5 closed 3 years ago
Do you want to extract fixed feature vectors instead of fine-tuning the model? If so, I recommend checking out the extract_features.py file in Google's BERT repo to get a sense of how to do this. Also check the documentation on their GitHub page under the heading "Using BERT to extract fixed feature vectors (like ELMo)". To use clinicalBERT, just replace the bert_config_file and init_checkpoint with the clinicalBERT versions.
I will try your suggested way and let you know if it works. For now, closing the issue.
Hi @EmilyAlsentzer ,
I tried to extract features as you suggested but ran into a problem. When I run the original BERT example below, everything works fine.
echo 'Who was Jim Henson ? ||| Jim Henson was a puppeteer' > /tmp/input.txt
python extract_features.py \
--input_file=/tmp/input.txt \
--output_file=/tmp/output.jsonl \
--vocab_file=$BERT_BASE_DIR/vocab.txt \
--bert_config_file=$BERT_BASE_DIR/bert_config.json \
--init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
--layers=-1,-2,-3,-4 \
--max_seq_length=128 \
--batch_size=8
I changed the bert_config_file and init_checkpoint parts and ran the command below.
python extract_features.py \
--input_file=/tmp/input.txt \
--output_file=/tmp/output.jsonl \
--vocab_file=bert_pretrain_output_all_notes_150000/vocab.txt \
--bert_config_file=bert_pretrain_output_all_notes_150000/bert_config.json \
--init_checkpoint=bert_pretrain_output_all_notes_150000/model.ckpt \
--layers=-1,-2,-3,-4 \
--max_seq_length=128
I got the error message below. I think the problem is with the init_checkpoint part; I tried different names like "model.ckpt" and "model.ckpt-150000", but none of them worked.
tensorflow.python.framework.errors_impl.NotFoundError: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for bert_pretrain_output_all_notes_150000/model.ckpt
So could you please help me run ClinicalBERT to extract features from clinical notes?
Thanks in advance.
Hello,
Did you figure out your error by any chance? There are three checkpoint files, and I am not sure which one to use. Could you please guide me if you can? Thank you!
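For anyone else hitting this: a TF1 checkpoint is saved as several files sharing one prefix (e.g. model.ckpt-150000.index, model.ckpt-150000.data-00000-of-00001, model.ckpt-150000.meta), and --init_checkpoint expects that shared prefix, not any single file. A small sketch (the helper name is mine, not from the repo) that derives the prefix from the .index file:

```python
import glob
import os

def find_checkpoint_prefix(model_dir):
    """Return the checkpoint prefix TensorFlow expects
    (e.g. .../model.ckpt-150000) by stripping the .index suffix,
    or None if the directory holds no checkpoint."""
    index_files = sorted(glob.glob(os.path.join(model_dir, "*.index")))
    if not index_files:
        return None
    # TF wants the shared prefix, which is not itself a file on disk.
    return index_files[0][: -len(".index")]
```

Passing the returned prefix as --init_checkpoint should resolve the TensorSliceReader "failed to find any matching files" error.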
Hi all,
I just confirmed that the following works:
#!/bin/bash
BERT_MODEL=all_notes_150000
BERT_LOC=PATH/TO/CLINICALBERT/biobert_pretrain_output_${BERT_MODEL}
DATA_LOC=extract_features/vocab.txt
OUTPUT_LOC=extract_features/output_vecs_${BERT_MODEL}
python bert/extract_features.py \
--input_file=$DATA_LOC \
--output_file=$OUTPUT_LOC \
--vocab_file=$BERT_LOC/vocab.txt \
--bert_config_file=$BERT_LOC/bert_config.json \
--init_checkpoint=$BERT_LOC/model.ckpt-150000 \
--layers=-1
Please try that and let me know if you have any trouble.
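To actually use the extracted vectors, the output file can be parsed line by line. This is a minimal sketch assuming the JSON-lines format that Google's extract_features.py emits (one JSON object per input example, with a per-token "layers" list of {"index", "values"} entries); the helper name is hypothetical:

```python
import json

def load_token_vectors(jsonl_path):
    """Parse the JSON-lines output of extract_features.py.

    Each line is one input example with a 'features' list; each feature
    is one token carrying an entry per requested layer under
    'layers' -> 'values'. Returns a list of (tokens, vectors) pairs.
    """
    examples = []
    with open(jsonl_path) as f:
        for line in f:
            record = json.loads(line)
            tokens, vectors = [], []
            for feat in record["features"]:
                tokens.append(feat["token"])
                # With --layers=-1 there is exactly one layer per token.
                vectors.append(feat["layers"][0]["values"])
            examples.append((tokens, vectors))
    return examples
```

With --layers=-1 each token gets a single 768-dimensional vector from the last hidden layer; with multiple layers you would index into feat["layers"] accordingly.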
Hello,
This works now. Thank you for clarifying.
One question: is it the case that we can't generate embeddings for a huge text file given as input? Do you suggest any alternative way of approaching this task, since we cannot change the config file? Thank you.
Can you clarify what you are trying to do when you say you want to generate embeddings for a huge text file? And what errors are you getting?
closing due to inactivity, but feel free to post another issue.
Hello, I would like to know how feature vectors can be generated from a pandas Series containing notes. The notes of a single subject ID are combined into one note and preprocessed according to my requirements. Now I just want to create embedding vectors for the notes. How can this be done?
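One possible approach (a sketch, not the maintainers' recommended method): write each combined note as a single line of a plain-text input file, run extract_features.py on it with the working flags above, and then parse the output vectors. Iterating a pandas Series yields the note strings, so the helper below works for a Series or any iterable of strings; the function name is hypothetical. Note that tokens beyond --max_seq_length will be truncated, so very long notes may need to be split into chunks first.

```python
def write_notes_for_extraction(notes, input_path):
    """Write one note per line so extract_features.py treats each note
    as a separate example. Internal newlines are collapsed to spaces,
    since a newline inside a note would otherwise split it in two."""
    with open(input_path, "w") as f:
        for note in notes:
            f.write(" ".join(str(note).split()) + "\n")
```

After running the extraction script on input_path, each line of the resulting JSONL file corresponds to one note in the original Series, in order.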