EmilyAlsentzer / clinicalBERT

repository for Publicly Available Clinical BERT Embeddings
MIT License

Using pretrained clinicalBert model for extracting word/sentence or whole clinical note representation #15

Closed: bbardakk closed this issue 3 years ago

bbardakk commented 4 years ago

Hi @EmilyAlsentzer,

I tried to extract features as you suggested but ran into a problem. The original BERT example below runs fine.

echo 'Who was Jim Henson ? ||| Jim Henson was a puppeteer' > /tmp/input.txt

python extract_features.py \
  --input_file=/tmp/input.txt \
  --output_file=/tmp/output.jsonl \
  --vocab_file=$BERT_BASE_DIR/vocab.txt \
  --bert_config_file=$BERT_BASE_DIR/bert_config.json \
  --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
  --layers=-1,-2,-3,-4 \
  --max_seq_length=128 \
  --batch_size=8

I changed the vocab_file, bert_config_file, and init_checkpoint arguments to point to the ClinicalBert directory and ran the command below.

python extract_features.py \
  --input_file=/tmp/input.txt \
  --output_file=/tmp/output.jsonl \
  --vocab_file=bert_pretrain_output_all_notes_150000/vocab.txt \
  --bert_config_file=bert_pretrain_output_all_notes_150000/bert_config.json \
  --init_checkpoint=bert_pretrain_output_all_notes_150000/model.ckpt \
  --layers=-1,-2,-3,-4 \
  --max_seq_length=128

I got the error message below. I think the problem is with the init_checkpoint argument; I tried different names such as "model.ckpt" and "model.ckpt-150000", but none of them worked.

tensorflow.python.framework.errors_impl.NotFoundError: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for bert_pretrain_output_all_notes_150000/model.ckpt

Could you please help me run ClinicalBert to extract features from clinical notes? Also, is it possible to use ClinicalBert to extract an embedding for each word in a clinical note? Thanks in advance.
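On the per-word question: extract_features.py writes one JSON object per input line, with a "features" list holding one entry per WordPiece token and, for each token, one vector per requested layer. A minimal sketch of reading those vectors back is below. The helper name read_token_vectors and the tiny sample record are made up for illustration; the vectors are per WordPiece token rather than per whole word, so words split into several pieces would still need to be merged (for example by averaging).

```python
import json


def read_token_vectors(jsonl_lines, layer_index=-1):
    """Yield (token, vector) pairs for one layer of extract_features.py output.

    jsonl_lines: any iterable of JSON strings, e.g. an open /tmp/output.jsonl file.
    layer_index: one of the values passed to --layers (e.g. -1 for the last layer).
    """
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        for feature in record["features"]:
            for layer in feature["layers"]:
                if layer["index"] == layer_index:
                    yield feature["token"], layer["values"]


# Hypothetical two-token record with 2-dimensional vectors, for illustration only.
sample = json.dumps({
    "linex_index": 0,
    "features": [
        {"token": "[CLS]", "layers": [{"index": -1, "values": [0.1, 0.2]}]},
        {"token": "who", "layers": [{"index": -1, "values": [0.3, 0.4]}]},
    ],
})
pairs = list(read_token_vectors([sample]))
print(pairs)
```

With a real run, the same function could be given the open output file instead of the sample list, e.g. `with open("/tmp/output.jsonl") as f: pairs = list(read_token_vectors(f))`.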

AishwaryaAllada commented 3 years ago

Hello,

Did you figure out your error by any chance? There are three checkpoint files and I am not sure which one to use. Could you please advise if you can? Thank you!
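As general background on the "three files" question: a TensorFlow checkpoint in the standard format is stored as several files sharing one prefix (a .index file, one or more .data-* shards, and often a .meta file), and --init_checkpoint expects that shared prefix rather than any single file. The sketch below lists candidate prefixes in a directory, assuming the download uses that standard layout. The function name find_checkpoint_prefixes and the file names in the demo are hypothetical and are not the actual names in the ClinicalBERT archive.

```python
import os
import tempfile


def find_checkpoint_prefixes(model_dir):
    """Return candidate --init_checkpoint values found in model_dir.

    Each TensorFlow checkpoint has exactly one '<prefix>.index' file, so
    stripping that suffix recovers the prefix to pass to --init_checkpoint.
    """
    suffix = ".index"
    prefixes = []
    for name in sorted(os.listdir(model_dir)):
        if name.endswith(suffix):
            prefixes.append(os.path.join(model_dir, name[: -len(suffix)]))
    return prefixes


# Demo on a throwaway directory with hypothetical file names.
with tempfile.TemporaryDirectory() as demo_dir:
    for name in (
        "example.ckpt-1.index",
        "example.ckpt-1.data-00000-of-00001",
        "example.ckpt-1.meta",
        "vocab.txt",
    ):
        open(os.path.join(demo_dir, name), "w").close()
    demo_prefixes = [
        os.path.basename(p) for p in find_checkpoint_prefixes(demo_dir)
    ]
print(demo_prefixes)
```

Running the function on the unpacked model directory and passing one of the returned paths as --init_checkpoint is one way to rule out a mistyped prefix; if it returns an empty list, the directory likely does not contain a checkpoint in this format.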

EmilyAlsentzer commented 3 years ago

This is a duplicate of issue #12. Please refer to that issue.