EmilyAlsentzer / clinicalBERT

Repository for Publicly Available Clinical BERT Embeddings
MIT License

How to create note embeddings from pandas series using pretrained clinicalBERT+Discharge_summaries #12

Closed adiv5 closed 3 years ago

adiv5 commented 4 years ago

Hello, I would like to know how feature vectors can be generated from a pandas Series containing notes. The notes for each subject ID are combined into a single note and preprocessed according to my requirements. Now I just want to create embedding vectors for the notes. How can this be done?

EmilyAlsentzer commented 4 years ago

Do you want to extract fixed feature vectors instead of fine-tuning the model? If so, I recommend checking out the extract_features.py file in Google's BERT repo to get a sense of how to do this. Also check the documentation on their GitHub page under the heading "Using BERT to extract fixed feature vectors (like ELMo)". To use clinicalBERT, just replace the bert_config_file and init_checkpoint with the clinicalBERT versions.
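For the pandas side of the original question: extract_features.py expects a plain-text input file with one sequence per line, so the first step is to flatten each note in the Series onto a single line. A minimal sketch (the function name and file path are illustrative):

```python
# Sketch: dump a pandas Series of clinical notes to the one-note-per-line
# text file that extract_features.py expects as --input_file.
import pandas as pd

def series_to_input_file(notes: pd.Series, path: str) -> int:
    """Write one note per line, collapsing internal newlines to spaces."""
    cleaned = notes.fillna("").astype(str).str.replace(r"\s+", " ", regex=True).str.strip()
    cleaned = cleaned[cleaned != ""]  # drop empty notes
    with open(path, "w", encoding="utf-8") as f:
        for note in cleaned:
            f.write(note + "\n")
    return len(cleaned)  # number of lines written
```

The resulting file can then be passed as --input_file to the command shown below.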

adiv5 commented 4 years ago

I will try your suggested way and let you know if it works. For now, I'm closing the issue.

bbardakk commented 4 years ago

Hi @EmilyAlsentzer ,

I tried to extract features as you suggested but ran into a problem. When I run the original BERT example below, everything works fine.

echo 'Who was Jim Henson ? ||| Jim Henson was a puppeteer' > /tmp/input.txt

python extract_features.py \
  --input_file=/tmp/input.txt \
  --output_file=/tmp/output.jsonl \
  --vocab_file=$BERT_BASE_DIR/vocab.txt \
  --bert_config_file=$BERT_BASE_DIR/bert_config.json \
  --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
  --layers=-1,-2,-3,-4 \
  --max_seq_length=128 \
  --batch_size=8

I changed the bert_config_file and init_checkpoint parts and ran the command below.

python extract_features.py \
  --input_file=/tmp/input.txt \
  --output_file=/tmp/output.jsonl \
  --vocab_file=bert_pretrain_output_all_notes_150000/vocab.txt \
  --bert_config_file=bert_pretrain_output_all_notes_150000/bert_config.json \
  --init_checkpoint=bert_pretrain_output_all_notes_150000/model.ckpt \
  --layers=-1,-2,-3,-4 \
  --max_seq_length=128

I got the error message below. I think the problem is with the init_checkpoint part; I tried different names like "model.ckpt" and "model.ckpt-150000", but none of them worked.

tensorflow.python.framework.errors_impl.NotFoundError: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for bert_pretrain_output_all_notes_150000/model.ckpt
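For context on this error: a TensorFlow checkpoint is stored as several files sharing a common prefix (for example model.ckpt-150000.index, model.ckpt-150000.meta, and model.ckpt-150000.data-00000-of-00001), and --init_checkpoint must be that shared prefix, not the name of any one file. The checkpoint directory usually also contains a small plain-text file named `checkpoint` that records the prefix. A minimal sketch for reading it (assuming the standard format of that state file):

```python
# Sketch: recover the checkpoint prefix recorded in TensorFlow's plain-text
# "checkpoint" state file, e.g. to pass as --init_checkpoint.
import os
import re

def read_checkpoint_prefix(model_dir: str) -> str:
    """Return the prefix named in <model_dir>/checkpoint, e.g. 'model.ckpt-150000'."""
    state_file = os.path.join(model_dir, "checkpoint")
    with open(state_file, "r", encoding="utf-8") as f:
        text = f.read()
    # The first line typically looks like: model_checkpoint_path: "model.ckpt-150000"
    match = re.search(r'model_checkpoint_path:\s*"([^"]+)"', text)
    if match is None:
        raise ValueError(f"no model_checkpoint_path entry in {state_file}")
    return os.path.join(model_dir, match.group(1))
```

If the NotFoundError persists, it is worth checking with `ls` that the .index and .data files actually exist under the path being passed.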

So could you please help me run ClinicalBERT to extract features from clinical notes?

Thanks in advance.

AishwaryaAllada commented 3 years ago

Hello,

Did you figure out your error by any chance? There are three checkpoint files and I'm not sure which one to use. Could you please guide me if you can? Thank you!

EmilyAlsentzer commented 3 years ago

Hi all,

I just confirmed that the following works:

#!/bin/bash
BERT_MODEL=all_notes_150000

BERT_LOC=PATH/TO/CLINICALBERT/biobert_pretrain_output_${BERT_MODEL}
DATA_LOC=extract_features/vocab.txt
OUTPUT_LOC=extract_features/output_vecs_${BERT_MODEL}
python bert/extract_features.py \
  --input_file=$DATA_LOC \
  --output_file=$OUTPUT_LOC \
  --vocab_file=$BERT_LOC/vocab.txt \
  --bert_config_file=$BERT_LOC/bert_config.json \
  --init_checkpoint=$BERT_LOC/model.ckpt-150000 \
  --layers=-1

Please try that and let me know if you have any trouble.
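Once the script runs, it writes one JSON object per input line; in the format used by the BERT repo's extract_features.py, each object carries a "features" list with per-token "layers" entries containing the "values" vector. A sketch (assuming that output format) that mean-pools the requested layer into one fixed-size vector per note:

```python
# Sketch: turn extract_features.py JSONL output into one vector per note by
# averaging token vectors from the first requested layer (format assumed
# from the BERT repo's extract_features.py).
import json

def jsonl_to_vectors(path: str):
    vectors = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            token_vecs = [tok["layers"][0]["values"] for tok in record["features"]]
            dim = len(token_vecs[0])
            # Mean-pool across tokens -> a single fixed-size vector per note.
            mean = [sum(v[i] for v in token_vecs) / len(token_vecs) for i in range(dim)]
            vectors.append(mean)
    return vectors
```

The per-note vectors can then be stacked into an array aligned with the original pandas Series.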

AishwaryaAllada commented 3 years ago

Hello,

This works now. Thank you for clarifying.

One question: it seems we can't generate embeddings for a huge text file given as input? Do you suggest any alternative way of approaching this task, since we cannot change the config file? Thank you.
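On the "huge text" question generally: the model itself is capped at max_seq_length tokens, so long notes are usually handled by splitting them into (optionally overlapping) chunks, embedding each chunk, and pooling the chunk vectors afterwards. A minimal sketch of the splitting step, shown here on an already-tokenized list (a real pipeline would split on WordPiece tokens and leave room for [CLS]/[SEP]):

```python
# Sketch: sliding-window chunking of a long note so each chunk fits the
# model's max sequence length; chunk vectors can be averaged afterwards.
def chunk_tokens(tokens, max_len=128, stride=64):
    """Yield overlapping windows of at most max_len tokens."""
    if len(tokens) <= max_len:
        yield tokens
        return
    start = 0
    while start < len(tokens):
        yield tokens[start:start + max_len]
        if start + max_len >= len(tokens):
            break
        start += stride
```

The stride controls how much adjacent chunks overlap; stride == max_len gives non-overlapping chunks.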

EmilyAlsentzer commented 3 years ago

Can you clarify what you are trying to do when you say you want to generate embeddings for a huge text file? And what errors are you getting?

EmilyAlsentzer commented 3 years ago

Closing due to inactivity, but feel free to post another issue.