Hi @happypanda5, Sorry for the late response. Maybe this comment in #23 can help. Thanks.
Hi @jhyuklee, I would also like to get word embeddings. I followed your advice in https://github.com/dmis-lab/biobert/issues/23#issuecomment-503751089 and obtained all the word embeddings of a sentence. But the same word at different positions has a different contextual embedding. If I want just one embedding per word, what should I do? Should the input have only one word per line? Or something else? Looking forward to your reply.
I think you can try out, for example, https://github.com/huggingface/pytorch-transformers. Give it vocab.txt, the PyTorch-converted BERT weights, and your sentences. You can use BERT's last layer, the average vector over the 12 + 1 layers, or something else to get contextualized word embeddings.
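As a rough sketch (assuming the BioBERT checkpoint has already been converted to PyTorch and saved, together with vocab.txt, under a hypothetical `./biobert_pytorch/` directory), getting contextualized embeddings with pytorch-transformers could look like this:

```python
import torch
from pytorch_transformers import BertModel, BertTokenizer

# Hypothetical directory holding the PyTorch-converted BioBERT weights,
# bert_config.json, and vocab.txt.
MODEL_DIR = './biobert_pytorch/'

tokenizer = BertTokenizer.from_pretrained(MODEL_DIR)
# output_hidden_states=True makes the model return all layers,
# not just the last one.
model = BertModel.from_pretrained(MODEL_DIR, output_hidden_states=True)
model.eval()

text = "The patient was treated with aspirin."
input_ids = torch.tensor([tokenizer.encode(text, add_special_tokens=True)])

with torch.no_grad():
    outputs = model(input_ids)

# The last element is a tuple of 13 tensors: the embedding layer
# plus the 12 encoder layers (the "12 + 1").
hidden_states = outputs[-1]

last_layer = hidden_states[-1]                       # (1, seq_len, 768)
avg_layers = torch.stack(hidden_states).mean(dim=0)  # (1, seq_len, 768)
```

Each row of either tensor is the contextualized embedding of one wordpiece token, so the same word at different positions gets different vectors.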
Hi @futong,
the extract_features.py file gives you the embeddings of the last k layers, as defined by the input argument (see https://github.com/dmis-lab/biobert/blob/7a3c96e3fca2129fe74d78f9b707397fedd4cbd9/extract_features.py#L38), and all the position/segment/wordpiece embeddings are already included in the first layer.
Thanks.
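For reference, extract_features.py writes one JSON object per input line, and each token carries the layers requested via `--layers` (default `-1,-2,-3,-4`, i.e. the last four). A minimal sketch of reading per-token vectors back out (field names follow the upstream BERT script this repo builds on):

```python
import json

# Each line of the output file is a JSON object for one input sentence.
with open('prepoutput.json') as f:
    for line in f:
        sentence = json.loads(line)
        for feature in sentence['features']:
            token = feature['token']  # wordpiece token
            # 'layers' holds the layers requested via --layers;
            # entry 0 is the first one listed (index -1 = last layer).
            vector = feature['layers'][0]['values']  # 768 floats for base models
            print(token, vector[:3], '...')
```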
Hi, I am trying to get a word embedding vector from BioBERT and compare it with the word embedding vector I get from BERT.
However, I haven't been successful in running BioBERT.
I downloaded the weights from release v1.1-pubmed and, after unzipping them into a folder, I ran the following code:
```python
out = open('prepoutput.json', 'w')
import os
os.system('python3 "/content/biobert/extract_features.py" \
    --input_file= "/content/biobert/sample_text.txt" \
    --vocab_file= "/content/biobert_v1.1_pubmed/vocab.txt" \
    --bert_config_file= "/content/biobert_v1.1_pubmed/bert_config.json" \
    --init_checkpoint= "/content/biobert_v1.1_pubmed/model.ckpt.index" \
    --output_file= "/content/prepoutput.json" ')
```
The output is "256" and the file "prepoutput.json" is empty.
Please guide me.
Unfortunately, my attempts at converting the weights to PyTorch weren't successful either.
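For anyone who hits the same wall: os.system returning 256 means the child process exited with status 1 (the return value encodes the exit status in the high byte), so the script failed before writing any output. Two likely culprits in the command above are the spaces after each `=` (the shell then passes empty flag values) and `--init_checkpoint` pointing at the `.index` file instead of the checkpoint prefix; the `open('prepoutput.json', 'w')` handle is also unnecessary. A sketch of the corrected call, reusing the same Colab paths:

```python
import os

# Same invocation with the spaces after '=' removed and the checkpoint
# given as a prefix (model.ckpt), not as the .index file.
status = os.system(
    'python3 /content/biobert/extract_features.py'
    ' --input_file=/content/biobert/sample_text.txt'
    ' --vocab_file=/content/biobert_v1.1_pubmed/vocab.txt'
    ' --bert_config_file=/content/biobert_v1.1_pubmed/bert_config.json'
    ' --init_checkpoint=/content/biobert_v1.1_pubmed/model.ckpt'
    ' --output_file=/content/prepoutput.json'
)
print(status >> 8)  # 0 on success
```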