dmis-lab / biobert

Bioinformatics'2020: BioBERT: a pre-trained biomedical language representation model for biomedical text mining
http://doi.org/10.1093/bioinformatics/btz682

Can't get word embedding #37

Closed happypanda5 closed 5 years ago

happypanda5 commented 5 years ago

Hi, I am trying to get a word embedding vector for BioBERT, and compare it with the word embedding vector I get from BERT.

However, I haven't been successful in running BioBERT.

I have downloaded the weights from release v1.1-pubmed, and after unzipping them into a folder, I run the following code:

```python
out = open('prepoutput.json', 'w')

import os

os.system('python3 "/content/biobert/extract_features.py" \ --input_file= "/content/biobert/sample_text.txt" \ --vocab_file= "/content/biobert_v1.1_pubmed/vocab.txt" \ --bert_config_file= "/content/biobert_v1.1_pubmed/bert_config.json" \ --init_checkpoint= "/content/biobert_v1.1_pubmed/model.ckpt.index" \ --output_file= "/content/prepoutput.json" ')
```

The output is "256" and the file "prepoutput.json" is empty.

Please guide me.

Unfortunately, my attempts at converting the weights to PyTorch weren't successful either.

jhyuklee commented 5 years ago

Hi @happypanda5, sorry for the late response. Maybe this comment in #23 can help. Thanks.

futong commented 5 years ago

Hi @jhyuklee, I would also like to get word embeddings. I followed your advice in https://github.com/dmis-lab/biobert/issues/23#issuecomment-503751089 and obtained all the word embeddings of a sentence. But the same word at different positions has different contextual embeddings. If I want just one embedding per word, what should I do? Should the input be one word per line, or something else? Looking forward to your reply.

izuna385 commented 5 years ago

I think you can try, for example, https://github.com/huggingface/pytorch-transformers. Give it vocab.txt, the PyTorch-converted BERT weights, and your sentences. You can use BERT's last layer, the average of the 12 + 1 layers, or something else to get contextualized word embeddings.
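For illustration, here is a minimal sketch of that approach using the current Hugging Face transformers API (the successor to pytorch-transformers). The model directory is a hypothetical path to a PyTorch-converted BioBERT checkpoint plus its vocab.txt:

```python
import torch
from transformers import BertModel, BertTokenizer

# Hypothetical path: a BioBERT checkpoint already converted to PyTorch format,
# together with its vocab.txt.
model_dir = "./biobert_v1.1_pubmed_pt"

tokenizer = BertTokenizer.from_pretrained(model_dir)
model = BertModel.from_pretrained(model_dir, output_hidden_states=True)
model.eval()

inputs = tokenizer("The patient was treated with aspirin.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# hidden_states is a tuple of 13 tensors (embedding layer + 12 encoder layers),
# each of shape (batch, seq_len, hidden_size).
hidden_states = outputs.hidden_states
last_layer = hidden_states[-1][0]                    # last-layer vector per wordpiece
avg_layers = torch.stack(hidden_states).mean(dim=0)[0]  # average over the 12 + 1 layers

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for tok, vec in zip(tokens, last_layer):
    print(tok, vec.shape)
```

Note that the vectors come per wordpiece; to get a single vector per word you still need to pool the pieces of each word, e.g. by averaging them.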

jhyuklee commented 5 years ago

Hi @futong, the extract_features.py file gives you the embeddings of the last k layers, as specified by the input argument (see https://github.com/dmis-lab/biobert/blob/7a3c96e3fca2129fe74d78f9b707397fedd4cbd9/extract_features.py#L38), and the position/segment/wordpiece embeddings are already included in the first layer. Thanks.
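For reference, a minimal sketch of such a call, assuming the argument at that line is the --layers flag. The paths reuse the ones from earlier in this thread and are assumptions; the checkpoint prefix is passed without the .index suffix:

```python
import subprocess

# Sketch: run extract_features.py with an explicit --layers value.
subprocess.run([
    "python3", "/content/biobert/extract_features.py",
    "--input_file=/content/biobert/sample_text.txt",
    "--vocab_file=/content/biobert_v1.1_pubmed/vocab.txt",
    "--bert_config_file=/content/biobert_v1.1_pubmed/bert_config.json",
    "--init_checkpoint=/content/biobert_v1.1_pubmed/model.ckpt",
    "--layers=-1,-2,-3,-4",  # the last k layers written to the JSON output
    "--output_file=/content/prepoutput.json",
], check=True)
```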