cgmhaicenter / exBERT

How to update vocab.txt #5

Open koushikkonwar opened 3 years ago

koushikkonwar commented 3 years ago

Hi, I am trying to apply this repo to a different domain. I updated the vocab file by manually appending new tokens to the pretrained vocab.txt, but it doesn't seem to work. What is the correct way to add the vocabulary of my desired domain to the pretrained vocab.txt?

taiwen97 commented 3 years ago

Updating the vocabulary manually should be fine; the new words should go at the bottom of the vocab file. Can you show me more details of the error? Thanks!
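A minimal sketch of that append-at-the-bottom step, assuming the usual one-token-per-line vocab.txt format; exBERT_vocab.txt is the file name used in this thread, and new_tokens is a made-up example list:

```python
# Sketch: append new domain tokens to the END of a pretrained vocab.txt
# so the ids of the existing tokens stay unchanged. "new_tokens" is a
# hypothetical example.
new_tokens = ["cardiomyopathy", "##opathy", "immunoassay"]

with open("exBERT_vocab.txt", "r", encoding="utf-8") as f:
    existing = {line.rstrip("\n") for line in f}

with open("exBERT_vocab.txt", "a", encoding="utf-8") as f:
    for tok in new_tokens:
        if tok not in existing:  # skip tokens the vocab already has
            f.write(tok + "\n")
```

Keeping the original ids stable matters because a token's id is its line index in this file, so inserting new tokens anywhere but the end would shift the mapping for the pretrained embeddings.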

Tushar-Faroque commented 3 years ago

Dear @taiwen97, first of all, thank you all for the work and dedication.

You used the BERT WordPiece tokenizer on your biomedical domain corpus and then appended the new vocabulary to exBERT_vocab.txt, is that right? My question is: did you do any kind of preprocessing on the biomedical data before tokenization? Also, can I use another type of tokenizer that is not WordPiece-based?
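For the first question, here is a sketch of one way to derive a WordPiece vocabulary from a domain corpus using the Hugging Face tokenizers library (an assumption about tooling, not necessarily the authors' exact pipeline; corpus.txt and the vocab_size are placeholders):

```python
from tokenizers import BertWordPieceTokenizer

# Train a WordPiece tokenizer on a domain corpus (placeholder file name
# and vocab size) and write the resulting vocab.txt to disk.
tokenizer = BertWordPieceTokenizer(lowercase=False)
tokenizer.train(files=["corpus.txt"], vocab_size=5000, min_frequency=2)
tokenizer.save_model(".")  # writes ./vocab.txt

# New entries from ./vocab.txt can then be appended to exBERT_vocab.txt
# as described above.
```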

About the data_preprocess.py file, is there a required format for the raw text file? I tried a raw text file from Wikipedia using `python data_preprocess.py -voc ./exBERT_vocab.txt -ls 128 -dp ./wiki.txt -n_c 5 -rd 1 -sp ./wiki_data.pkl`, but it gave me this output: `[[], []]`. What am I doing wrong here?

Edit: Regarding the `[[], []]` output, I have fixed it. I was lowercasing the corpus during preprocessing, which is why the script never found the start symbol `self.start_symbol = 'A-Z\"\'\('`.
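A small sketch of why the lowercased corpus produced empty output, assuming the start-symbol class quoted above is matched against the first character of each sentence:

```python
import re

# The character class quoted above: an uppercase letter, a double
# quote, a single quote, or an opening parenthesis.
pattern = re.compile(r'[A-Z"\'(]')

print(bool(pattern.match('The patient was admitted.')))  # True: sentence kept
print(bool(pattern.match('the patient was admitted.')))  # False: lowercased text never matches
```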