autoliuweijie / K-BERT

Source code of K-BERT (AAAI2020)
https://ojs.aaai.org//index.php/AAAI/article/view/5681

Fine-tune on English Corpus #31

Open 106753004 opened 4 years ago

106753004 commented 4 years ago

I used BERT (model and tokenizer) to convert K-BERT into an English version. However, I got poor scores on the classification tasks. If you have K-BERT code for fine-tuning on an English corpus, could you please release it?

autoliuweijie commented 4 years ago

For English, please use:

Model: https://share.weiyun.com/5hWivED
Vocab: https://share.weiyun.com/5gBxBYD

However, there is no English KG file suitable for K-BERT. What KG do you use?

yushengsu-thu commented 4 years ago

Hello, @106753004 @autoliuweijie. I also want to implement K-BERT on an English corpus. @autoliuweijie, is the model you mentioned the Google BERT pre-trained on Wikipedia, or have you already done some fine-tuning on it? I used Google BERT (English) as the base model and the Wikidata (Download link) KG to fine-tune a new K-BERT for classification tasks, but failed to get good performance.

Actually, I referred to ERNIE and wondered if K-BERT can incorporate the Wikidata KG and be fine-tuned on different domain datasets such as TACRED and Open Entity. I extracted triples from the KG, tokenized them with the BERT tokenizer, and inserted them into the sentence in the same way, then followed the same procedure as in the paper. Is there any problem with my implementation?
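
For reference, I assumed the English KG should use the same tab-separated .spo layout as the files under brain/kgs/ in this repo, so the conversion I have in mind is roughly the sketch below (the triples and the output path are just placeholders):

```python
# Sketch only: dump extracted (subject, predicate, object) triples into a
# tab-separated .spo file, assuming the English KG should follow the same
# layout as the Chinese KGs shipped under brain/kgs/ (e.g. CnDbpedia.spo).
# `extracted_triples` is a placeholder for the Wikidata export.
extracted_triples = [
    ("Barack Obama", "occupation", "politician"),
    ("Python", "instance of", "programming language"),
]

with open("brain/kgs/Wikidata.spo", "w", encoding="utf-8") as f:
    for subj, pred, obj in extracted_triples:
        # Use surface-form labels rather than Q-ids so entity mentions in the
        # input sentence can actually be matched against the KG.
        f.write(f"{subj}\t{pred}\t{obj}\n")
```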

WenTingTseng commented 4 years ago

Hello, it seems that the vocab file cannot be downloaded.

inezvl commented 4 years ago

Hello, it is difficult to download the models if you don't have an account on WeChat or QQ. Can you make them accessible without a login? Thanks.

ankechiang commented 4 years ago

Hello,

Thanks for sharing! The model file can be downloaded successfully. Any chance you could upload the corresponding vocab file?

Thank you so much!

autoliuweijie commented 4 years ago

> Thanks for sharing! The model file can be downloaded successfully. Any chance you could upload the corresponding vocab file?

Sorry, for some reason the vocab file we uploaded was flagged as violating the network disk's policy and was deleted by the administrator. We are dealing with it and will release the file again as soon as possible.

autoliuweijie commented 4 years ago

> Hello, it is difficult to download the models if you don't have an account on WeChat or QQ. Can you make them accessible without a login? Thanks.

Sorry, we are looking for other free network disk storage.

autoliuweijie commented 4 years ago

> Thanks for sharing! The model file can be downloaded successfully. Any chance you could upload the corresponding vocab file?

You can get the corresponding vocab file from the UER project:

https://github.com/dbiir/UER-py/blob/master/models/google_uncased_en_vocab.txt
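
Before fine-tuning, a quick sanity check is to make sure this vocab and the English checkpoint agree on vocabulary size. A minimal sketch (it assumes the .bin file is a plain PyTorch state_dict and that the paths match where you saved the files):

```python
# Sketch: check that the English vocab file and the downloaded checkpoint
# agree on vocabulary size before fine-tuning (paths are assumptions).
import torch

vocab_path = "./models/google_uncased_en_vocab.txt"
model_path = "./models/english_model.bin"  # checkpoint from the link above

with open(vocab_path, encoding="utf-8") as f:
    vocab_size = sum(1 for _ in f)
print("vocab size:", vocab_size)  # BERT-base uncased should give 30522

state_dict = torch.load(model_path, map_location="cpu")
# List every parameter whose leading dimension equals the vocab size; the
# word-embedding (and output-layer) weights should show up here.
for name, tensor in state_dict.items():
    if tensor.dim() >= 1 and tensor.size(0) == vocab_size:
        print(name, tuple(tensor.shape))
```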

ankechiang commented 4 years ago

It works. Thanks for the clarification!

EdwardBurgin commented 4 years ago

Hey, with regards to English: I extracted some domain-specific triples from the English DBpedia, so that aspect is covered. I used a PyTorch script to convert cased BERT-base to the .bin file required by UER. However, the model loss doesn't decrease. I see that the method starts at the word level and then breaks down to individual characters; presumably this is for Chinese character-level embeddings. Is there a version for English WordPiece encoding, perhaps byte-pair encoding or even whole words? Many thanks, and great work!
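
For anyone attempting the same, the kind of query I mean for pulling domain-specific triples out of the public DBpedia SPARQL endpoint looks roughly like the sketch below (the category is only an example, and the returned URIs still have to be mapped to surface labels before K-BERT can match them):

```python
# Sketch: pull triples for one DBpedia category from the public SPARQL
# endpoint (the category and the LIMIT are illustrative, not a recipe).
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setReturnFormat(JSON)
sparql.setQuery("""
    PREFIX dct: <http://purl.org/dc/terms/>
    PREFIX dbc: <http://dbpedia.org/resource/Category:>
    SELECT ?s ?p ?o WHERE {
        ?s dct:subject dbc:Programming_languages .
        ?s ?p ?o .
    } LIMIT 1000
""")
results = sparql.query().convert()
for row in results["results"]["bindings"]:
    # Each value is a URI or a literal; URIs still need to be turned into
    # readable labels before they are usable as K-BERT entities.
    print(row["s"]["value"], row["p"]["value"], row["o"]["value"], sep="\t")
```

And to make the tokenization question concrete, this is the behaviour I am after for English: word-level nodes for KG lookup that are then split into WordPiece sub-tokens for BERT, instead of the per-character split (an illustrative sketch using the HuggingFace tokenizer, not the repo's own code):

```python
# Illustrative only: tokenize an English sentence at word level (for entity
# matching) and into WordPiece pieces (for BERT input), instead of splitting
# into individual characters as is done for Chinese.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def split_for_kbert(sentence):
    words = sentence.split()                         # word-level nodes
    pieces = [tokenizer.tokenize(w) for w in words]  # WordPiece per word
    return list(zip(words, pieces))

print(split_for_kbert("Tim Cook is the CEO of Apple"))
# e.g. [('Tim', ['tim']), ('Cook', ['cook']), ('is', ['is']), ...]
```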

Jiaxin-Liu-96 commented 3 years ago

> Hey, with regards to English: I extracted some domain-specific triples from the English DBpedia, so that aspect is covered. I used a PyTorch script to convert cased BERT-base to the .bin file required by UER. However, the model loss doesn't decrease. I see that the method starts at the word level and then breaks down to individual characters; presumably this is for Chinese character-level embeddings. Is there a version for English WordPiece encoding, perhaps byte-pair encoding or even whole words? Many thanks, and great work!

Hello, I am new to this domain, and I also want to apply this model to an English corpus. I hope you have time to give me some advice on a few questions:

1. Have you solved the problem of using English WordPiece encoding?
2. I don't know how to extract domain-specific triples from English DBpedia (e.g., for the computer-science domain); could you give me some advice?

Thank you in advance! I am waiting for your reply.

vsrana-ai commented 3 years ago

> english dbpedia

Hello, can you share the triples (English) and the BERT model for testing purposes? Did it finally work?

zhuchenxi commented 3 years ago

> I used BERT (model and tokenizer) to convert K-BERT into an English version. However, I got poor scores on the classification tasks. If you have K-BERT code for fine-tuning on an English corpus, could you please release it?

Did the English dataset finally work? Thanks very much.

vishprivenkat commented 1 year ago

Hello, I am a student working on a text classification task, and I'm trying to use K-BERT on a dataset that is purely in English. Although I understand the implementation strategies in K-BERT, I am a little lost on how to apply them to a purely English corpus. I see that the vocab file shared by @autoliuweijie is somehow not accessible. It would be great if you could give me a sense of direction on where to start.

Thank you

Jiaxin-Liu-96 commented 1 year ago

Hello, I have received your email and will reply as soon as possible! Have a nice day!