Tiiiger / bert_score

BERT score for text generation
MIT License

Create my own Model for Sentence Similarity/Automated Scoring #72

Closed · dhimasyoga16 closed this issue 4 years ago

dhimasyoga16 commented 4 years ago

I have a Wikipedia dump file as my corpus (it's in Indonesian; I've extracted it and converted it to .txt). How can I fine-tune bert-multilingual-cased on this corpus with BERTScore, so that I can have my own model for a specific task such as sentence similarity or automated short-answer scoring?

Or should I do this with the original BERT instead? Thank you so much in advance.

Tiiiger commented 4 years ago

Hi @dhimasyoga16, thank you for your interest in this repo. I am not sure what the question is asking.

Are you asking how to fine-tune the bert-multilingual model? For that, you need to check the Hugging Face examples to see how to continue training bert-multilingual with the masked language modeling objective. See https://github.com/huggingface/transformers/tree/master/examples/language-modeling.
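For reference, a typical invocation of the example script from that page (at the time, `run_language_modeling.py`) looked roughly like the command sketch below. The corpus file name and output directory are placeholders, and the hyperparameters are illustrative only:

```shell
# Continue masked-LM pre-training of multilingual BERT on a plain-text corpus.
# File names, output path, and hyperparameters are placeholders.
python run_language_modeling.py \
    --model_name_or_path bert-base-multilingual-cased \
    --train_data_file id_wiki_corpus.txt \
    --output_dir ./bert-multilingual-id \
    --mlm \
    --do_train \
    --per_gpu_train_batch_size 8 \
    --num_train_epochs 1
```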

If you have already fine-tuned bert-multilingual, you can feed in the model path to --model when calling the score function.

Feel free to follow up with more questions.

dhimasyoga16 commented 4 years ago

Hi, thank you so much for the quick reply. I'm also sorry my question was hard to understand, my bad.

Can I ask one more question? I've done feature extraction by running extract_features.py, and it generated a 17.3 GB JSON file. Can I use that JSON file as my model? I want BERTScore to analyze Indonesian sentences/texts better.

Thank you so much once again :)

Tiiiger commented 4 years ago

Hi @dhimasyoga16, which extract_features.py file are you talking about? Is it in this repo?

If you have precomputed the features, you can modify the code (https://github.com/Tiiiger/bert_score/blob/master/bert_score/utils.py#L253) to load the features instead of computing them again.

dhimasyoga16 commented 4 years ago

Hi, sorry for the inactivity on this issue. Referring to this link: https://github.com/huggingface/transformers/tree/master/examples/language-modeling

How can I fine-tune the bert-multilingual model for Indonesian? Can I use a Wikipedia dump file as the corpus?

Tiiiger commented 4 years ago

Hi @dhimasyoga16, this question is better posed to the Hugging Face repo. Hopefully they have detailed instructions.

We are not really experts on this topic.

dhimasyoga16 commented 4 years ago

Hi, I've successfully created my language model using Hugging Face Transformers.

When I run a test (using my own model, of course), why does --num_layers affect the score? For example, --num_layers 2 gives me a lower score than --num_layers 6. Is there a detailed explanation of this? And which --num_layers will give better scoring accuracy?

Sorry for so many questions; I'm new to NLP.

Tiiiger commented 4 years ago

Hi @dhimasyoga16, please see our paper for the effect of using different --num_layers. Basically, this argument controls which pre-trained layer of representations you are using. It is hard to say which --num_layers would work best for your application without seeing any validation data on our side.
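To choose a layer empirically, the usual recipe is to score a small validation set with each candidate layer and keep the layer whose scores correlate best with human judgments. A minimal sketch with made-up data, using Pearson correlation:

```python
import numpy as np

def best_layer(scores_by_layer, human_ratings):
    """Return the layer whose scores correlate best (Pearson)
    with human ratings over the same validation examples."""
    best, best_r = None, -np.inf
    for layer, scores in scores_by_layer.items():
        r = np.corrcoef(scores, human_ratings)[0, 1]
        if r > best_r:
            best, best_r = layer, r
    return best

# Toy validation data: per-layer F1 scores for 4 sentence pairs.
scores_by_layer = {
    2: [0.61, 0.58, 0.70, 0.55],
    6: [0.80, 0.52, 0.90, 0.49],
}
human = [0.9, 0.3, 1.0, 0.2]
print(best_layer(scores_by_layer, human))
```

Note that a higher raw score does not mean a better layer; what matters is how well the scores rank outputs the same way humans do.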