Drastic opened this issue 5 years ago
Hi @Drastic,
I am afraid the repo as is does not work well on languages that don't have the same alphabet and word structure as English. I know nothing about Ukrainian, but there are some preprocessing and postprocessing steps taken from Google's original BERT script for SQuAD that might be causing this issue.
For instance, the lines below perform a pre-tokenization by iterating over each character and checking whether it is whitespace before adding the token to the token list: https://github.com/cdqa-suite/cdQA/blob/50e10443bc37b7bd3546465d37059260f29549f0/cdqa/reader/bertqa_sklearn.py#L155-L167
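For reference, the logic there is essentially the character-level pre-tokenization from Google's run_squad.py; a standalone sketch:

```python
def is_whitespace(c):
    # The usual ASCII whitespace plus the narrow no-break space count as separators.
    return c in (" ", "\t", "\r", "\n") or ord(c) == 0x202F

def pre_tokenize(paragraph_text):
    """Split a paragraph into whitespace-delimited tokens and record, for each
    character, the index of the token it belongs to (as in the SQuAD preprocessing)."""
    doc_tokens = []
    char_to_word_offset = []
    prev_is_whitespace = True
    for c in paragraph_text:
        if is_whitespace(c):
            prev_is_whitespace = True
        else:
            if prev_is_whitespace:
                doc_tokens.append(c)   # start a new token
            else:
                doc_tokens[-1] += c    # extend the current token
            prev_is_whitespace = False
        char_to_word_offset.append(len(doc_tokens) - 1)
    return doc_tokens, char_to_word_offset

# Whitespace-separated Cyrillic text splits the same way English does:
print(pre_tokenize("Це тестове речення.")[0])  # ['Це', 'тестове', 'речення.']
```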
This also affects the function that generates the final answer: https://github.com/cdqa-suite/cdQA/blob/50e10443bc37b7bd3546465d37059260f29549f0/cdqa/reader/bertqa_sklearn.py#L434-L469
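Those lines implement the final-answer reconstruction step, presumably the get_final_text() logic from the same script. A very rough sketch of the idea, not the actual implementation, assuming the BasicTokenizer shipped with transformers:

```python
from transformers import BasicTokenizer

def project_answer(pred_text, orig_text, do_lower_case=False):
    """Rough sketch of the get_final_text() idea from the SQuAD post-processing:
    re-tokenize the original span with BasicTokenizer and try to locate the
    de-tokenized prediction inside it, to recover the exact original wording."""
    tokenizer = BasicTokenizer(do_lower_case=do_lower_case)
    tok_text = " ".join(tokenizer.tokenize(orig_text))
    start = tok_text.find(pred_text)
    if start == -1:
        # Alignment failed (casing, accents, script quirks, ...); the real code
        # falls back to returning the original span as-is.
        return orig_text
    return tok_text[start:start + len(pred_text)]
```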
I am not sure these are the main problems with such languages, but it seems to me that they might be.
Hi, @andrelmfarias
That doesn't seem to be the problem, as Ukrainian words are whitespace-separated too. It's just a different alphabet (Cyrillic), as you correctly pointed out, and 'bert-base-multilingual-cased' includes it.
At one point I was even troubleshooting errors where the reader couldn't find answers in the context. For instance:
```
Could not find answer: 'POP' vs. 'pop'
```
It was just a letter-casing issue, which I've fixed.
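For context, that warning comes from the example-reading step that cdQA reuses from Google's SQuAD script; a self-contained sketch of the check (the function wrapper here is mine):

```python
import logging

logger = logging.getLogger(__name__)

def answer_is_recoverable(doc_tokens, start_position, end_position, orig_answer_text):
    """Sketch of the check in read_squad_examples() that emits the warning above
    and silently skips the training example when the annotated answer cannot be
    found in the whitespace-tokenized context."""
    actual_text = " ".join(doc_tokens[start_position:(end_position + 1)])
    cleaned_answer_text = " ".join(orig_answer_text.split())
    if actual_text.find(cleaned_answer_text) == -1:
        logger.warning("Could not find answer: '%s' vs. '%s'",
                       actual_text, cleaned_answer_text)
        return False
    return True

# A pure casing mismatch between annotation and context is enough to trigger it:
answer_is_recoverable(["The", "POP", "genre"], 1, 1, "pop")
```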
So it looks like the reader is able to tokenize Ukrainian. Is there a way to dump and inspect the model's vocabulary?
I understand.
I have to point out that we use BertTokenizer in the preprocessing phase and BasicTokenizer in the postprocessing phase. You can check the vocabulary of these tokenizers:
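For example, something along these lines (shown with the transformers package for convenience; the underlying bert-base-multilingual-cased vocabulary is the same one cdQA downloads):

```python
from transformers import BasicTokenizer, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased", do_lower_case=False)

# The WordPiece vocabulary is a plain token -> id mapping (roughly 119k entries
# for this checkpoint), so you can dump it or filter it directly.
print(len(tokenizer.vocab))
cyrillic_pieces = [t for t in tokenizer.vocab
                   if any("\u0400" <= ch <= "\u04FF" for ch in t)]
print(len(cyrillic_pieces), cyrillic_pieces[:20])

# BasicTokenizer, used in the post-processing, has no vocabulary of its own:
# it only splits on whitespace/punctuation and optionally lowercases/strips accents.
basic = BasicTokenizer(do_lower_case=False)
print(basic.tokenize("Привіт, світе!"))  # ['Привіт', ',', 'світе', '!']
```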
Thank you! Looking into it.
Update: @andrelmfarias, you made a good point. It seems to be a major tokenization issue when fine-tuning the multilingual BERT model. Additional discussion can be found in https://github.com/huggingface/transformers/issues/982, and there is a separate article on the topic: "Hallo multilingual BERT, cómo funcionas?"
In general, BERT's WordPiece tokenizer works well for English. For other languages, the less a language is represented in the pre-trained model, the poorer its vocabulary coverage will be. The best solution is to train a language model from scratch.
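A quick way to see this is to compare how many WordPiece pieces the multilingual tokenizer needs per word in each language (illustrative only; the exact splits depend on the checkpoint):

```python
from transformers import BertTokenizer

tok = BertTokenizer.from_pretrained("bert-base-multilingual-cased", do_lower_case=False)

# Rough proxy for vocabulary coverage: well-covered languages stay close to one
# piece per word, under-represented ones break into many '##' continuation pieces.
for word in ["question", "answering", "питання", "відповідь"]:
    pieces = tok.tokenize(word)
    print(f"{word:12} -> {pieces} ({len(pieces)} pieces)")
```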
Hi! Thanks for this repo. I've trained a Ukrainian model by adapting the Swedish code example from https://github.com/cdqa-suite/cdQA/issues/236#issuecomment-523491362.
The output gives me strange results. In the example I tried, the answer is only a fragment of a word from the context. The reader actually picks the correct paragraph to search, but the answer is far from usable; only rarely do I get a whole word, let alone a full sentence.
I tested the tokenizer on its own and it seems to work correctly.
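A minimal version of such a check, assuming the same 'bert-base-multilingual-cased' tokenizer the reader uses, looks like this:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased", do_lower_case=False)

text = "Київ є столицею України."
tokens = tokenizer.tokenize(text)
ids = tokenizer.convert_tokens_to_ids(tokens)

print(tokens)  # expect Cyrillic sub-tokens rather than [UNK]
print(ids)     # the corresponding vocabulary ids
```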
Can someone give me a hint on where to look to fix this?
Corpus: SQuAD-uk with 30,000+ QA pairs. BERT parameters:
```python
BertProcessor(bert_model='bert-base-multilingual-cased', do_lower_case=False, is_training=True)
```
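The training itself follows the standard cdQA reader recipe, roughly along these lines (paths are placeholders, and parameter names follow the cdQA tutorials, so they may differ slightly between versions):

```python
from cdqa.reader.bertqa_sklearn import BertProcessor, BertQA

# Convert the SQuAD-uk annotations into BERT input features.
train_processor = BertProcessor(bert_model='bert-base-multilingual-cased',
                                do_lower_case=False,
                                is_training=True)
train_examples, train_features = train_processor.fit_transform(X='data/squad-uk-train.json')

# Fine-tune the multilingual reader on those features.
reader = BertQA(bert_model='bert-base-multilingual-cased',
                train_batch_size=12,
                learning_rate=3e-5,
                num_train_epochs=2,
                do_lower_case=False,
                output_dir='models')
reader.fit(X=(train_examples, train_features))
```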