Drastic opened this issue 5 years ago
Hi @Drastic,
I am afraid the repo as is does not work well on languages that don't have the same alphabet and word structure as English. I know nothing about Ukrainian, but there are some preprocessing and postprocessing steps taken from Google's original BERT script for SQuAD that might be causing this issue.
For instance, the lines below perform a pre-tokenization by iterating over each character and checking whether it is whitespace before adding the token to the token list: https://github.com/cdqa-suite/cdQA/blob/50e10443bc37b7bd3546465d37059260f29549f0/cdqa/reader/bertqa_sklearn.py#L155-L167
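For reference, the logic there is essentially the character-level pre-tokenization from Google's run_squad.py; a standalone sketch:

```python
def is_whitespace(c):
    # The usual ASCII whitespace plus the narrow no-break space count as separators.
    return c in (" ", "\t", "\r", "\n") or ord(c) == 0x202F

def pre_tokenize(paragraph_text):
    """Split a paragraph into whitespace-delimited tokens and record, for each
    character, the index of the token it belongs to (as in the SQuAD preprocessing)."""
    doc_tokens = []
    char_to_word_offset = []
    prev_is_whitespace = True
    for c in paragraph_text:
        if is_whitespace(c):
            prev_is_whitespace = True
        else:
            if prev_is_whitespace:
                doc_tokens.append(c)   # start a new token
            else:
                doc_tokens[-1] += c    # extend the current token
            prev_is_whitespace = False
        char_to_word_offset.append(len(doc_tokens) - 1)
    return doc_tokens, char_to_word_offset

# Whitespace-separated Cyrillic text splits the same way English does:
print(pre_tokenize("Це тестове речення.")[0])  # ['Це', 'тестове', 'речення.']
```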
This also affects the function that generates the final answer: https://github.com/cdqa-suite/cdQA/blob/50e10443bc37b7bd3546465d37059260f29549f0/cdqa/reader/bertqa_sklearn.py#L434-L469
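Those lines implement the final-answer reconstruction step, presumably the get_final_text() logic from the same script. A very rough sketch of the idea, not the actual implementation, assuming the BasicTokenizer shipped with transformers:

```python
from transformers import BasicTokenizer

def project_answer(pred_text, orig_text, do_lower_case=False):
    """Rough sketch of the get_final_text() idea from the SQuAD post-processing:
    re-tokenize the original span with BasicTokenizer and try to locate the
    de-tokenized prediction inside it, to recover the exact original wording."""
    tokenizer = BasicTokenizer(do_lower_case=do_lower_case)
    tok_text = " ".join(tokenizer.tokenize(orig_text))
    start = tok_text.find(pred_text)
    if start == -1:
        # Alignment failed (casing, accents, script quirks, ...); the real code
        # falls back to returning the original span as-is.
        return orig_text
    return tok_text[start:start + len(pred_text)]
```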
I am not sure these are the main problems with such languages, but it seems to me that they might be.
Hi, @andrelmfarias
That doesn't seem to be the problem, as Ukrainian words are whitespace-separated too. It's just a different alphabet (Cyrillic), as you correctly pointed out, and 'bert-base-multilingual-cased' includes it.
At one point I was even troubleshooting errors where the reader couldn't find answers in the context. For instance:
```
Could not find answer: 'POP' vs. 'pop'
```
It was just a letter-casing issue, which I've fixed.
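For context, that warning comes from the example-reading step that cdQA reuses from Google's SQuAD script; a self-contained sketch of the check (the function wrapper here is mine):

```python
import logging

logger = logging.getLogger(__name__)

def answer_is_recoverable(doc_tokens, start_position, end_position, orig_answer_text):
    """Sketch of the check in read_squad_examples() that emits the warning above
    and silently skips the training example when the annotated answer cannot be
    found in the whitespace-tokenized context."""
    actual_text = " ".join(doc_tokens[start_position:(end_position + 1)])
    cleaned_answer_text = " ".join(orig_answer_text.split())
    if actual_text.find(cleaned_answer_text) == -1:
        logger.warning("Could not find answer: '%s' vs. '%s'",
                       actual_text, cleaned_answer_text)
        return False
    return True

# A pure casing mismatch between annotation and context is enough to trigger it:
answer_is_recoverable(["The", "POP", "genre"], 1, 1, "pop")
```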
So it looks like the reader is able to tokenize Ukrainian. Is there a way to dump and inspect the model's vocabulary?
I understand.
I have to point out that we use BertTokenizer in the preprocessing phase and BasicTokenizer in the postprocessing phase. You can check the vocabulary of these tokenizers:
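For example, something along these lines (shown with the transformers package for convenience; the underlying bert-base-multilingual-cased vocabulary is the same one cdQA downloads):

```python
from transformers import BasicTokenizer, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased", do_lower_case=False)

# The WordPiece vocabulary is a plain token -> id mapping (roughly 119k entries
# for this checkpoint), so you can dump it or filter it directly.
print(len(tokenizer.vocab))
cyrillic_pieces = [t for t in tokenizer.vocab
                   if any("\u0400" <= ch <= "\u04FF" for ch in t)]
print(len(cyrillic_pieces), cyrillic_pieces[:20])

# BasicTokenizer, used in the post-processing, has no vocabulary of its own:
# it only splits on whitespace/punctuation and optionally lowercases/strips accents.
basic = BasicTokenizer(do_lower_case=False)
print(basic.tokenize("Привіт, світе!"))  # ['Привіт', ',', 'світе', '!']
```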
Thank you! Looking into it.
Update: @andrelmfarias, you made a good point. It seems to be a major tokenization issue when fine-tuning the multilingual BERT model. Additional discussion can be found in https://github.com/huggingface/transformers/issues/982, and there is a separate article on the topic: "Hallo multilingual BERT, cómo funcionas?"
In general, BERT's WordPiece tokenizer works well for English. For other languages, the less a language is represented in the pre-trained model, the poorer its vocabulary coverage will be. The best solution is to train a language model from scratch.
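A quick way to see this is to compare how many WordPiece pieces the multilingual tokenizer needs per word in each language (illustrative only; the exact splits depend on the checkpoint):

```python
from transformers import BertTokenizer

tok = BertTokenizer.from_pretrained("bert-base-multilingual-cased", do_lower_case=False)

# Rough proxy for vocabulary coverage: well-covered languages stay close to one
# piece per word, under-represented ones break into many '##' continuation pieces.
for word in ["question", "answering", "питання", "відповідь"]:
    pieces = tok.tokenize(word)
    print(f"{word:12} -> {pieces} ({len(pieces)} pieces)")
```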
Hi! Thanks for this repo. I've trained a Ukrainian model by adapting the Swedish code example from https://github.com/cdqa-suite/cdQA/issues/236#issuecomment-523491362.
The output gives me strange results. In the example I tried, the answer is only a fragment of a word from the context. The reader actually picks the correct paragraph to search, but the answer is far from usable; only rarely do I get a whole word, let alone a full sentence.
I tested the tokenizer on its own and it seems to work correctly.
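A minimal version of such a check, assuming the same 'bert-base-multilingual-cased' tokenizer the reader uses, looks like this:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased", do_lower_case=False)

text = "Київ є столицею України."
tokens = tokenizer.tokenize(text)
ids = tokenizer.convert_tokens_to_ids(tokens)

print(tokens)  # expect Cyrillic sub-tokens rather than [UNK]
print(ids)     # the corresponding vocabulary ids
```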
Can someone give me a hint on where to look to fix this?
Corpus: SQuAD-uk with 30,000+ QA pairs. BERT parameters:
```python
BertProcessor(bert_model='bert-base-multilingual-cased', do_lower_case=False, is_training=True)
```
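The training itself follows the standard cdQA reader recipe, roughly along these lines (paths are placeholders, and parameter names follow the cdQA tutorials, so they may differ slightly between versions):

```python
from cdqa.reader.bertqa_sklearn import BertProcessor, BertQA

# Convert the SQuAD-uk annotations into BERT input features.
train_processor = BertProcessor(bert_model='bert-base-multilingual-cased',
                                do_lower_case=False,
                                is_training=True)
train_examples, train_features = train_processor.fit_transform(X='data/squad-uk-train.json')

# Fine-tune the multilingual reader on those features.
reader = BertQA(bert_model='bert-base-multilingual-cased',
                train_batch_size=12,
                learning_rate=3e-5,
                num_train_epochs=2,
                do_lower_case=False,
                output_dir='models')
reader.fit(X=(train_examples, train_features))
```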