ebanalyse / NERDA

Framework for fine-tuning pretrained transformers for Named-Entity Recognition (NER) tasks
MIT License

fixing behavior of transformers tokenizer for all chars and words #15

Closed · meti-94 closed this issue 3 years ago

meti-94 commented 3 years ago

Hi, I was finally able to work on the code over the weekend, and I found the cause of the error: a tokenizer problem. In many languages, including my own (Persian), there are words and characters (such as abbreviations) that the tokenize() method of the tokenizer class cannot identify, so for such inputs it returns an empty list of word pieces. In the next step, the offsets array is still extended (by [1]) even though no word piece was identified, which eventually leads to errors during training and evaluation. For example, the mark ۖ indicates sanctity for religious figures and appears in many texts, but it cannot be identified.
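
A minimal sketch of the failure mode described above; the model name, the example words, and the bookkeeping loop are illustrative stand-ins, not NERDA's actual preprocessing code:

```python
from transformers import AutoTokenizer

# Illustrative reproduction: a mark the tokenizer's normalizer removes
# yields an empty list of word pieces, yet the offsets bookkeeping is
# still extended by one entry per word, so the two sequences drift apart.
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

words = ["سلام", "ۖ", "دنیا"]  # the middle "word" is the problematic mark
tokens, offsets = [], []
for word in words:
    pieces = tokenizer.tokenize(word)  # may return [] for such marks
    tokens.extend(pieces)
    offsets.extend([1])                # extended even when pieces == []

# When any word produces no pieces, the lengths disagree, which later
# breaks training and evaluation.
print(len(tokens), len(offsets))
```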

Dhruvit-Chaniyara commented 3 years ago

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased", use_fast=False)

add "use_fast=false" in parameters. more details on this link.

smaakage85 commented 3 years ago

Hi again @meti-94 and @Dhruvit-Chaniyara. I have implemented use_fast=False as the default for the tokenizer. Does it fix this particular issue?
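
A hypothetical sketch of what such a default might look like (load_tokenizer is an invented helper for illustration, not NERDA's actual source):

```python
from transformers import AutoTokenizer

def load_tokenizer(transformer: str = "bert-base-cased"):
    # Default to the slow Python tokenizer, which handles the
    # problematic characters more predictably (see the issue above).
    return AutoTokenizer.from_pretrained(transformer, use_fast=False)
```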

smaakage85 commented 3 years ago

Thanks for the pull request. It looks fine! I will merge.

github-actions[bot] commented 3 years ago

Unit Test Results

1 file ±0 · 1 suite ±0 · 2m 34s (+10s)
18 tests ±0 · 17 passed (−1) · 0 skipped (±0) · 1 failed (+1)

For more details on these failures, see this check.

Results for commit bb24d82a. ± Comparison against base commit 6cf4a616.