Closed: meti-94 closed this issue 3 years ago
```python
from transformers import AutoTokenizer

# Load the slow (Python) tokenizer instead of the fast (Rust) one.
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased", use_fast=False)
```

Add `use_fast=False` to the parameters; more details at this link.
Hi again @meti-94 and @Dhruvit-Chaniyara. I have implemented `use_fast=False` as the default for the tokenizer. Does it fix this particular issue?
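Roughly, that change makes the slow tokenizer the default wherever the tokenizer is built. A minimal sketch of that kind of change (the helper name is hypothetical, not the actual diff):

```python
from transformers import AutoTokenizer

def build_tokenizer(model_name: str, use_fast: bool = False):
    """Build a tokenizer, defaulting to the slow (Python) implementation.

    Callers must opt in to the fast (Rust) tokenizer explicitly.
    """
    return AutoTokenizer.from_pretrained(model_name, use_fast=use_fast)
```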
Thanks for the pull request. It looks fine! I will merge.
1 file ±0 · 1 suite ±0 · 2m 34s :stopwatch: +10s
18 tests ±0 · 17 :heavy_check_mark: −1 · 0 :zzz: ±0 · 1 :x: +1
For more details on these failures, see this check.
Results for commit bb24d82a. ± Comparison against base commit 6cf4a616.
Hi, I was finally able to work on the code over the weekend, and I found the cause of the error: a tokenizer problem. In many languages, including mine (Persian), there are words and characters (abbreviations) that the `tokenize()` method of the `tokenizers` class cannot identify, so for such inputs it returns an empty list of word pieces. In the next step, the `offsets` array is still expanded (by `[1]`) even though no word piece was identified, which eventually leads to errors during training and evaluation. For example, the word ۖ, which indicates sanctity for religious figures and appears in many writings, cannot be identified.
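A minimal sketch of the mismatch described above (the word list, loop, and offsets scheme are hypothetical illustrations, not the project's actual code): when `tokenize()` returns an empty list for a word but the `offsets` array is expanded by `[1]` anyway, the offsets stop accounting for exactly the word pieces that were actually emitted.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# Hypothetical input: the middle "word" is the mark ۖ, which may
# yield no word pieces at all.
words = ["کتاب", "ۖ", "خدا"]

pieces, offsets = [], []
for word in words:
    word_pieces = tokenizer.tokenize(word)
    pieces.extend(word_pieces)
    # Buggy pattern described above: offsets is expanded (by [1])
    # even when no word piece was identified, so it records a
    # phantom piece for a word that contributed none.
    offsets.extend([1] if not word_pieces else [len(word_pieces)])

# sum(offsets) should equal len(pieces); the phantom entry breaks
# that invariant and later derails training and evaluation.
print(len(pieces), sum(offsets))
```

Skipping words that yield no pieces, or, as in this thread, passing `use_fast=False` so the slow tokenizer handles such characters, keeps the two arrays aligned.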