-
The project does not include a vocab.txt file, and the provided dataset does not contain one either:
```
# config_ner.py
self.vocab_file = '../data/vocab.txt'
```
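If the checkpoint you downloaded is missing vocab.txt, one stopgap is to derive a character-level vocab from your own data. This is only a sketch: the one-token-per-line format, the BERT-style special tokens, and the `max_size` default are assumptions, not the project's documented layout — verify them against the checkpoint you actually use.

```python
# Sketch: build a one-token-per-line vocab.txt from raw corpus lines.
# Assumes BERT-style special tokens come first and the model tokenizes
# Chinese at the character level; check both against your checkpoint.
from collections import Counter

def build_vocab(lines, max_size=21128):
    counts = Counter()
    for line in lines:
        counts.update(line.strip())  # character-level counting
    specials = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]
    tokens = [tok for tok, _ in counts.most_common(max_size - len(specials))]
    return specials + tokens

vocab = build_vocab(["你好世界", "hello world"])
with open("vocab.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(vocab))
```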
-
Hi, I encountered a problem when running the code in benchmark/bertret and would like to ask for help. It seems that 'chinese_wwm_pytorch' cannot be found, including all related files (/vocab.txt, /ad…
-
##### **Describe the bug**
Unit test test_issue_1959.py fails when run on a system whose UTC offset is nonzero. For example, this test fails on my system, where the time zone is set to Pacific Stan…
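Offset-dependent failures like this usually come from naive datetimes that implicitly use the machine's local time zone. A minimal sketch of the difference (the variable names are illustrative, not taken from test_issue_1959.py):

```python
from datetime import datetime, timezone

ts = 0  # the Unix epoch

naive = datetime.fromtimestamp(ts)                    # interpreted in the *local* zone
aware = datetime.fromtimestamp(ts, tz=timezone.utc)   # the same instant everywhere

# naive.hour varies with the machine's UTC offset, which is exactly what
# makes a unit test pass at UTC and fail under Pacific Standard Time.
print(aware.isoformat())  # 1970-01-01T00:00:00+00:00
```

Comparing only timezone-aware values (or normalizing everything to UTC) makes the assertion independent of the host's offset.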
-
We train on 500 txt files (containing both Chinese and English) with WordPieceTrainer and set vocab_size to 30522, but the vocabulary in the resulting JSON has 32430 entries.
bert_tokenizer = Tokenizer(WordPiece(unk_token="[UNK]",max_input_ch…
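One possible explanation — an assumption, not something confirmed by the snippet — is that WordPiece keeps the full initial alphabet it sees, and a mixed Chinese/English corpus contains far more unique characters than an English-only one, pushing the saved count past the requested vocab_size. The saved JSON can be inspected with the stdlib alone; the `model.vocab` layout below follows the tokenizers serialization format, and the tiny inline dict stands in for a real `tokenizer.json`:

```python
import json

# Sketch: count the entries actually saved by tokenizer.save("tokenizer.json").
# A tiny stand-in JSON is built inline instead of reading a real file.
tokenizer_json = {
    "model": {
        "type": "WordPiece",
        "vocab": {"[UNK]": 0, "[CLS]": 1, "[SEP]": 2, "你": 3, "好": 4, "##llo": 5},
    }
}
data = json.loads(json.dumps(tokenizer_json))  # stands in for json.load(open(path))
saved_vocab_size = len(data["model"]["vocab"])
print(saved_vocab_size)  # 6; compare against the vocab_size you requested
```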
-
# 🐛 Bug
## Information
Model I am using (Bert, XLNet ...): Community Models
Language I am using the model on (English, Chinese ...): Multiple different ones
Quite a few community models ca…
-
When processing large documents, I usually process sentence by sentence, which leaves me with numerous `Doc()` objects per document. It would be great if I could merge those objects into one and then serialize/save t…
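Until merging is supported upstream, one stdlib workaround is to serialize the whole list of per-sentence results into a single blob. This is a sketch only: the dicts below are plain placeholders, not real `Doc()` objects, and whether the real objects are picklable depends on the library.

```python
import pickle

# Stand-ins for the per-sentence Doc() objects (hypothetical placeholders).
sentence_docs = [{"text": "First sentence."}, {"text": "Second sentence."}]

# Serialize the whole list at once instead of one file per sentence.
blob = pickle.dumps(sentence_docs)
restored = pickle.loads(blob)
```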
-
@Steffy-zxf
When fine-tuning with paddle1.8, the automatically downloaded dataset works fine, but switching to a custom dataset for fine-tuning raises the following error:
```powershell
Traceback (most recent call last):
File "sequence_label.py", line 187, in <module>
main()
File "se…
-
Hi, thanks for the repo!
Can I use the code for non-Chinese languages, say for Russian text?
Thanks!
-
I am trying to train a custom BertWordPieceTokenizer for the Ukrainian language.
`tokenizer = ByteLevelBPETokenizer(lowercase = True, unicode_normalizer='nfkc')`
`tokenizer.train(
files=pat…
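The normalization the snippet asks for (`unicode_normalizer='nfkc'` plus `lowercase=True`) can be reproduced with the stdlib to sanity-check what the tokenizer will see. This is a sketch of the same transformation, not the tokenizer's internal code:

```python
import unicodedata

def normalize(text):
    # NFKC folds compatibility characters (e.g. full-width forms),
    # then lowercasing mirrors lowercase=True from the snippet.
    return unicodedata.normalize("NFKC", text).lower()

print(normalize("Привіт Світ"))  # привіт світ
print(normalize("Ｈｅｌｌｏ"))    # hello
```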
-
Hey there!
I am trying to train a tokenizer with `BertWordPieceTokenizer`.
I use an iterator that yields the text and call `tokenizer.train_from_iterator`.
After training the tokenizer I realized tha…
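The iterator pattern described above can be sketched with the stdlib alone. The batched shape is an assumption about what `train_from_iterator` consumes; a plain iterator of strings works too, so adapt the yield to your setup:

```python
def text_iterator(lines, batch_size=2):
    """Lazily yield batches of raw text, one batch per iteration."""
    batch = []
    for line in lines:
        batch.append(line)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final partial batch
        yield batch

corpus = ["first line", "second line", "third line"]
batches = list(text_iterator(corpus))
print(batches)  # [['first line', 'second line'], ['third line']]
```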