-
When training a joint SPM model on two or more languages, is there a way to alleviate the problem of a token from language 1 being segmented into subunits seen in language 2, which causes UNKs at test time?
I…
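One common mitigation is to restrict encoding to subwords actually observed in the relevant language's data (spm_encode exposes this via its `--vocabulary`/`--vocabulary_threshold` options). The sketch below illustrates the idea in plain Python with a character-level fallback; the function name is hypothetical and not part of SentencePiece.

```python
def restrict_to_vocab(subwords, vocab):
    """Replace subwords unseen in this language's vocabulary with a
    character-level fallback, so they cannot surface as UNK at test time.
    Sketch only; spm_encode --vocabulary implements this properly."""
    out = []
    for sw in subwords:
        if sw in vocab:
            out.append(sw)
        else:
            out.extend(sw)  # fall back to single characters (assumed in-vocab)
    return out
```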
-
Thanks a lot for this work.
According to the function `ReadWord`:
https://github.com/yumeng5/Spherical-Text-Embedding/blob/master/jose.c#L60
a word is defined as a sequence of characters with some d…
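For reference, my reading of that word2vec-style `ReadWord` loop, sketched in Python (this is an interpretation of the C code, not the code itself): a word is a maximal run of characters delimited by space, tab, or newline, with a newline emitted as the sentence marker `</s>`.

```python
def read_words(text):
    """Tokenize text the way a word2vec-style ReadWord loop appears to:
    split on space/tab/newline, emitting </s> for each newline."""
    words, buf = [], []
    for ch in text:
        if ch in (" ", "\t", "\n"):
            if buf:
                words.append("".join(buf))
                buf = []
            if ch == "\n":
                words.append("</s>")  # newline acts as a sentence boundary
        else:
            buf.append(ch)
    if buf:
        words.append("".join(buf))
    return words
```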
-
Thank you for providing this useful toolkit! I am new to it and still learning. As I understand it, in CTC a repeated label means continuing the last character, so what does the self-transition mean? Can I treat them as…
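For context, the standard CTC collapsing rule merges consecutive repeated labels (a repeat continues the previous character) and then removes blanks; a minimal sketch, assuming "-" as the blank symbol:

```python
def ctc_collapse(labels, blank="-"):
    """Collapse a CTC label sequence: merge consecutive repeats
    (a repeat continues the previous character), then drop blanks."""
    out, prev = [], None
    for lab in labels:
        if lab != prev:
            out.append(lab)
        prev = lab
    return "".join(l for l in out if l != blank)
```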
-
I tried to run the code below:
```python
import sentencepiece as spm
spm.set_random_generator_seed(1)
spm.SentencePieceTrainer.train('--input=botchan.txt --model_type=bpe --vocab_size=10000 --model_p…
```
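Independent of SentencePiece itself, the point of fixing a seed is that two runs with the same seed produce identical results; a stdlib-only illustration of that reproducibility contract:

```python
import random

# Two generators seeded identically produce identical sequences;
# this is the determinism a fixed seed is meant to give training.
a = random.Random(1)
b = random.Random(1)
seq_a = [a.randint(0, 100) for _ in range(5)]
seq_b = [b.randint(0, 100) for _ in range(5)]
assert seq_a == seq_b
```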
-
The output of the `--alignment` option seems to be an alignment on subword units rather than on the tokens themselves:
```
Hello there ||| Hallo da
0-0 1-1 2-2
Hello ||| HalloHalloHalloHallo
0-…
```
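If only subword-level alignments are available, they can be projected back to token-level ones given a map from each subword index to its originating token index; a small sketch (the helper and its inputs are hypothetical, not part of the aligner's interface):

```python
def project_alignment(align_pairs, src_groups, tgt_groups):
    """Map a subword-level alignment to a token-level one.
    src_groups/tgt_groups give, for each subword index, the index of
    the original token it came from; duplicate pairs are merged."""
    return sorted({(src_groups[i], tgt_groups[j]) for i, j in align_pairs})
```

For example, if "Hello" splits into two subwords that both align to the two subwords of "Hallo", the token-level result collapses to a single 0-0 link.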
-
I'm developing a transformer-based NMT system for low-resource English-Sinhala translation using a parallel corpus of 54k sentences (vocab size = 5k). I experimented with BPE and unigram as subword segm…
-
It seems that special tokens are not respected by BPEmb. For instance, "\" gets parsed into multiple subword tokens instead of being caught and assigned the appropriate index. This is true even when i…
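A generic workaround (not a BPEmb API) is to split the input around the special tokens first and run the subword encoder only on the ordinary spans, so each special token survives as a single unit; a sketch where `encode` stands in for any subword encoder:

```python
import re

def protect_special(text, specials, encode):
    """Split text around special tokens and subword-encode only the
    ordinary spans, keeping each special token as one piece.
    `encode` is any callable from string to a list of pieces."""
    pattern = "(" + "|".join(re.escape(s) for s in specials) + ")"
    pieces = []
    for span in re.split(pattern, text):
        if span in specials:
            pieces.append(span)       # pass special tokens through intact
        elif span:
            pieces.extend(encode(span))
    return pieces
```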
-
If I use spm_train and spm_encode, i.e. SentencePiece, in ESPnet,
the dictionary is based on subword units/tokens.
Can I use ctc_segmentation directly (as in the tedlium2 example)?
It seems to be possib…
-
I want to implement SentencePiece BPE as the segmentation algorithm for my NMT task. My corpus size is less than 100k sentences, and the source and target languages are very distant.
- Should I us…
-
Currently I run prediction like this: res=multi_label_cls_task.predict(data=encoded_data, label_list=label_list)
It returns the label 0 or 1. I would like it to return both the label and the corresponding probability. What parameters should I pass? I can't find this in the documentation.
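Independent of the toolkit's `predict` signature, multi-label outputs usually come from per-label scores thresholded at 0.5; a sketch of turning such scores into (label, probability) pairs (illustrative only; the actual parameter for this may differ):

```python
import math

def labels_with_probs(logits, threshold=0.5):
    """Convert per-label logits into (label, probability) pairs for
    multi-label classification, instead of returning bare 0/1.
    Uses a sigmoid per label; threshold decides the hard label."""
    probs = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    return [(int(p >= threshold), round(p, 4)) for p in probs]
```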