-
@cshanbo @lukaszkaiser
Hi, I've read all of your discussion in #111, but I don't know the results of your tests on the segmented and non-segmented Chinese datasets. I'm using t2t on en-zh translation…
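For reference, a minimal sketch of the two input styles being compared, segmented vs. non-segmented Chinese; jieba is only an assumed segmenter for illustration and is not part of t2t.

```python
# Sketch comparing word-segmented vs. character-stream input for a Chinese
# sentence. jieba is an assumption for the segmenter; t2t does not require it.
import jieba

sentence = "今天天气很好"

# "Segmented" dataset style: words separated by spaces before subword encoding.
segmented = " ".join(jieba.cut(sentence))

# "Non-segmented" dataset style: the raw string, no spaces for the encoder to use.
non_segmented = sentence

print(segmented)      # e.g. 今天 天气 很 好
print(non_segmented)  # 今天天气很好
```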
-
Hi, when I am running TranslateEnzhWmt8k and set the vocabulary size to 8K, the generated Chinese vocabulary is normal, but when I set it to 32K the file contains lots of lines like this:
It seems…
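A quick way to quantify those odd-looking lines is to count the vocab entries that are codepoint escapes; the sketch below assumes the tensor2tensor convention of writing out-of-alphabet characters as `\<codepoint>;`, and the file name is only an example.

```python
# Rough sketch: count vocab entries stored as "\<codepoint>;" escapes.
# The escape convention and the file name are assumptions about the t2t output.
import re

ESCAPE_RE = re.compile(r"\\\d+;")

def count_escaped_entries(vocab_path):
    total = escaped = 0
    with open(vocab_path, encoding="utf-8") as f:
        for line in f:
            total += 1
            if ESCAPE_RE.search(line):
                escaped += 1
    return total, escaped

total, escaped = count_escaped_entries("vocab.zh.32768.subwords")  # example name
print(f"{escaped}/{total} entries contain codepoint escapes")
```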
-
Your reported M2 score on CoNLL2014 is 57.53.
In your paper, the M2 score is 55.8.
-
Hi, thank you for the great work and for your continued involvement with the community over the years.
I am preparing fine-tuning data for the en-zh and zh-en Helsinki models for use in Marian NMT.
I t…
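A common way to prepare such data is to run it through the SentencePiece models shipped with the OPUS-MT release before handing it to Marian; the sketch below assumes the model files are called source.spm and target.spm (adjust to the files in your release).

```python
# Minimal sketch: apply the release's SentencePiece models to a parallel corpus
# before Marian fine-tuning. File names are placeholders.
import sentencepiece as spm

src_sp = spm.SentencePieceProcessor(model_file="source.spm")
tgt_sp = spm.SentencePieceProcessor(model_file="target.spm")

def encode_corpus(in_path, out_path, sp):
    """Encode each line into space-joined SentencePiece pieces."""
    with open(in_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            pieces = sp.encode(line.strip(), out_type=str)
            fout.write(" ".join(pieces) + "\n")

encode_corpus("train.en", "train.sp.en", src_sp)
encode_corpus("train.zh", "train.sp.zh", tgt_sp)
```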
-
SentencePiece does not split unknown tokens by default, which results in a larger vocabulary in tokenized corpora. Also, the Tokenizer does not seem to work very well with the spacer; more segmentation is made by …
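A small sketch of the behaviour in question: by default a character outside the training data becomes a single <unk> id, whereas training with byte_fallback decomposes it into byte pieces. byte_fallback is only one possible workaround, not necessarily the fix wanted here; file names and the vocab size are placeholders.

```python
# Illustrates the default <unk> behaviour vs. byte_fallback.
# Placeholders: corpus.txt, model prefixes, vocab_size.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus.txt", model_prefix="sp_default", vocab_size=8000)
spm.SentencePieceTrainer.train(
    input="corpus.txt", model_prefix="sp_bytefb", vocab_size=8000,
    byte_fallback=True)

sp_default = spm.SentencePieceProcessor(model_file="sp_default.model")
sp_bytefb = spm.SentencePieceProcessor(model_file="sp_bytefb.model")

rare = "𩸽"  # a character unlikely to be in the training corpus
print(sp_default.encode(rare))               # contains the <unk> id (0 by default)
print(sp_bytefb.encode(rare, out_type=str))  # byte pieces such as '<0xF0>', '<0xA9>', ...
```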
-
I get poor results for translate_ende_wmt_bpe32k (this may be related to issues #317, #309), and I believe this is due to a problem with reserved tokens not being respected by TokenTextEncoder, and th…
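A quick sanity check for this, as a sketch: tensor2tensor's text_encoder reserves `<pad>` as id 0 and `<EOS>` as id 1, so the BPE vocab file handed to TokenTextEncoder should either start with them or be loaded so that they get prepended (I believe the constructor takes a num_reserved_ids argument for this). The file name below is only an example.

```python
# Sketch: verify that the first two vocab entries are the reserved tokens
# tensor2tensor expects ("<pad>" -> 0, "<EOS>" -> 1).
RESERVED = ["<pad>", "<EOS>"]

def check_reserved(vocab_path):
    with open(vocab_path, encoding="utf-8") as f:
        first_two = [next(f).strip().strip("'\"") for _ in range(2)]
    ok = first_two == RESERVED
    print("first two entries:", first_two,
          "-> reserved tokens respected" if ok else "-> mismatch")
    return ok

check_reserved("vocab.bpe.32000")  # example file name
```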
-
```
[2019-12-06 00:18:07] Using single-device training
[2019-12-06 00:18:07] [data] Loading vocabulary from JSON/Yaml file /191206/source_vocab.yml
[2019-12-06 00:18:08] [data] Setting vocabulary size…
```
-
Data is tokenized two times (a sketch of the double pass follows the list):
1. With Stanford CoreNLP : https://github.com/nlpyang/PreSumm/blob/ba17e95de8cde9d5ddaeeba01df7cace584511b2/src/prepro/data_builder.py#L110
2. With HuggingFace's Bert…
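A sketch of the double pass, just to make the pipeline concrete; the BERT model name is an assumption and the CoreNLP output is faked:

```python
# Step 1 output (pretend this came from Stanford CoreNLP in data_builder.py);
# step 2 re-tokenizes it with HuggingFace's BERT tokenizer into WordPieces.
from transformers import BertTokenizer

bert_tok = BertTokenizer.from_pretrained("bert-base-uncased")  # assumed model name

corenlp_tokens = ["The", "model", "summarizes", "long-form", "documents", "."]

subtokens = bert_tok.tokenize(" ".join(corenlp_tokens))
print(subtokens)  # the WordPiece subtokens produced by the second pass
```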
-
## Summary
This repo's tokenizer works consistently with [huggingface's tokenizer](https://github.com/huggingface/tokenizers) in most cases, and works inconsistently, but possibly better, with [bge zh …
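A small consistency harness along these lines, as a sketch; `repo_tokenize` is a placeholder for whatever entry point this repo exposes, and the tokenizer.json path is an example:

```python
# Compare this repo's tokenizer (placeholder) against huggingface's tokenizers
# on a few sample strings and flag mismatches.
from tokenizers import Tokenizer

hf_tok = Tokenizer.from_file("tokenizer.json")  # example path

def repo_tokenize(text):
    # Stand-in for this repo's tokenizer; replace with its real entry point.
    return text.split()

samples = ["The quick brown fox.", "今天天气很好。"]
for text in samples:
    hf_tokens = hf_tok.encode(text).tokens
    repo_tokens = repo_tokenize(text)
    if hf_tokens == repo_tokens:
        print("consistent:", text)
    else:
        print("inconsistent:", text)
        print("  hf:  ", hf_tokens)
        print("  repo:", repo_tokens)
```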
-
Currently, we use SentencePiece in the Tokenizer for our models that contain ZH/JA, in which no space serves as a natural word boundary.
The SentencePiece model is applied after Tokenizer's `none` mode.
`node…
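A minimal sketch of this setup, assuming `none` mode is a pure passthrough and SentencePiece then does all of the segmentation (the model file name is a placeholder):

```python
# Passthrough "none" mode followed by SentencePiece segmentation.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="zh_ja.model")  # placeholder model

def tokenizer_none_mode(line):
    # `none` mode: no whitespace- or rule-based pre-tokenization at all.
    return line

line = "東京タワーはとても高いです"
pieces = sp.encode(tokenizer_none_mode(line), out_type=str)
print(pieces)  # SentencePiece pieces with '▁' spacers inserted by the model
```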