-
@cshanbo @lukaszkaiser
Hi, I've read all of your discussion in #111, but I don't know the results of your tests on the segmented and non-segmented Chinese datasets. I'm using t2t on en-zh translation…
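For reference, a minimal sketch of the two input styles being compared, segmented vs. non-segmented Chinese; jieba is only an assumed segmenter for illustration and is not part of t2t.

```python
# Sketch comparing word-segmented vs. character-stream input for a Chinese
# sentence. jieba is an assumption for the segmenter; t2t does not require it.
import jieba

sentence = "今天天气很好"

# "Segmented" dataset style: words separated by spaces before subword encoding.
segmented = " ".join(jieba.cut(sentence))

# "Non-segmented" dataset style: the raw string, no spaces for the encoder to use.
non_segmented = sentence

print(segmented)      # e.g. 今天 天气 很 好
print(non_segmented)  # 今天天气很好
```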
-
Hi, when I am running TranslateEnzhWmt8k and set the vocabulary size to 8K, the generated Chinese vocabulary is normal, but when I set it to 32K the file contains lots of lines like this:
It seems…
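A quick way to quantify those odd-looking lines is to count the vocab entries that are codepoint escapes; the sketch below assumes the tensor2tensor convention of writing out-of-alphabet characters as `\<codepoint>;`, and the file name is only an example.

```python
# Rough sketch: count vocab entries stored as "\<codepoint>;" escapes.
# The escape convention and the file name are assumptions about the t2t output.
import re

ESCAPE_RE = re.compile(r"\\\d+;")

def count_escaped_entries(vocab_path):
    total = escaped = 0
    with open(vocab_path, encoding="utf-8") as f:
        for line in f:
            total += 1
            if ESCAPE_RE.search(line):
                escaped += 1
    return total, escaped

total, escaped = count_escaped_entries("vocab.zh.32768.subwords")  # example name
print(f"{escaped}/{total} entries contain codepoint escapes")
```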
-
Your reported M2 score on CoNLL2014 is 57.53.
In your paper, the M2 score is 55.8.
-
Hi, thank you for the great work and for your continued involvement with the community over the years.
I am preparing fine-tuning data for the en-zh and zh-en Helsinki models for use in Marian NMT.
I t…
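A common way to prepare such data is to run it through the SentencePiece models shipped with the OPUS-MT release before handing it to Marian; the sketch below assumes the model files are called source.spm and target.spm (adjust to the files in your release).

```python
# Minimal sketch: apply the release's SentencePiece models to a parallel corpus
# before Marian fine-tuning. File names are placeholders.
import sentencepiece as spm

src_sp = spm.SentencePieceProcessor(model_file="source.spm")
tgt_sp = spm.SentencePieceProcessor(model_file="target.spm")

def encode_corpus(in_path, out_path, sp):
    """Encode each line into space-joined SentencePiece pieces."""
    with open(in_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            pieces = sp.encode(line.strip(), out_type=str)
            fout.write(" ".join(pieces) + "\n")

encode_corpus("train.en", "train.sp.en", src_sp)
encode_corpus("train.zh", "train.sp.zh", tgt_sp)
```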
-
SentencePiece does not split unknown tokens by default, which results in a larger vocabulary in tokenized corpora. Also, the Tokenizer does not seem to work very well with the spacer; more segmentation is made by …
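A small sketch of the behaviour in question: by default a character outside the training data becomes a single <unk> id, whereas training with byte_fallback decomposes it into byte pieces. byte_fallback is only one possible workaround, not necessarily the fix wanted here; file names and the vocab size are placeholders.

```python
# Illustrates the default <unk> behaviour vs. byte_fallback.
# Placeholders: corpus.txt, model prefixes, vocab_size.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus.txt", model_prefix="sp_default", vocab_size=8000)
spm.SentencePieceTrainer.train(
    input="corpus.txt", model_prefix="sp_bytefb", vocab_size=8000,
    byte_fallback=True)

sp_default = spm.SentencePieceProcessor(model_file="sp_default.model")
sp_bytefb = spm.SentencePieceProcessor(model_file="sp_bytefb.model")

rare = "𩸽"  # a character unlikely to be in the training corpus
print(sp_default.encode(rare))               # contains the <unk> id (0 by default)
print(sp_bytefb.encode(rare, out_type=str))  # byte pieces such as '<0xF0>', '<0xA9>', ...
```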
-
I get poor results for translate_ende_wmt_bpe32k (this may be related to issues #317, #309), and I believe this is due to a problem with reserved tokens not being respected by TokenTextEncoder, and th…
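A quick sanity check for this, as a sketch: tensor2tensor's text_encoder reserves `<pad>` as id 0 and `<EOS>` as id 1, so the BPE vocab file handed to TokenTextEncoder should either start with them or be loaded so that they get prepended (I believe the constructor takes a num_reserved_ids argument for this). The file name below is only an example.

```python
# Sketch: verify that the first two vocab entries are the reserved tokens
# tensor2tensor expects ("<pad>" -> 0, "<EOS>" -> 1).
RESERVED = ["<pad>", "<EOS>"]

def check_reserved(vocab_path):
    with open(vocab_path, encoding="utf-8") as f:
        first_two = [next(f).strip().strip("'\"") for _ in range(2)]
    ok = first_two == RESERVED
    print("first two entries:", first_two,
          "-> reserved tokens respected" if ok else "-> mismatch")
    return ok

check_reserved("vocab.bpe.32000")  # example file name
```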
-
```
[2019-12-06 00:18:07] Using single-device training
[2019-12-06 00:18:07] [data] Loading vocabulary from JSON/Yaml file /191206/source_vocab.yml
[2019-12-06 00:18:08] [data] Setting vocabulary size…
```
-
Data is tokenized two times (a sketch of the double pass follows the list):
1. With Stanford CoreNLP : https://github.com/nlpyang/PreSumm/blob/ba17e95de8cde9d5ddaeeba01df7cace584511b2/src/prepro/data_builder.py#L110
2. With HuggingFace's Bert…
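A sketch of the double pass, just to make the pipeline concrete; the BERT model name is an assumption and the CoreNLP output is faked:

```python
# Step 1 output (pretend this came from Stanford CoreNLP in data_builder.py);
# step 2 re-tokenizes it with HuggingFace's BERT tokenizer into WordPieces.
from transformers import BertTokenizer

bert_tok = BertTokenizer.from_pretrained("bert-base-uncased")  # assumed model name

corenlp_tokens = ["The", "model", "summarizes", "long-form", "documents", "."]

subtokens = bert_tok.tokenize(" ".join(corenlp_tokens))
print(subtokens)  # the WordPiece subtokens produced by the second pass
```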
-
## Summary
This repo's tokenizer works consistently with [huggingface's tokenizer](https://github.com/huggingface/tokenizers) in most cases, and works inconsistently, but possibly better, with [bge zh …
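A small consistency harness along these lines, as a sketch; `repo_tokenize` is a placeholder for whatever entry point this repo exposes, and the tokenizer.json path is an example:

```python
# Compare this repo's tokenizer (placeholder) against huggingface's tokenizers
# on a few sample strings and flag mismatches.
from tokenizers import Tokenizer

hf_tok = Tokenizer.from_file("tokenizer.json")  # example path

def repo_tokenize(text):
    # Stand-in for this repo's tokenizer; replace with its real entry point.
    return text.split()

samples = ["The quick brown fox.", "今天天气很好。"]
for text in samples:
    hf_tokens = hf_tok.encode(text).tokens
    repo_tokens = repo_tokenize(text)
    if hf_tokens == repo_tokens:
        print("consistent:", text)
    else:
        print("inconsistent:", text)
        print("  hf:  ", hf_tokens)
        print("  repo:", repo_tokens)
```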
-
Currently, we use SentencePiece in the Tokenizer for our models that contain ZH/JA, in which no space serves as a natural word boundary.
The SentencePiece model is applied after Tokenizer's `none` mode.
`node…
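A minimal sketch of this setup, assuming `none` mode is a pure passthrough and SentencePiece then does all of the segmentation (the model file name is a placeholder):

```python
# Passthrough "none" mode followed by SentencePiece segmentation.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="zh_ja.model")  # placeholder model

def tokenizer_none_mode(line):
    # `none` mode: no whitespace- or rule-based pre-tokenization at all.
    return line

line = "東京タワーはとても高いです"
pieces = sp.encode(tokenizer_none_mode(line), out_type=str)
print(pieces)  # SentencePiece pieces with '▁' spacers inserted by the model
```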