-
I have a large corpus, around 40GB of text. I installed subword-nmt via pip and tried to build the dictionary with the subword-nmt command line, but it takes forever to finish. I just wonder whether there are any s…
-
I trained two vocabularies with about 900M of Chinese-English material, and then encoded two data sets (a 900M training set and a 500K test set) with these two Chinese-English vocabularies.
The training se…
-
Hi,
It is mentioned in the paper that a SentencePiece vocab of size 5K was created for both English and Portuguese. So was something like `max_length` set for the sentences, or did you use all …
-
Here are some feedbacks we got in class yesterday.
1. Chinese and Japanese don't use whitespace, and their characters are logograms. (The number of unique characters is large.) What happens if we tr…
-
I understand that by removing the `@@ ` symbols I get back the input text, but how can I identify the smallest subunits in the processed text?
If, for example, I have `di@@ rect`, how can I figure…
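For what it's worth, a minimal sketch of how the `@@ ` convention can be interpreted (assuming the default subword-nmt separator; the helper names below are made up for illustration): each whitespace-delimited token is one subword unit, and a trailing `@@` means the unit continues into the next token of the same word.

```python
# Sketch: working with BPE output that uses the subword-nmt "@@ " convention.
# Assumes the default separator "@@"; helper names are illustrative, not from any library.

def subword_units(bpe_text: str) -> list:
    """Each whitespace-delimited token is one subword unit."""
    return bpe_text.split()

def detokenize(bpe_text: str) -> str:
    """Undo BPE by removing the '@@ ' continuation markers."""
    return bpe_text.replace("@@ ", "")

def words_with_pieces(bpe_text: str) -> list:
    """Group subword units back into the words they compose.

    A unit ending in '@@' continues into the next unit of the same word;
    a unit without the marker closes the current word.
    """
    words, current = [], []
    for unit in bpe_text.split():
        current.append(unit)
        if not unit.endswith("@@"):
            words.append(current)
            current = []
    return words

print(subword_units("di@@ rect trans@@ lation"))   # ['di@@', 'rect', 'trans@@', 'lation']
print(detokenize("di@@ rect trans@@ lation"))      # direct translation
print(words_with_pieces("di@@ rect trans@@ lation"))
# [['di@@', 'rect'], ['trans@@', 'lation']]
```

So in `di@@ rect`, the two subunits are `di@@` and `rect`; the marker only tells you where a word was split, not how the merges were learned.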
-
NOTE: I'm referring to the RESULTS file on the current Kaldi commit, not goodatleas/zeroth.
Hi, I tried running the provided zeroth_korean recipes on Kaldi. I didn't change anything in the scri…
-
# Next paper candidates
Let's propose papers to study next! All papers mentioned in the comments of this issue will be listed in the next vote.
-
I used fairseq-interactive and fairseq-generate respectively to decode the same file, but the results are slightly different. The result generated by fairseq-generate outperformed the result from fairs…
-
I would like to use fastText for languages that don't have clear word boundaries, such as Chinese, Japanese, Thai, or Vietnamese. I have found various software tools to segment text from these languages …
-
# Next paper candidates
Let's propose papers to study next! All papers mentioned in the comments of this issue will be listed in the next vote.
## Last session runner-up
[Graph Attention Networ…