-
Hi to the community!
Recently I've been training a BPE tokenizer on an existing large corpus (reading it all into memory is not feasible).
The corpus is not a common one-text-per-line file (for ex…
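One way to avoid reading everything into memory is to train from a lazy iterator. A minimal sketch, assuming a hypothetical helper `iter_texts`; the per-record extraction inside the loop would be adapted to whatever non one-text-per-line format the corpus actually uses:

```python
def iter_texts(paths, encoding="utf-8"):
    """Lazily yield one training text at a time, so the whole corpus
    never has to fit in memory. Adapt the per-line extraction below
    to the corpus's actual (non one-text-per-line) format."""
    for path in paths:
        with open(path, encoding=encoding) as fh:
            for line in fh:
                text = line.strip()
                if text:
                    yield text

# An iterator like this can be handed to a trainer that accepts iterators
# (for example, HuggingFace tokenizers' Tokenizer.train_from_iterator),
# which consumes texts lazily instead of loading files up front.
```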
-
### 🐛 Describe the bug
An attempt to run this in Colab, Docker, etc. on a GPU fails due to a segmentation fault (see trace below):
https://github.com/NVIDIA/NeMo/blob/main/tutorials/asr/ASR_with_Sub…
-
Is this recipe used to tune the Tatoeba models that were already trained? I am hoping to provide data to it to tune multilingual Tatoeba models, but I am not sure where this recipe is pulling data from…
-
How can I use BPE-dropout? I don't see any change in the output if I try different alpha values for the BPE model.
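For reference, BPE-dropout only has an effect if stochastic segmentation is actually enabled at encode time; with deterministic encoding, alpha is simply ignored. A minimal pure-Python sketch of the idea, where each applicable merge is skipped with probability `alpha` (the function name and merge-table format here are illustrative, not any library's API):

```python
import random

def bpe_dropout_encode(word, merges, alpha=0.1, rng=None):
    """Greedy BPE encoding where each candidate merge is dropped with
    probability `alpha` (BPE-dropout). alpha=0 gives ordinary BPE;
    alpha=1 falls back to single characters."""
    rng = rng or random.Random()
    symbols = list(word)
    while True:
        best = None  # (rank, position) of the best surviving merge
        for i in range(len(symbols) - 1):
            pair = (symbols[i], symbols[i + 1])
            if pair in merges and rng.random() >= alpha:
                rank = merges[pair]
                if best is None or rank < best[0]:
                    best = (rank, i)
        if best is None:
            break
        _, i = best
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols
```

With a real library the switch is usually explicit; in SentencePiece, for instance, sampling is only applied when `enable_sampling=True` is passed to `encode`, so the default deterministic call produces identical output for every alpha.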
-
Here are a few feature requests and bugs related to the punctuation and capitalization model.
### Punctuation issues
#### Inverted punctuation
For languages like Spanish, we need two predictions per …
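To make the request concrete, here is a hypothetical sketch (names and output format are assumptions, not the model's actual API) of what per-token pre- and post-punctuation predictions could look like for Spanish, where `¿`/`¡` attach before a token and `?`/`!` after:

```python
def apply_punct(tokens, pre_preds, post_preds):
    """Rebuild text from tokens plus one pre- and one post-punctuation
    prediction per token (i.e. two predictions per token)."""
    return " ".join(
        f"{pre}{tok}{post}"
        for tok, pre, post in zip(tokens, pre_preds, post_preds)
    )

# "¿" predicted before the first token, "?" after the last:
apply_punct(["como", "estas"], ["¿", ""], ["", "?"])  # → "¿como estas?"
```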
-
# 🌟 New model addition
Hi!
I was wondering whether there's been any work on adding the 12B version of the m2m100 model to huggingface.
Given libraries such as fairscale or parallelformers, inference wit…
-
The standard analyzer in Lucene is not exactly Unicode-friendly with regard to breaking text into words, especially for non-alphabetic scripts. This is because it is unaware of unicode b…
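As an illustration of the underlying problem (in Python rather than Java, and independent of any particular Lucene version): a tokenizer that just takes maximal runs of word characters has no notion of script-specific word boundaries, so a space-free CJK phrase comes out as a single token, whereas UAX #29-style segmentation (e.g. Lucene's ICU-based `ICUTokenizer`) would break it further:

```python
import re

def naive_tokens(text):
    """Split on maximal runs of word characters -- roughly what a
    script-unaware analyzer does when breaking text into words."""
    return re.findall(r"\w+", text)

naive_tokens("hello, world!")  # ['hello', 'world']
naive_tokens("统一码分词")       # a single token: no boundaries inside the Han run
```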
-
Background: I am trying to build an automated pipeline to segment sentences from the output of the Google Speech-to-Text service.
Issue: The `-s` parameter does not work as expected. See details below. An…
-
Hi again,
After getting the NaN loss error from the previous issue, I launched another training run over the weekend:
```
python3 pretrain_nmt.py -n 1 -nr 0 -g 2 --model_path models/bart_base_512 \…
```
-
Using this repo: https://github.com/kh-kim/subword-nmt, because of too many unique values in the tokenizer vocab.
Ldoun updated 2 years ago