-
We are currently using the subword-nmt BPE tokenizer for a job and rely on its "Glossary" parameter to be able to ignore certain symbols using regular expressions.
I understand that Tokenizers has the …
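The glossary mechanism described above can be sketched in plain Python: spans matching a protected regex bypass the subword tokenizer and are emitted whole, while everything else is segmented normally. The function and the toy subword splitter below are illustrative, not subword-nmt's actual API.

```python
import re

def tokenize_with_glossary(text, subword_fn, glossary_patterns):
    """Protect glossary matches from subword segmentation.

    glossary_patterns: list of regex strings for symbols to keep intact
    (a hypothetical helper mirroring the idea behind --glossaries).
    """
    combined = re.compile("(" + "|".join(glossary_patterns) + ")")
    out = []
    for piece in combined.split(text):
        if not piece:
            continue
        if combined.fullmatch(piece):
            out.append(piece)              # protected symbol, kept whole
        else:
            out.extend(subword_fn(piece))  # everything else is segmented
    return out

# toy "subword" function standing in for a real BPE encoder
toy_subword = lambda s: s.split()
print(tokenize_with_glossary("keep <URL> intact", toy_subword, [r"<[A-Z]+>"]))
# ['keep', '<URL>', 'intact']
```

The key point is that the protected span never reaches the subword model at all, so no merge rules or vocabulary filtering can alter it.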
-
BERT and related models use statistical subword tokenization algorithms, which handle out-of-vocabulary words well in ML models. High-speed implementations of BPE / WordPiece etc. would be a good addit…
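The out-of-vocabulary behavior mentioned above can be illustrated with a toy greedy longest-match-first tokenizer in the WordPiece style; the vocabulary here is made up for the example, and this is a sketch of the idea rather than any library's implementation.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first subword split, WordPiece-style.

    Continuation pieces carry the '##' prefix. An unseen word is
    decomposed into known fragments instead of becoming one [UNK],
    which is why subword models cope with out-of-vocabulary input.
    """
    pieces, start = [], 0
    while start < len(word):
        end, cur = len(word), None
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub   # mark a non-initial piece
            if sub in vocab:
                cur = sub
                break
            end -= 1               # shrink the candidate from the right
        if cur is None:
            return [unk]           # no fragment matched at all
        pieces.append(cur)
        start = end
    return pieces

vocab = {"token", "##ize", "##r", "play", "##ing"}
print(wordpiece_tokenize("tokenizer", vocab))  # ['token', '##ize', '##r']
```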
-
Hello Guillaume,
I ran into an issue when using `vocabulary_path`.
Normally, with `vocabulary_path` set, we would expect the output sentence not to contain vocabulary below a certain threshold …
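The expected threshold behavior can be sketched as a simple filter over a "token count" vocabulary file: entries whose frequency falls below the threshold are dropped, so the tokenizer must fall back to smaller known units. The function name and file format here are illustrative, not the library's actual API.

```python
def load_vocab(lines, threshold):
    """Keep only vocabulary entries at or above the frequency threshold.

    lines: iterable of "token count" strings, as in a typical subword
    vocabulary file. Mirrors the expected effect of a vocabulary_path
    plus threshold setting (names are assumptions for this sketch).
    """
    vocab = {}
    for line in lines:
        token, count = line.split()
        if int(count) >= threshold:
            vocab[token] = int(count)
    return vocab

lines = ["the 1000", "rare@@ 3", "word 50"]
print(load_vocab(lines, threshold=10))  # {'the': 1000, 'word': 50}
```

If the output still contains sub-threshold tokens, the filter above is evidently not being applied at segmentation time, which is the discrepancy the issue describes.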
-
This may not exactly be an issue, but rather a question I could not find an answer to in the documentation. I hope this is the correct platform for such questions.
I am trying to work with Turkish, and…
-
Hello,
I want to train a vocabulary on a custom text corpus and later add this vocabulary to the pre-trained BERT vocabulary.
The thing is that the pre-trained vocabulary has its intra-word boundary …
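One common way to extend a pre-trained WordPiece vocabulary is to append the new subwords after the existing entries, so that no pre-trained token ID shifts. The sketch below assumes plain token lists; in practice the model's embedding matrix must also be resized to match, and the `##` continuation prefix of any new intra-word pieces has to agree with the base vocabulary's convention.

```python
def merge_vocabs(base_tokens, new_tokens):
    """Append new tokens after the base vocabulary.

    Existing token IDs must not change, or the pre-trained embedding
    matrix no longer lines up, so new entries only go at the end and
    duplicates are skipped. Illustrative helper, not a library API.
    """
    seen = set(base_tokens)
    merged = list(base_tokens)
    for tok in new_tokens:
        if tok not in seen:
            merged.append(tok)
            seen.add(tok)
    return merged

base = ["[PAD]", "[UNK]", "play", "##ing"]
print(merge_vocabs(base, ["##ing", "new", "##word"]))
# ['[PAD]', '[UNK]', 'play', '##ing', 'new', '##word']
```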
-
This tracks the testing status of #146 with existing projects.
[Header-only docs](https://github.com/cfis/rice/blob/dev/README.md)
Each project uses the `rice-header-only` branch
Project | St…
-
Hello,
I wanted to use BARThez with HuggingFace, but it seems I can't load the BARThez checkpoint.
I tried to execute your HuggingFace example:
```python
text_sentence = "Paris est la cap…
-
Hi there,
I recently started going through the code in this repository after having read your paper, which I found very fascinating.
I would be very interested in trying to reproduce the results…
-
### Bug description
When a line that starts with too many encoded apostrophes (i.e. `&apos;`) is passed as input, marian-decoder stops on it, ignoring the rest of the input. For example, giving it …
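A possible pre-processing workaround, assuming the problematic input contains HTML-escaped apostrophes such as `&apos;`, is to normalize the entities before the line ever reaches marian-decoder. This is a stdlib sketch of that idea, not a fix for the decoder itself.

```python
import html

def unescape_line(line):
    """Resolve HTML entities (e.g. &apos;, &quot;) before decoding.

    Workaround sketch: normalizing the input avoids feeding the
    decoder long runs of escaped apostrophes in the first place.
    """
    return html.unescape(line)

print(unescape_line("&apos;&apos;Twas brillig"))  # ''Twas brillig
```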
-
Is the `--sentencepiece-alphas` option in the Marian CLI the same as the alpha in https://github.com/google/sentencepiece/blob/master/src/bpe_model.h#L43, used to support BPE dropout when called at https://github.com/…
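For reference, BPE dropout randomly skips individual merge operations with probability alpha during segmentation, producing different subword splits of the same word across epochs. The toy sketch below illustrates the mechanism with a made-up merge list; it is not Marian's or SentencePiece's actual code path, and real implementations apply merges by priority rather than one rule at a time.

```python
import random

def bpe_dropout(word, merges, alpha, rng):
    """Apply BPE merges, skipping each candidate merge with prob. alpha.

    merges: ordered list of (left, right) symbol pairs.
    alpha=0.0 reduces to deterministic BPE; alpha=1.0 keeps characters.
    Toy illustration of BPE dropout only.
    """
    symbols = list(word)
    for left, right in merges:
        i = 0
        while i < len(symbols) - 1:
            if (symbols[i], symbols[i + 1]) == (left, right) and rng.random() >= alpha:
                symbols[i:i + 2] = [left + right]  # merge survives dropout
            else:
                i += 1
    return symbols

merges = [("l", "o"), ("lo", "w")]
print(bpe_dropout("low", merges, alpha=0.0, rng=random.Random(0)))  # ['low']
print(bpe_dropout("low", merges, alpha=1.0, rng=random.Random(0)))  # ['l', 'o', 'w']
```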