-
It would be great to see an EN-PL model!
-
I've found at least one bug in the implementation:
https://github.com/hplt-project/OpusTrainer/issues/53
-
**Describe the bug**
Many unit tests in the NLP collection (over 10) require the correct version of the ``/home/TestData`` folder (from internal CI machines) to be present in order to run successfully.
**This make…
-
@taku910
Hi team, how can I stop the model from tokenizing numbers?
I tried `split_by_number=False` and `split_by_digit=False`,
but numbers are still being tokenized into multiple digits.
Example I…
-
### Feature request
Give access to setting a `pre_tokenizer` for a `transformers.PreTrainedTokenizer`, similar to how this works for `PreTrainedTokenizerFast`.
### Motivation
As far as I un…
-
### Description
Running `pythainlp.romanize("ไกรฤกษ์ โชติวุฒิวินิจ")` throws an `IndexError`.
### Expected results
It should return something like "Krairiksh Chotiwutwinit" (approximately).
…
-
### System Info
ValueError: An instance of tokenizer class BioGptTokenizer cannot be converted in a Fast tokenizer instance. No converter was found.
I am using microsoft/biogpt for token classifi…
-
_Obviously_ has to be in Rust, as we desperately need to be trendy. Jokes aside, it'd be a good opportunity to enhance the tool further:
- [x] single binary is nice, users don't need to install a P…
-
I encountered the following problem while training the 3D point cloud model:
```bash
[2023-07-20 08:48:30,841][torch_points3d.datasets.base_dataset][INFO] - Available stage selection datasets: ['…
-
In this case the llama.cpp and the llama tokenizers produce different output:
```
main: prompt: 'This is 🦙.cpp'
main: number of tokens in prompt = 10
1 -> ''
4013 -> 'This'
338 -> ' …