-
### System Info
- `transformers` version: 4.45.2
- Platform: Linux-5.4.0-193-generic-x86_64-with-glibc2.31
- Python version: 3.12.7
- Huggingface_hub version: 0.25.2
- Safetensors version: 0.4.5
…
-
Since OpenSearch 2.13, the [**fixed token length algorithm**](https://opensearch.org/docs/latest/ingest-pipelines/processors/text-chunking/#fixed-token-length-algorithm) has been available in the text chunking proc…
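A minimal sketch of registering such a pipeline with opensearch-py is below; the pipeline id, field names, and parameter values are illustrative assumptions, not taken from the excerpt.

```python
# Sketch: ingest pipeline using the text_chunking processor with the
# fixed token length algorithm (field names and values are assumptions).
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

client.ingest.put_pipeline(
    id="text-chunking-pipeline",
    body={
        "description": "Chunk long text into fixed-token-length passages",
        "processors": [
            {
                "text_chunking": {
                    "algorithm": {
                        "fixed_token_length": {
                            "token_limit": 384,    # max tokens per chunk
                            "overlap_rate": 0.2,   # overlap between consecutive chunks
                            "tokenizer": "standard",
                        }
                    },
                    "field_map": {"body": "body_chunks"},
                }
            }
        ],
    },
)
```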
-
### System Info
latest transformers
### Who can help?
@ArthurZucker
### Information
- [ ] The official example scripts
- [x] My own modified scripts
### Tasks
- [ ] An officiall…
-
If I send in "17 júní" the tokenizer returns "17. júní", even though I use tokenize() (and not split_into_sentences()) and use the txt property (which should contain the original source text for the toke…
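A minimal repro sketch, assuming the report refers to the `tokenize()` API of the mideind Tokenizer package (the exact snippet from the report is truncated above):

```python
# Sketch: print the txt property of each token produced for "17 júní".
from tokenizer import tokenize

tokens = list(tokenize("17 júní"))
print([tok.txt for tok in tokens if tok.txt])
# The report says a period is introduced ("17. júní"),
# even though txt should hold the original source text.
```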
-
Hi.
First of all, thank you for making such a model available to us.
I am trying to get vector embeddings for the abstracts of some PubMed articles, but somehow I couldn't get the sentence embe…
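In case it helps, a minimal sketch of mean-pooling the last hidden states into one vector per abstract; the checkpoint name is a placeholder assumption, not the model from this report:

```python
# Sketch: masked mean pooling over the last hidden states of an encoder.
# The checkpoint is a placeholder, not the model from the report.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "bert-base-uncased"  # swap in the biomedical checkpoint you use
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

abstract = "We study the effect of ..."  # one PubMed abstract
inputs = tokenizer(abstract, truncation=True, max_length=512, return_tensors="pt")

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state           # (1, seq_len, hidden)

mask = inputs["attention_mask"].unsqueeze(-1)             # (1, seq_len, 1)
embedding = (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # masked mean pooling
print(embedding.shape)                                     # e.g. torch.Size([1, 768])
```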
-
Hi,
I was trying to create a custom tokenizer for a different language that is not covered by the Llama 3.2 tokenizer.
I could not find exactly which tokenizer from HF would be an exact altern…
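One possible direction (a sketch, assuming a fast tokenizer and an in-memory corpus; the repo id and vocabulary size are assumptions) is to retrain the existing Llama 3.2 tokenizer on the new language with `train_new_from_iterator`:

```python
# Sketch: retrain the Llama 3.2 tokenizer's subword model on a new-language corpus.
# The corpus and vocab size are illustrative assumptions.
from transformers import AutoTokenizer

old_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")

corpus = ["First sentence in the new language.", "Second sentence ..."]

def batch_iterator(batch_size=1000):
    for i in range(0, len(corpus), batch_size):
        yield corpus[i : i + batch_size]

new_tokenizer = old_tokenizer.train_new_from_iterator(batch_iterator(), vocab_size=32000)
new_tokenizer.save_pretrained("my-new-language-tokenizer")
```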
-
I am trying to do a full fine-tune of Llama 3.2-1B to "teach" it another language (via continuous pretraining).
The idea is to have a model which, given a prompt in a language, continues the sentences in…
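For reference, a minimal sketch of continued pretraining as plain causal language modeling with the `Trainer`; the dataset path, hyperparameters, and repo id are illustrative assumptions:

```python
# Sketch of continued (full) pretraining of Llama 3.2-1B on raw text.
# Dataset file and hyperparameters are illustrative assumptions.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "meta-llama/Llama-3.2-1B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

raw = load_dataset("text", data_files={"train": "corpus.txt"})["train"]
tokenized = raw.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=1024),
    batched=True, remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="llama-continued", per_device_train_batch_size=1,
                           gradient_accumulation_steps=16, num_train_epochs=1,
                           learning_rate=1e-5, bf16=True, logging_steps=50),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```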
-
punkt is loaded as a pickle file, which is not secure (CVE-2024-39705), so you have to use punkt_tab now.
This breaks `_get_sentence_tokenizer`.
In order to use the Tokeniser class I had to overrid…
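A sketch of the punkt_tab side of the workaround, assuming NLTK 3.9+; the exact override of `_get_sentence_tokenizer` depends on the library's Tokeniser class and is not shown here:

```python
# Sketch of the punkt_tab-based replacement (assumes NLTK >= 3.9).
import nltk

nltk.download("punkt_tab")  # table-based data, replaces the pickled "punkt" models

# sent_tokenize now loads punkt_tab instead of the insecure pickle:
sentences = nltk.sent_tokenize("First sentence. Second sentence.")
print(sentences)

# If the library caches a tokenizer object, an override of _get_sentence_tokenizer
# could return nltk.tokenize.PunktTokenizer("english") instead of loading the pickle
# (PunktTokenizer is the punkt_tab loader added in recent NLTK releases).
```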
-
Hello, thank you for the library.
I've written a free program for learning languages called Lute (https://github.com/LuteOrg/lute-v3), and it would be nice to add Thai support. This library looks …
-
```python
from unsloth import FastLanguageModel
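
# Load a 4-bit quantized Mistral 7B base model and its tokenizer via Unsloth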
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="unsloth/mistral-7b-bnb-4bit",
max_seq_length=2048,
    load_in_4bit=True,
)
```