-
SoMeWeta uses the STTS_IBK tagset for tagging. One of the differences between STTS and STTS_IBK is the tag AKW for action words, e.g. German *lach* (Beißwenger, Bartz, Storrer and Westpfahl, 2015).
…
-
```python
import torch
from mingpt.bpe import BPETokenizer
tokenizer = BPETokenizer()
print(tokenizer("<|endoftext|>")) # tensor([[ 27, 91, 437, 1659, 5239, 91, 29]])
print(tokenizer.decode(torch.te…
-
First, thank you for this add-on; I needed something to organise my revision process.
I'd like to offer an idea.
When memorising something like T. S. Eliot, I find that the lines in the poem aren't su…
UrKr updated 9 months ago
-
Hi,
Thanks a lot for sharing the code with us, interesting work!
I have a question regarding tokenization for GPT-2.
I've seen that you add an EOS token at the end of every sentence in each text ex…
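For reference, a minimal sketch of the convention being asked about (assuming the usual GPT-2 practice, not this repo's exact code): the end-of-text id is appended after each tokenized document before concatenating everything into one training stream.

```python
# Hedged sketch: GPT-2's <|endoftext|> has id 50256. The document token ids
# below are hypothetical placeholders.
EOT = 50256
docs = [[15496, 995], [31373]]
stream = [tok for doc in docs for tok in doc + [EOT]]
print(stream)  # [15496, 995, 50256, 31373, 50256]
```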
-
Since version [v4.36.0](https://github.com/huggingface/transformers/releases/tag/v4.36.0) of Hugging Face transformers, `prefix_allowed_tokens_fn` is no longer allowed to return an empty set of tokens …
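One workaround is to make the function fall back to a single token instead of returning an empty list. A minimal sketch (the length-keyed constraint table and the EOS fallback are hypothetical, not from this issue):

```python
EOS_TOKEN_ID = 50256  # GPT-2's <|endoftext|>; hypothetical fallback choice

def make_prefix_allowed_tokens_fn(allowed_by_length):
    """Build a fn for generate(prefix_allowed_tokens_fn=...) that never
    returns an empty list, which transformers >= 4.36.0 rejects."""
    def fn(batch_id, input_ids):
        allowed = allowed_by_length.get(len(input_ids), [])
        # Fall back to EOS rather than returning an empty set of tokens.
        return allowed if allowed else [EOS_TOKEN_ID]
    return fn

fn = make_prefix_allowed_tokens_fn({2: [15496]})
print(fn(0, [0, 1]))     # [15496]
print(fn(0, [0, 1, 2]))  # [50256]
```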
-
The merge-sentences modifier uses whitespace tokenization:
https://github.com/hplt-project/OpusTrainer/blob/9ec77d3745823f9e05016700938e6b2ffbb770e0/src/opustrainer/modifiers/merge.py#L12-L…
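As a rough illustration (a sketch, not the linked implementation), merging with whitespace tokenization amounts to joining consecutive lines on spaces, which implicitly assumes space-delimited scripts:

```python
def merge_lines(lines, n):
    """Hypothetical sketch: merge every n consecutive lines into one example,
    treating tokens as whitespace-separated strings."""
    return [" ".join(lines[i:i + n]) for i in range(0, len(lines), n)]

print(merge_lines(["a b", "c d", "e f"], 2))  # ['a b c d', 'e f']
```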
-
### Your current environment
Packages used for both finetuning and inference (vllm==0.3.2):
```
torch==2.1.2
accelerate==0.27.2
transformers==4.40.1
sentence_transformers==2.7.0
```
Description:
…
-
Hello! I'm trying to implement bert-base, but it's not clear to me how you generate the masks with the TAPETokenizer. This is my code:
```python
model = ProteinBertModel.from_pretrained('bert-base')
tokeni…
-
Unfortunately, `BreezeSentencer` uses `Tokenizer.computeOffsets` to compute offsets from the resulting sentences, so simply adding `require(string.forall(!_.isWhitespace))` breaks `BreezeSentencer`.
-
I noticed that a number of things are implemented incorrectly.
```python
classifier = pipeline("sentiment-analysis", device="cpu",
model="distilbert/distilbert-base-uncased-fin…