-
Hi,
I'm using the "**Tokenization task**" and I learned that when there is special characters, the implementation will apply escape character rule on it:
Example:
value: Test$$!Jungle18 will be…
-
This issues describes the high level directions that "create LLM Engine V1". We want the design to be as transparent as possible and created this issue to track progress and solicit feedback.
Goal…
-
SentenceVAE/
│
├── encoder.py
│ ```python
│ import torch
│ from torch import nn
│
│ class SentenceEncoder(nn.Module):
│ '''Sentence Encoder with byte-level BPE tokenization, lear…
-
### Describe the bug
This is a minor issue and probably not affecting the results so much. I noticed that special token IDs are often not matching the tokenizer's configuration. For example in http…
-
Hey guys thanks for this awesome adaptation of CANINE 😊 I've been working on adapting for any language and I came across weird empty masks. I think the problem is in `training/masking.py` in the funct…
-
I took a look through the source but couldn't find a way to serialize the Yaml struct to a string. Any plans to support this?
-
# 🚀 Feature request
For `BertTokenizerFast` (inherited from `PreTrainedTokenizerFast`), it seems like `__call__` only supports truncating from the end of the sequences if we set `truncation` to be …
-
- [ ] [Transformer Models](https://docs.cohere.com/docs/transformer-models#the-softmax-layer)
# Transformer Models
**Description:**
- **Tokenization**
Tokenization is the most basic step. It consi…
-
Hello,
I believe the corpus and the `word_freqs` output used in the [BPE](https://github.com/huggingface/course/blob/main/chapters/en/chapter6/5.mdx#implementing-bpe) / [WordPiece](https://github.c…
-
## Description
https://github.com/dmlc/gluon-nlp/pull/1374 has been merged so we have fixed the warnings in our documents. However, the current structure of the website is not very satisfactory and…