-
## Description
I was experimenting with the `sentence-transformers/msmarco-roberta-base-ance-firstp` model and observed some discrepancies between the outputs of the tokenizer depending on how the …
-
Suppose we have a template sentence like this:
- "The ____ house is our meeting place."
and we have a list of adjectives to fill in the blank, e.g.:
- "yellow"
- "large"
- ""
Note that on…
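Generating the candidate sentences from the template can be sketched in plain Python. The `____` placeholder and the adjective list (including the empty string) come from the example above; the whitespace cleanup is only needed for the empty-string case:

```python
template = "The ____ house is our meeting place."
adjectives = ["yellow", "large", ""]

# Substitute each adjective into the blank; split/join collapses the
# doubled space left behind when the adjective is the empty string.
sentences = [" ".join(template.replace("____", adj).split()) for adj in adjectives]

for s in sentences:
    print(s)
```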
-
In your custom data loader:
```python
class CustomDataset(Dataset):
    def __init__(self, tokenizer, sentences, labels, max_len):
        self.len = len(sentences)
        self.sentences = sen…
-
There is an issue with '\n' not being handled properly in Llama 3. Passing '\n' through tokenizer.encode yields the token ID 198, but generation does not terminate on it appropriately and…
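When generation does not stop on a given token, one workaround is to check the generated id stream manually. A dependency-free sketch of that loop follows; the id 198 comes from the report above, and `step_fn` is a hypothetical stand-in for a single decoding step:

```python
NEWLINE_ID = 198  # id reported above for '\n' in the Llama 3 tokenizer

def generate_until(step_fn, max_new_tokens, stop_ids):
    """Collect tokens from step_fn, stopping once a stop id is emitted."""
    out = []
    for _ in range(max_new_tokens):
        tok = step_fn(out)
        out.append(tok)
        if tok in stop_ids:
            break
    return out

# Toy step function emitting a fixed stream, just to exercise the loop.
stream = iter([10, 11, NEWLINE_ID, 12])
ids = generate_until(lambda prev: next(stream), 8, {NEWLINE_ID})
print(ids)  # stops at the newline id
```

With recent `transformers` versions a similar effect can usually be achieved without a custom loop by passing a list of stop ids, e.g. `model.generate(..., eos_token_id=[tokenizer.eos_token_id, 198])`.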
-
```java
public static void utf8ToGbk() throws Exception {
    String fileName = "c:/tokenizer.json";
    List<String> lines = Files.readAllLines(Paths.get(fileName), StandardCharsets.UTF_8);
    String sentenc…
```
-
It is possible to mark verbs in German that have a prefix in the tokenizer Python script.
If a word is marked and has the same lemma as another word in the same sentence, I think they 99% of the time belong t…
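The lemma-matching heuristic described above can be sketched in plain Python. The lemma lookup table here is a toy assumption (a real script would take lemmas from its tagger, and would have to disambiguate e.g. the prefix "an" from the preposition "an"):

```python
from collections import defaultdict

# Toy lemma lookup for German separable-prefix verbs (assumption).
LEMMAS = {
    "fängt": "anfangen", "an": "anfangen",
    "hört": "aufhören", "auf": "aufhören",
}

def mark_same_lemma(tokens):
    """Group token positions within one sentence that share a lemma."""
    groups = defaultdict(list)
    for i, tok in enumerate(tokens):
        lemma = LEMMAS.get(tok.lower())
        if lemma:
            groups[lemma].append(i)
    # Keep only lemmas occurring more than once: those are the pairs
    # the heuristic says almost certainly belong together.
    return {lemma: pos for lemma, pos in groups.items() if len(pos) > 1}

sentence = "Er fängt morgen mit der Arbeit an".split()
print(mark_same_lemma(sentence))
```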
-
Hi,
I want to check whether combining tf-idf weights with token embeddings gives a better representation for my use case/data (I would love to know what you think about it).
Searching for implementation…
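One common form of that combination is a tf-idf-weighted average of token embeddings. A self-contained sketch with toy 2-d embeddings follows; the smoothed idf formula mirrors scikit-learn's default, and all names and data here are illustrative:

```python
import math
from collections import Counter

def tfidf(docs):
    """Per-document tf-idf weights; docs is a list of token lists."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            # smoothed idf, as in scikit-learn: log((1+n)/(1+df)) + 1
            t: (c / len(doc)) * (math.log((1 + n) / (1 + df[t])) + 1)
            for t, c in tf.items()
        })
    return weights

def weighted_sentence_vector(doc, w, emb, dim):
    """tf-idf-weighted average of the tokens' embedding vectors."""
    vec, total = [0.0] * dim, 0.0
    # Each distinct token contributes once; its count is already in tf.
    for t in set(doc):
        if t in emb:
            for i, x in enumerate(emb[t]):
                vec[i] += w[t] * x
            total += w[t]
    return [x / total for x in vec] if total else vec

# Toy embeddings and a two-document corpus.
emb = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
docs = [["cat", "dog"], ["cat"]]
w = tfidf(docs)
vec = weighted_sentence_vector(docs[0], w[0], emb, dim=2)
print(vec)
```

Because "dog" is rarer in this corpus, its idf is higher, so the sentence vector leans toward the "dog" embedding.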
-
Cannot load the model.
Code:
```python
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("../../models/consbert/unsup-consert-base-atec_ccks")  # the model path
```
Error messag…
-
Hi,
I use `python pytorch_pretrained_BERT/examples/run_lm_finetuning.py` to fine-tune the model on a monolingual set of sentences. I use the BERT multilingual cased model.
Once the model is fine-tuned, I g…