UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

How to get embedding vector when input is tokenized already #2494

Open sogm1 opened 6 months ago

sogm1 commented 6 months ago

First, thank you so much for Sentence Transformers.

How can I get an embedding vector when the input is already tokenized?

I know that Sentence Transformers can do .encode(original_text).

But I would like to know whether there is something like .encode(token_ids) or .encode(token_ids, attention_masks).

Here is my background:

I trained a model using Sentence Transformers, and I added a few layers on top of it for classification.

Now I want to train the model so that all of its parameters (including the added layers) are updated.

But moving a DataLoader batch to CUDA only works with token IDs, not raw text, so I first tokenized the text using model.tokenizer().

So the input is already tokenized, and I need to know how to get embeddings when I only have token_ids.

Regards

tomaarsen commented 6 months ago

Hello!

Are you using a custom training loop? If you added extra layers, the default training probably will not work anymore: it expects the model to output e.g. "sentence_embedding", while your classification model probably outputs class logits instead. So, if you are using a custom training loop, you can indeed tokenize your text yourself and pass the result to the model's forward or __call__ method (they are equivalent in PyTorch).

Something along the lines of:

for batch in dataloader:
    # maybe tokenize the batch if that's not done already
    # maybe move the batch to the right device if that's not done already

    # 'batch' is a dictionary with "input_ids" and "attention_mask" keys
    output = model(batch)

    loss = loss_fn(output)
    loss.backward()
    # maybe optimizer step()
    # maybe scheduler step()

In short, model.encode(...) is the high-level interface for most users, while model(...) or model.forward(...) is how the model is actually called under the hood.
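
As a rough, untested sketch of getting sentence embeddings from already-tokenized input with a plain SentenceTransformer (the model name below is just a placeholder), it would look something like this:

from sentence_transformers import SentenceTransformer
import torch

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model name

# model.tokenize() returns a dict with "input_ids" and "attention_mask" tensors
features = model.tokenize(["An example sentence", "Another example"])
features = {key: value.to(model.device) for key, value in features.items()}

with torch.no_grad():
    # calling the model returns a dict; "sentence_embedding" holds the pooled vectors
    embeddings = model(features)["sentence_embedding"]
print(embeddings.shape)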

Also, this is a bit unrelated, but I've had pretty good luck with just training a Sentence Transformer model without any extra layers, and then training a LogisticRegression on top of it, by using roughly:

# assume that we have `texts` (a list of strings) and `labels`
from sklearn.linear_model import LogisticRegression

X = model.encode(texts)
classifier = LogisticRegression().fit(X, labels)

That could also be worth a shot.
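
For completeness, a rough end-to-end sketch of that approach (assuming scikit-learn is installed and that `texts` and `labels` are plain Python lists) might look like:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# embed all texts once with the frozen Sentence Transformer model
X = model.encode(texts)

# hold out a test split to check how well the classifier generalizes
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.2, random_state=42)

classifier = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Test accuracy:", accuracy_score(y_test, classifier.predict(X_test)))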

sogm1 commented 6 months ago

@tomaarsen

Hello, thank you for the reply.

I have one more question: can my fine-tuned model be trained with the training loop below?

My custom classification model is this:


import torch
import torch.nn as nn

# setting
num_classes = select_top10_df["points_category"].nunique()

# model
class CustomBertModel(nn.Module):
    def __init__(self, model, num_classes):
        super(CustomBertModel, self).__init__()
        self.encoder = model

        # classification head on top of the sentence embedding
        self.dropout = nn.Dropout(0.5)
        self.dense1 = nn.Linear(model[1].word_embedding_dimension, 768)
        self.tanh = nn.Tanh()
        self.dense2 = nn.Linear(768, num_classes)

        # keep the encoder parameters trainable (we want to update all parameters)
        for param in self.encoder.parameters():
            param.requires_grad = True

    def forward(self, batch):
        # the encoder returns a dict; take the pooled sentence embedding
        output = self.encoder(batch)["sentence_embedding"]
        output = self.dropout(output)
        logits = self.dense1(output)
        logits = self.tanh(logits)
        logits = self.dropout(logits)
        logits = self.dense2(logits)
        return logits

model3 = CustomBertModel(fine_tuned_model, num_classes)  # fine_tuned_model: the model trained with Sentence Transformers
model3

The result:

CustomBertModel(
  (encoder): SentenceTransformer(
    (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: RobertaModel 
    (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False})
  )
  (dropout): Dropout(p=0.5, inplace=False)
  (dense1): Linear(in_features=768, out_features=768, bias=True)
  (tanh): Tanh()
  (dense2): Linear(in_features=768, out_features=5, bias=True)
)

And this is my training loop:

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
import datetime

# Config
num_epochs = 15
patience = 3               # early-stopping patience (example value)
best_train_loss = float("inf")
counter = 0

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model3.to(device)

optimizer = optim.Adam(model3.parameters(), lr=1e-5)  # optimize model3, including the encoder
criterion = nn.CrossEntropyLoss()

for epoch in range(num_epochs):
    model3.train()  # Training mode
    total_loss = 0.0
    total_correct = 0

    for batch in train_dataloader:

        inputs = {
            "input_ids": batch["input_ids"].to(device),
            "attention_mask": batch["attention_mask"].to(device),
        }
        labels = batch['labels'].to(device)
        optimizer.zero_grad()
        outputs = model3(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        total_loss += loss.item()
        _, predicted = torch.max(outputs, 1)
        total_correct += (predicted == labels).sum().item()

    avg_loss = total_loss / len(train_dataloader)
    avg_acc = total_correct / len(train_dataset)
    print(f"Epoch [{epoch + 1}/{num_epochs}] - Loss: {avg_loss:.4f}, Accuracy: {avg_acc * 100:.2f}%")

    if avg_loss < best_train_loss:
        best_train_loss = avg_loss
        counter = 0
        # Save the best model
        torch.save(model3.state_dict(), 'my_directory')
    else:
        counter += 1

    # Check if early stopping criteria are met
    if counter >= patience:
        print(f"Early stopping after {epoch + 1} epochs without improvement.")
        break

But I got an error message 😥:


---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-69-168d05b7ae23> in <cell line: 16>()
     30         outputs = model3(inputs)
     31         loss = criterion(outputs, labels)
---> 32         loss.backward()
     33         optimizer.step()
     34 

1 frames
/usr/local/lib/python3.10/dist-packages/torch/_tensor.py in backward(self, gradient, retain_graph, create_graph, inputs)
    490                 inputs=inputs,
    491             )
--> 492         torch.autograd.backward(
    493             self, gradient, retain_graph, create_graph, inputs=inputs
    494         )

/usr/local/lib/python3.10/dist-packages/torch/autograd/__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables, inputs)
    249     # some Python versions print out the first line of a multi-line function
    250     # calls in the traceback and some print out the last line
--> 251     Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
    252         tensors,
    253         grad_tensors_,

RuntimeError: CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.