amoskalev opened this issue 8 months ago
Here's how I'm currently solving this (adapted from the usage example in the README):
from evo import Evo
import torch
device = 'cuda:0'
evo_model = Evo('evo-1-131k-base')
model, tokenizer = evo_model.model, evo_model.tokenizer
model.to(device)
model.eval()
# monkey patch the unembed function with identity
# this removes the final projection back from the embedding space into tokens
# so the "logits" returned by the model are now the final-layer embeddings
# see source for unembed - https://huggingface.co/togethercomputer/evo-1-131k-base/blob/main/model.py#L339
from torch import nn
class CustomEmbedding(nn.Module):
    def unembed(self, u):
        return u

model.unembed = CustomEmbedding()
# end custom code
sequence = 'ACGT'
input_ids = torch.tensor(
    tokenizer.tokenize(sequence),
    dtype=torch.int,
).to(device).unsqueeze(0)
embed, _ = model(input_ids) # (batch, length, embed dim)
print('Embed: ', embed)
print('Shape (batch, length, embed dim): ', embed.shape)
# you can now use embedding for downstream classification tasks
# you probably want to aggregate over position dimension
# e.g. mean value = embed.mean(dim=1) or final token embedding = embed[:, -1, :]
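To turn the per-position embeddings into a single vector for a downstream classifier, the pooling suggested in the comments above can be written out directly (variable names here are just illustrative):
# Mean-pool over the position dimension -> (batch, embed dim)
mean_embedding = embed.mean(dim=1)
# Or take the embedding of the final token (common for causal models)
last_token_embedding = embed[:, -1, :]
print('Pooled shape (batch, embed dim): ', mean_embedding.shape)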
Note that this is for the model object returned by the Evo class, which is an instance of StripedHyena. If you are using the Hugging Face version directly, the model is wrapped in StripedHyenaModelForCausalLM, so you need to do model.backbone.unembed = CustomEmbedding() instead.
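For the Hugging Face route, a minimal sketch might look like the following; the exact loading arguments (revision, dtype, tokenizer) are assumptions and should follow the model card, but the key point is that the patch goes on model.backbone:
import torch
from transformers import AutoConfig, AutoModelForCausalLM

model_name = 'togethercomputer/evo-1-131k-base'
config = AutoConfig.from_pretrained(model_name, trust_remote_code=True)
hf_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    config=config,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,  # assumption: load in BF16 to save memory
)
hf_model.to('cuda:0')
hf_model.eval()

# The StripedHyena module is wrapped, so patch the backbone's unembed
hf_model.backbone.unembed = CustomEmbedding()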
Thanks @davidkell!
@davidkell I tried your code on an A100 40GB using the evo-8k model. Embedding the 4-letter sequence from the example costs over 400 MB of GPU RAM, and the model itself needs 13 GB. I don't understand why it costs so much memory; 4 x 4096 values in BF16 should only take 32 KB, right? I tried to embed a 2 kb sequence but always ran out of CUDA memory. Does anyone have a similar problem?
I had a similar experience. I was able to get inference working for 2 kb sequences on an A100 80GB (e.g. available on Paperspace), although around 2.5-3 kb I would get OOM. I haven't looked in depth at what is driving the memory requirement.
Quoting from this issue https://github.com/evo-design/evo/issues/24:
Prompting with longer sequences requires sharding for the model, which is currently not supported
So I think if you want to generate embeddings for longer sequences, you will need to manually shard the model across GPUs, set up CPU offloading, or something like that.
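Until sharding lands, one thing worth checking is whether the forward pass runs with autograd enabled: the snippet above does not disable gradient tracking, and the activations kept around for backprop can dominate memory for long inputs. A minimal sketch, reusing model, tokenizer, and device from the earlier snippet (the 2 kb test sequence is just an example):
# Run the embedding extraction without autograd bookkeeping
import torch

sequence = 'ACGT' * 500  # example ~2 kb input
input_ids = torch.tensor(
    tokenizer.tokenize(sequence),
    dtype=torch.int,
).to(device).unsqueeze(0)

with torch.inference_mode():
    embed, _ = model(input_ids)  # (batch, length, embed dim)

# Pool and move the small result off the GPU before doing anything else
seq_repr = embed.float().mean(dim=1).cpu()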
Hi, thanks for your amazing work!
How can I extract representations rather than logits from the model?
I am using the Hugging Face version, and I see the model returns 'logits' and 'past_key_values'. Could you please explain what's in 'past_key_values', and whether either of those can be used as a sequence representation? Or maybe you can suggest other ways to access the representations of the model?