Hi David,
Thank you for pointing that out.
I ran some tests, and the behavior may be caused by the propagation of numerical errors. With the model in torch.bfloat16 I was able to reproduce what you saw; when I tried again in torch.float32, I did not encounter the same discrepancy.
I haven't found any reference to this issue on the Mamba forum, so I'm not completely sure why it happens.
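As a rough, model-independent illustration of the kind of precision gap involved (plain PyTorch, only meant to show that bfloat16 and float32 results diverge slightly and that such differences can compound across layers):

import torch

torch.manual_seed(0)
a = torch.randn(256, 256)
b = torch.randn(256, 256)

# Same matrix product in float32 vs. bfloat16: the results differ
# slightly, and in a deep model these small differences compound.
ref = a @ b
low = (a.to(torch.bfloat16) @ b.to(torch.bfloat16)).to(torch.float32)
print((ref - low).abs().max())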
You can try loading the model using:
model = load_model(
    checkpoint,
    model_class=MambaLMHeadModelwithPosids,
    device=device,
    dtype=torch.float32,
    checkpoint_mixer=False,
).eval()
instead of torch.bfloat16 and see if it also works better for you.
I am evaluating the use of the last vector in the last hidden layer as an embedding for a given input sequence.
I noticed that if I pass multiple sequences in a batch, I get a different embedding than if I pass them in one at a time.
For example, passing two sequences through the model in a single batch returns a different embedding for the first sequence than running that same sequence through on its own, as in the sketch below.
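A minimal sketch of the comparison, assuming same-length sequences (the encode helper and the hidden_states attribute are illustrative placeholders, not the repo's exact API):

import torch

# Hypothetical helpers: `encode` turns sequences into input ids and the
# model call exposes per-token hidden states; the real API may differ.
seqs = ["SEQ_A", "SEQ_B"]

with torch.no_grad():
    # Batched forward pass over both sequences
    batched = model(encode(seqs)).hidden_states[-1]     # (2, L, d)
    emb_batched = batched[0, -1]                         # first sequence, last position

    # Forward pass over the first sequence alone
    single = model(encode(seqs[:1])).hidden_states[-1]   # (1, L, d)
    emb_single = single[0, -1]

# In bfloat16 these can differ noticeably; in float32 they should agree
# up to small numerical noise.
print(torch.allclose(emb_batched, emb_single, atol=1e-5))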