Xiaoxiao0606 / BERT2DAb

A pre-trained model for antibody sequences

How to get embeddings from BERT2DAb? #1

Open trashTian opened 4 months ago

trashTian commented 4 months ago

Here is my code:

import torch
from transformers import BertTokenizer, BertModel

H_chain = 'QVQLVESGGGSVQAGGSLSLSCAASTYTDTVGWFRQAPGKEREGVAAIYRRTGYTYSADSVKGRFTLSQDNNKNTVYLQMNSLKPEDTGIYYCATGNSVRLASWEGYFYWGQGTQVTVSS'

# H_chain = 'QVQLLESGAELVKPGASVKLSCKASGYTFTSYWMHWVKQRPGRGLEWIGMIDPNSGGTKYNEKFKSKATLTVDKPSNTAYMQLSSLTSEDSAVYYCTRRDMDYWGAGTTVTVSSASTKGPSVFPLAPSSKSTSGGTAALGCLVKDYFPEPVTVSWNSGALTSGVHTFPAVLQSSGLYSLSSVVTVPSSSLGTQTYICNVNHKPSNTKVDKKIVPKS'

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer_H = BertTokenizer.from_pretrained("w139700701/BERT2DAb_H")
model_H = BertModel.from_pretrained("w139700701/BERT2DAb_H")
model_H.to(device)
model_H.eval()
encoded_input = tokenizer_H.encode_plus(
    H_chain,
    add_special_tokens=True,
    return_tensors="pt",
)
input_ids = encoded_input["input_ids"].to(device)
attention_mask = encoded_input["attention_mask"].to(device)

with torch.no_grad():
    outputs = model_H(input_ids, attention_mask=attention_mask)

embeddings = outputs.last_hidden_state
print(embeddings)

However, different input sequences yield the same embeddings:

tensor([[[ 0.0577,  0.0189, -0.0028,  ...,  0.0071,  0.0055,  0.0153],
         [ 0.0191, -0.0732, -0.0640,  ..., -0.0157,  0.0195, -0.0091],
         [ 0.0577,  0.0189, -0.0028,  ...,  0.0071,  0.0055,  0.0153]]],
       device='cuda:0')

How should I modify the code to obtain the correct embeddings?

Xiaoxiao0606 commented 4 months ago

Hello, thank you for your interest in our work! First, it is recommended to use a secondary structure annotation tool, such as proteinUnet, to split the sequence into segments. The segmented sequence should then be passed to BERT2DAb to generate its embedding representation.
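To illustrate the workflow described above, here is a minimal sketch. The `segments` list is purely hypothetical (real segment boundaries would come from a secondary-structure tool such as proteinUnet), and `mean_pool` is one common way to collapse token embeddings into a sequence embedding, not necessarily the authors' choice:

```python
import torch

def mean_pool(last_hidden_state, attention_mask):
    """Average token embeddings over real (non-padding) positions."""
    mask = attention_mask.unsqueeze(-1).float()
    return (last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

# Hypothetical segmentation of the heavy chain; the actual boundaries
# depend on the predicted secondary structure of the sequence.
segments = ["QVQLVESGGG", "SVQAGGSLSLSCAAS", "TYTDTVG"]
segmented_input = " ".join(segments)  # space-separated segments for the tokenizer

# The model calls below mirror the code in the question and are left
# commented out since they require downloading the pretrained weights:
# encoded = tokenizer_H(segmented_input, return_tensors="pt")
# with torch.no_grad():
#     out = model_H(**{k: v.to(device) for k, v in encoded.items()})
# seq_embedding = mean_pool(out.last_hidden_state.cpu(), encoded["attention_mask"])
```

The key difference from the code in the question is that the tokenizer receives space-separated segments rather than the raw residue string, so it no longer maps most of the input to unknown tokens.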

We appreciate your attention to these details!