Hi,

Thanks for making the model available. I have been playing with it and noticed that when I predict a DNA sequence, the last token is usually not the one in the original sequence: the prediction tends to have a few extra nucleotides at the end.

Am I missing something? Is this the expected behavior? Is there an expected nucleotide input length that avoids it?

Here is the snippet I am running (`tokenizer`, `model`, and `sequences` are set up beforehand):
```python
import torch
from torch.nn.functional import softmax

for dna in sequences:
    dna = dna[:128]
    inputs = tokenizer(dna, return_tensors='pt')["input_ids"]
    outputs = model(inputs)
    logits = outputs.logits

    # Apply softmax to convert logits to probabilities
    probabilities = softmax(logits, dim=-1)

    # Choose the most likely token for each position
    predicted_token_ids = torch.argmax(probabilities, dim=-1)
    print('original tokens', inputs)
    print('predicted tokens', predicted_token_ids)
    print()

    # Convert the token ids back to nucleotides (skipping the first, special token)
    predicted_sequences = [tokenizer.decode(token_ids) for token_ids in predicted_token_ids[:, 1:]]
    original = [tokenizer.decode(token_ids) for token_ids in inputs]
    print('Original ', dna)
    print('Predicted', ' '.join(predicted_sequences).replace(' ', ''))
    print()
```
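For reference, here is a quick check I also ran to see how the tail of the input gets tokenized and how long the decoded prediction ends up. It assumes the tokenizer splits the input into fixed-size k-mers (my guess, not something I have confirmed) and reuses the `tokenizer`, `model`, and `sequences` objects from the snippet above:

```python
# Quick check: how does the tail of the input tokenize, and how long is the prediction?
# Assumes `tokenizer`, `model`, and `sequences` are the same objects as in the snippet above.
dna = sequences[0][:128]

tokens = tokenizer.tokenize(dna)  # tokens without the special tokens
print('number of tokens:        ', len(tokens))
print('last three tokens:       ', tokens[-3:])
print('nucleotides in the tokens:', sum(len(t) for t in tokens))
print('nucleotides in the input: ', len(dna))

inputs = tokenizer(dna, return_tensors='pt')["input_ids"]
predicted_token_ids = torch.argmax(model(inputs).logits, dim=-1)
predicted = tokenizer.decode(predicted_token_ids[0, 1:]).replace(' ', '')
print('prediction length:        ', len(predicted))
```

If the tokenizer really does work in fixed-size chunks, the last token of the input could cover fewer nucleotides than the others while the predicted token at that position decodes to a full-size one, which might be where the length mismatch comes from, but I have not confirmed this.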