[Preprocessing] One-hot encoding sequences

ayushkarnawat / profit

Exploring evolutionary protein fitness landscapes

MIT License


[Preprocessing] One-hot encoding sequences #88

Closed ayushkarnawat closed 4 years ago

ayushkarnawat commented 4 years ago

Should there be a one-hot encoding scheme for the AA sequence? Currently, we use an integer encoding, which may work, but additional testing is needed to determine which approach is likely to generalize better.
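To make the two options concrete, here is a minimal sketch in plain Python. The 20-letter `AMINO_ACIDS` vocabulary and the helper names `integer_encode` and `one_hot_encode` are assumptions for illustration only and are not necessarily what profit uses.

```python
# Hypothetical vocabulary: one-letter codes of the 20 standard amino acids.
# The vocabulary actually used in the repo may differ (e.g. padding tokens).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_TO_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def integer_encode(seq):
    """Map each residue to its index in the vocabulary."""
    return [AA_TO_INDEX[aa] for aa in seq]


def one_hot_encode(seq):
    """Map each residue to a 0/1 vector with a single 1 at its index."""
    vocab_size = len(AMINO_ACIDS)
    return [
        [1 if i == idx else 0 for i in range(vocab_size)]
        for idx in integer_encode(seq)
    ]


seq = "ACD"
ints = integer_encode(seq)      # one integer per residue
onehot = one_hot_encode(seq)    # one length-20 vector per residue
```

If the integers are fed to a model directly as numeric features, they imply an ordering between residues that has no biological meaning; the one-hot form avoids this at the cost of a larger input.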

ayushkarnawat commented 4 years ago

This can easily be done just in time (right before passing the batch to the model) using the following:

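import torch

# data: integer (int64) tensor with data.size() = (N, L),
# where N=batch_size and L=length of AA sequence
batch_size, seqlen = data.size()
# Allocate on the same device as data so scatter_ works on GPU batches too
onehot = torch.zeros(batch_size, seqlen, vocab_size, device=data.device)
# Write a 1 at each residue's vocabulary index along the last dimension
onehot.scatter_(2, torch.unsqueeze(data, 2), 1)
# onehot.size() = (N, L, v), where v=vocab_size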
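
For reference, a self-contained sketch of the same idea on a toy batch, compared against PyTorch's built-in `torch.nn.functional.one_hot`. The `vocab_size` value and the tensor contents are made up for illustration; this is not code from the repo.

```python
import torch
import torch.nn.functional as F

vocab_size = 5  # hypothetical vocabulary size, for illustration only

# Toy batch of integer-encoded sequences, shape (N=2, L=3).
# scatter_ and one_hot both require int64 indices.
data = torch.tensor([[0, 2, 4], [1, 1, 3]], dtype=torch.long)

# scatter_-based encoding, following the snippet above
onehot_scatter = torch.zeros(*data.size(), vocab_size)
onehot_scatter.scatter_(2, data.unsqueeze(2), 1)

# Built-in alternative; it returns int64, so cast to float to match
onehot_builtin = F.one_hot(data, num_classes=vocab_size).float()

# Both approaches should give the same (N, L, v) tensor
assert onehot_scatter.shape == (2, 3, vocab_size)
assert torch.equal(onehot_scatter, onehot_builtin)
```

Taking `argmax` over the last dimension recovers the original integer encoding, which is a quick sanity check when wiring this into a data pipeline.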