facebookresearch / esm

Evolutionary Scale Modeling (esm): Pretrained language models for proteins

should stop token be in sequence representation? #29

Closed mboedigh closed 3 years ago

mboedigh commented 3 years ago

The example code in the Quick Start section of the GitHub README shows this excerpt:

sequence_representations = []
for i, (_, seq) in enumerate(data):
    sequence_representations.append(token_representations[i, 1 : len(seq) + 1].mean(0))

The sequence_representations will then include the last position of token_representations, which appears to be the stop token. Is this intended?
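For context, token_representations in that excerpt comes from the readme's preceding lines, roughly like this (quoting from memory, so the exact model name and layer number may differ from whichever version of the readme you're reading):

import torch
import esm

# Load a pretrained model and its alphabet (model name per the readme;
# substitute whichever checkpoint you are actually using).
model, alphabet = esm.pretrained.esm1b_t33_650M_UR50S()
batch_converter = alphabet.get_batch_converter()

# data is a list of (label, sequence) pairs.
data = [("protein1", "MKTVRQ"), ("protein2", "KALTARQ")]
batch_labels, batch_strs, batch_tokens = batch_converter(data)

# batch_tokens is padded to the longest sequence plus the begin/stop
# tokens; for each sequence i, positions 1 .. len(seq_i) hold the residues.
with torch.no_grad():
    results = model(batch_tokens, repr_layers=[33])
token_representations = results["representations"][33]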

joshim5 commented 3 years ago

Hi @mboedigh, the stop token is not included in the sequence representation. We take token representations until len(seq)+1 (and not +2) for that reason.
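To make the indexing concrete, here's a quick shape check (the sizes below are made up for illustration; only the slicing mirrors the readme code):

import torch

x, dim = 4, 8                                       # toy sequence length, embedding size
token_representations = torch.randn(1, x + 2, dim)  # begin token + x residues + stop token

# The slice 1 : x + 1 covers exactly the x residue positions. Index 0
# (begin) and index x + 1 (stop) are both excluded, since Python slice
# end indices are exclusive.
residues = token_representations[0, 1 : x + 1]
assert residues.shape == (x, dim)

sequence_representation = residues.mean(0)
assert sequence_representation.shape == (dim,)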

mboedigh commented 3 years ago

Sorry, there's something I still don't understand. If tokens are [0,5,5,2] for sequence = 'AA', where 0 and 2 are the begin and end tokens, then tokens[len(seq)+1] will index the 2. This is the stop token, right? And tokens[len(seq)+2] is out of bounds.

joshim5 commented 3 years ago

In the code example you posted, each seq corresponds to the original sequence (length x), but token_representations[i].size(0) == x + 2 because of the begin and stop tokens.

Does this code example help to clear things up?

>>> begin, end = 0, 2
>>> seq = [5,5]
>>> tokens = [begin] + seq + [end]
>>> tokens[1 : len(seq) + 1]
[5, 5]
mboedigh commented 3 years ago

Yes, thanks! I guess Python's 1:x slicing is not like some other languages: the end index is exclusive. I assumed too much, but I was also somehow getting 'out of bounds' errors in my own tests before I posted. Thanks again.
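In case it helps anyone else, here's the indexing-vs-slicing distinction that tripped me up, using the same toy tokens as above:

seq = "AA"
tokens = [0, 5, 5, 2]  # begin, 'A', 'A', stop

# Slicing: the end index is exclusive, so the stop token is NOT included.
assert tokens[1 : len(seq) + 1] == [5, 5]

# Direct indexing: len(seq) + 1 is the last valid index, and it IS the stop token.
assert tokens[len(seq) + 1] == 2

# Indexing one further raises IndexError...
# tokens[len(seq) + 2]
# ...whereas slicing past the end is silently clamped (and here it
# would pull in the stop token):
assert tokens[1 : len(seq) + 2] == [5, 5, 2]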