Bug description
In extract.py, representation tensors are sliced using "len(strs[i])", which appears to be the length of the full sequence rather than the truncated sequence.
As a result, the output representation has shape (1023, embedding), which includes the EOS token, whereas we expect (1022, embedding).
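The off-by-one can be illustrated without running the model. This is a minimal sketch, assuming the token layout is one BOS token, then the truncated residues, then one EOS token, and using a plain list in place of the representation tensor. The names `full_len`, `trunc_len`, and `rows` are illustrative and not taken from extract.py.

```python
# Stand-in for one sequence's per-token representation (no model needed).
full_len = 1500    # length of the original, untruncated sequence (assumed value)
trunc_len = 1022   # truncation length passed to the script

# Assumed layout after truncation: [BOS] + trunc_len residues + [EOS]
rows = ["bos"] + ["res"] * trunc_len + ["eos"]

# Slice bounded by the full sequence length, mirroring len(strs[i])
sliced = rows[1 : full_len + 1]

print(len(sliced))   # 1023
print(sliced[-1])    # eos
```

Because the upper bound exceeds the number of rows, the slice runs to the end and picks up the EOS row.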
Reproduction steps
Run extract.py on a sequence longer than 1022 with a truncation length of 1022, then check:
output['representation'].shape
Expected behavior
The shape should be (1022, E), where E is the embedding dimension.
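One possible fix, sketched under the assumption that the truncation length is available at the point where the slice is taken, is to clamp the slice bound to the shorter of the two lengths. The helper `slice_residues` and its parameters are hypothetical and not part of extract.py. A plain list again stands in for the tensor.

```python
def slice_residues(rows, seq_len, truncation_len):
    """Return only the residue rows, dropping BOS and EOS.

    rows: per-token rows laid out as [BOS] + residues + [EOS] (assumed layout).
    seq_len: length of the original sequence, before truncation.
    truncation_len: maximum number of residues kept.
    """
    keep = min(seq_len, truncation_len)  # number of residues actually present
    return rows[1 : keep + 1]


# Long sequence: truncated to 1022 residues.
long_rows = ["bos"] + ["res"] * 1022 + ["eos"]
print(len(slice_residues(long_rows, 1500, 1022)))   # 1022

# Short sequence: the clamp leaves it unchanged.
short_rows = ["bos"] + ["res"] * 10 + ["eos"]
print(len(slice_residues(short_rows, 10, 1022)))    # 10
```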
This problem was probably introduced in the last commit (https://github.com/facebookresearch/esm/pull/278).
Thanks