facebookresearch / esm

Evolutionary Scale Modeling (esm): Pretrained language models for proteins
MIT License

extract.py truncation_seq_length error #302

Closed: eric-tc-wong closed this issue 1 year ago

eric-tc-wong commented 1 year ago

Bug description: In extract.py, representation tensors are sliced using "len(strs[i])", which appears to be the length of the full sequence rather than the truncated sequence. As a result, the output representation has shape (1023, embedding), which includes the EOS token, whereas the expected shape is (1022, embedding).

Reproduction steps: Run extract.py on a sequence longer than 1022 residues with a truncation length of 1022, then check output['representation'].shape.

Expected behavior: (1022, E)
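
The slicing behavior described above can be illustrated with a small stand-alone sketch. This is not the code from extract.py: it uses a plain Python list in place of the representation tensor, and the names `token_reprs`, `full_seq_len`, `EMBED_DIM`, and `shape` are hypothetical. It also assumes that the truncated input holds a BOS token, 1022 residues, and an EOS token (1024 positions), which is consistent with the shapes reported here but is not confirmed by them.

```python
# Toy illustration only; a list of rows stands in for one sequence's
# per-token representation tensor. All names here are hypothetical.
TRUNCATION_SEQ_LENGTH = 1022  # truncation length from the report
EMBED_DIM = 4                 # small placeholder for the embedding size E
full_seq_len = 1500           # assumed length of the untruncated sequence

# Assumption: after truncation the model sees BOS + 1022 residues + EOS.
num_positions = TRUNCATION_SEQ_LENGTH + 2
token_reprs = [[float(pos)] * EMBED_DIM for pos in range(num_positions)]


def shape(rows):
    """Return (number of rows, row width) for a list of equal-length rows."""
    return (len(rows), len(rows[0]) if rows else 0)


# Slicing by the full sequence length: the slice end overshoots the
# available positions, gets clamped, and so keeps the trailing EOS row.
by_full_length = token_reprs[1 : full_seq_len + 1]

# Slicing by the truncated length: stops before the EOS row.
truncated_len = min(TRUNCATION_SEQ_LENGTH, full_seq_len)
by_truncated_length = token_reprs[1 : truncated_len + 1]

print(shape(by_full_length))       # one row more than expected
print(shape(by_truncated_length))  # matches the expected row count
```

Capping the slice length with `min(...)` is one possible way to get the expected shape; whether the submitted fix does the same is not stated in this thread.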

This problem was probably introduced in the last commit (https://github.com/facebookresearch/esm/pull/278).

Thanks

tomsercu commented 1 year ago

thanks for identifying! do you want to submit a fix?

eric-tc-wong commented 1 year ago

I just submitted a pull request. Thanks