facebookresearch / esm

Evolutionary Scale Modeling (esm): Pretrained language models for proteins
MIT License

Is the truncation code in extract.py reasonable? #157

Closed diamondgloves closed 2 years ago

diamondgloves commented 2 years ago

Discussed in https://github.com/facebookresearch/esm/discussions/156

Originally posted by **diamondgloves** January 13, 2022

When I use the ESM-1b model to extract representations of sequences with lengths over 1024, the tokens of a sequence are supposed to be 'bos + seq + eos', with length seq_len + 2. According to the code in extract.py, toks becomes 'bos + seq[:1021]' without an eos before being fed to the model; could this be a reasonable input?

```
if args.truncate:
    toks = toks[:, :1022]
out = model(toks, repr_layers=repr_layers, return_contacts=return_contacts)
```

Furthermore, in ContactPredictionHead the contact map is cropped to size 1020*1020 with the aim of removing the bos and eos positions, but there is no eos in the toks. Should the input toks to the model instead be 'bos + seq[:1022] + eos', with length 1024 after truncating?
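A minimal sketch of the problem being described (the token ids here are illustrative, not ESM's real alphabet): slicing the already-tokenized row keeps bos but silently drops eos.

```python
BOS, EOS = 0, 2  # illustrative token ids, not ESM's real alphabet

# An ESM-1b input is tokenized as bos + seq + eos (length seq_len + 2).
seq_len = 1500
toks = [BOS] + list(range(4, 4 + seq_len)) + [EOS]   # length 1502

# extract.py's truncation slices the *token* row to 1022 entries,
# which keeps bos but drops the trailing eos entirely:
truncated = toks[:1022]
print(len(truncated))    # 1022
print(EOS in truncated)  # False: the eos token is gone
```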
tomsercu commented 2 years ago

Thanks for flagging. You're right, the truncation logic is incorrect. As for a fix: inserting bos and eos at the first and last positions would also be problematic, since shorter sequences in the batch could end up looking like `bos [seq] <eos> pad pad pad eos`. In fact, the best fix would be to add a max_seq_len argument to BatchConverter initialization and truncate in `__call__`.
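The proposed fix could be sketched as follows. This is a simplified toy, not the real `BatchConverter` from esm/data.py: the tokenization (ord of each character) and the ids are placeholders, but it shows the key point that truncating the raw sequence *before* appending bos/eos keeps every row well-formed.

```python
BOS, EOS, PAD = 0, 2, 1  # illustrative token ids, not ESM's real alphabet

class BatchConverter:
    """Toy sketch of the proposed fix: truncate raw sequences in __call__,
    then append bos/eos, so truncated sequences still end with one eos."""

    def __init__(self, max_seq_len=None):
        self.max_seq_len = max_seq_len  # the proposed new argument

    def __call__(self, raw_batch):
        labels, seqs = zip(*raw_batch)
        if self.max_seq_len is not None:
            # Truncate residues first, then add bos/eos, so no row can
            # look like 'bos [seq] eos pad pad pad eos'.
            seqs = [s[: self.max_seq_len] for s in seqs]
        max_len = max(len(s) for s in seqs)
        toks = []
        for s in seqs:
            ids = [BOS] + [ord(c) for c in s] + [EOS]  # placeholder tokenization
            ids += [PAD] * (max_len - len(s))          # padding goes after eos
            toks.append(ids)
        return list(labels), toks

bc = BatchConverter(max_seq_len=5)
labels, toks = bc([("a", "MKTAYIAK"), ("b", "MK")])
# every row ends its sequence with exactly one eos, then padding
```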

Alexperiments commented 2 years ago

Do you still need help on this issue?

tomsercu commented 2 years ago

Yes, sure, we'd welcome an improvement to this script.

AhmedIssa11 commented 2 years ago

@tomsercu Can I do it as my first contribution? Also, could you please explain how to add a max_seq_len argument to BatchConverter initialization and truncate in `__call__`?