facebookresearch / esm

Evolutionary Scale Modeling (esm): Pretrained language models for proteins
MIT License

Any suggestions for extracting embeddings for sequences with > 1024 residues? #21

Closed · ptkim1 closed this issue 3 years ago

ptkim1 commented 3 years ago

Could I split the sequence into 1024-residue chunks, run each separately (with the BOS and EOS tokens only in the first and last chunks), concatenate the resulting embeddings, and then take the average?

Since the model was trained on random crops of sequences longer than 1024 residues, it seems like this should work, but I want to make sure.

Also, a warning that the sequence is too long would be helpful; right now, trying to embed a sequence longer than 1024 residues on GPU fails with the unhelpful "device-side assert triggered" CUDA runtime error.
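
As an illustration of such a guard (not part of the library), a small check that fails fast with a readable message; the 1022-residue figure assumes ESM-1b's 1024 positions minus BOS and EOS:

```python
def check_length(sequence: str, max_residues: int = 1022) -> None:
    """Fail fast with a clear error instead of a device-side assert.

    max_residues=1022 assumes ESM-1b's 1024 positions minus BOS and EOS;
    adjust for other checkpoints.
    """
    if len(sequence) > max_residues:
        raise ValueError(
            f"Sequence has {len(sequence)} residues but the model accepts at most "
            f"{max_residues}; crop or split it before extracting embeddings."
        )
```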

joshim5 commented 3 years ago

Yes, during training we cropped sequences longer than 1024 residues, so taking crops is a sensible choice. We haven't experimented with concatenating or averaging the resulting embeddings, but there are a number of things you could try. For example, assuming the sequence is of length 3072, you could:
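
One possibility along these lines, sketched below under assumptions not stated in the thread (the esm1b_t33_650M_UR50S checkpoint, non-overlapping 1022-residue crops, and mean pooling of the concatenated per-residue representations):

```python
import torch
import esm

# Load a pretrained model (ESM-1b here; an assumption, other checkpoints work the same way).
model, alphabet = esm.pretrained.esm1b_t33_650M_UR50S()
batch_converter = alphabet.get_batch_converter()
model.eval()

MAX_RES = 1022  # 1024 positions minus BOS and EOS (assumption for ESM-1b)

def embed_crop(seq: str) -> torch.Tensor:
    """Per-residue representations for one crop of at most MAX_RES residues."""
    _, _, tokens = batch_converter([("crop", seq)])
    with torch.no_grad():
        out = model(tokens, repr_layers=[33])
    # Strip the BOS token at position 0 and the EOS token at the end.
    return out["representations"][33][0, 1 : len(seq) + 1]

def embed_long(seq: str):
    """Split a long sequence into non-overlapping crops, embed each, and pool."""
    crops = [seq[i : i + MAX_RES] for i in range(0, len(seq), MAX_RES)]
    per_residue = torch.cat([embed_crop(c) for c in crops], dim=0)  # (len(seq), 1280)
    return per_residue, per_residue.mean(dim=0)  # per-residue reps and a mean-pooled vector

# A 3072-residue sequence becomes three full crops plus a short remainder.
per_residue, sequence_embedding = embed_long("M" * 3072)
```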

tomsercu commented 3 years ago

Because I'm referencing this issue in the GitHub Discussions, let me add another option to Josh's list: if you know the domain boundaries (or can predict them), splitting the protein sequence at those boundaries would be a good approach, potentially again averaging the embeddings over a strided window of consecutive domains (see the sketch below).
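
A minimal sketch of the strided-window averaging step over residue positions (domain-boundary splitting itself is not shown). It assumes a hypothetical `embed_window` callable that returns per-residue representations of shape `(len(chunk), dim)`; the window, stride, and dim defaults are illustrative, not values from this thread:

```python
import torch

def average_strided_windows(embed_window, seq: str, window: int = 1022,
                            stride: int = 511, dim: int = 1280) -> torch.Tensor:
    """Average per-residue representations over overlapping strided windows.

    embed_window(chunk) is a hypothetical callable returning a (len(chunk), dim)
    tensor; window, stride, and dim are illustrative defaults.
    """
    total = torch.zeros(len(seq), dim)
    counts = torch.zeros(len(seq), 1)
    starts = list(range(0, max(len(seq) - window, 0) + 1, stride))
    if starts[-1] + window < len(seq):  # make sure the tail of the sequence is covered
        starts.append(len(seq) - window)
    for s in starts:
        chunk = seq[s : s + window]
        total[s : s + len(chunk)] += embed_window(chunk)
        counts[s : s + len(chunk)] += 1
    return total / counts  # residues in overlapping regions are averaged
```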

aliencaocao commented 1 year ago

So is there no need to add BOS and EOS to each chunk?