agemagician / ProtTrans

ProtTrans provides state-of-the-art pretrained language models for proteins. ProtTrans was trained on thousands of GPUs from Summit and hundreds of Google TPUs using Transformer models.

Tokenising multi-chain proteins #131

Closed: exs-hkenlay closed this issue 8 months ago

exs-hkenlay commented 8 months ago

Thank you for this work and for open-sourcing these models.

I have a question about how you pre-processed proteins with multiple chains when preparing training data. Given a protein with multiple chains, did you treat each chain as a separate input, or did you use a separator token (e.g. [SEP] in BERT and </s> in T5) to mark chain boundaries within a single input? The two options are sketched below.
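For concreteness, here is how I imagine the two options would look with the Hugging Face ProtT5 tokenizer (the model name is taken from the ProtTrans README; the chain sequences are made up):

```python
import re
from transformers import T5Tokenizer

# Illustrative two-chain protein; these sequences are made up.
chain_a = "MKTAYIAKQR"
chain_b = "GSHMLEDPVV"

tokenizer = T5Tokenizer.from_pretrained(
    "Rostlab/prot_t5_xl_half_uniref50-enc", do_lower_case=False
)

def prep(seq):
    # ProtT5 expects space-separated residues, with rare amino acids mapped to X.
    return " ".join(re.sub(r"[UZOB]", "X", seq))

# Option 1: each chain tokenised as a separate input.
separate = tokenizer([prep(chain_a), prep(chain_b)], add_special_tokens=True)

# Option 2: both chains on one line, joined by the T5 separator </s>.
joined = tokenizer(prep(chain_a) + " </s> " + prep(chain_b), add_special_tokens=True)
```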

mheinzinger commented 8 months ago

Sorry to have bad news for you: our models saw only single-chain proteins, as we simply took protein sequences from UniProt/UniRef and BFD (metagenomic data).
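To match what the models saw during training, you would therefore embed each chain as its own input. A minimal sketch, continuing the example above (untested; the chain sequences are still made up):

```python
import re
import torch
from transformers import T5Tokenizer, T5EncoderModel

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = T5Tokenizer.from_pretrained(
    "Rostlab/prot_t5_xl_half_uniref50-enc", do_lower_case=False
)
model = T5EncoderModel.from_pretrained(
    "Rostlab/prot_t5_xl_half_uniref50-enc"
).to(device).eval()

# Treat each chain as an independent single-chain sequence.
chains = ["MKTAYIAKQR", "GSHMLEDPVV"]  # made-up example chains
inputs = [" ".join(re.sub(r"[UZOB]", "X", c)) for c in chains]
batch = tokenizer(inputs, add_special_tokens=True, padding="longest",
                  return_tensors="pt").to(device)

with torch.no_grad():
    # Per-residue embeddings; one row of the batch per chain.
    embeddings = model(input_ids=batch.input_ids,
                       attention_mask=batch.attention_mask).last_hidden_state
```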

exs-hkenlay commented 8 months ago

Thanks for clarifying, @mheinzinger