agemagician / ProtTrans

ProtTrans provides state-of-the-art pretrained language models for proteins. ProtTrans was trained on thousands of GPUs from Summit and hundreds of Google TPUs using Transformer models.

Feature order #149

Closed · abelavit closed 4 months ago

abelavit commented 7 months ago

Hello,

I am curious about the order of the features we get from the embeddings of the pre-trained transformer model. If we get features F1, F2, ..., F1024 (dimension 1x1024) from ProtT5 for each amino acid residue and we change the feature order, e.g. F24, F439, ..., F304 (still dimension 1x1024), will it result in a loss of information? If the order is important, would models like LSTMs be more suitable for building a prediction model than algorithms like Random Forests, which do not look at feature order?

Thank you.

mheinzinger commented 6 months ago

Nope, the order of features does not matter. You should be able to extract embeddings for a dataset, shuffle the dimensions of the embeddings (consistently across proteins, e.g. if F1 is moved to position F512, the same mapping must be applied to every protein), train a predictor, and get identical performance (or near-identical, depending on how you handle RNG in dataset sampling, weight initialization, etc.).
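To illustrate (a minimal sketch with stand-in NumPy arrays, not the actual ProtT5 extraction code): applying one fixed permutation to the feature axis of every embedding only reorders the columns, and since the permutation is invertible, no information is lost.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Stand-in for per-residue ProtT5 embeddings: 100 residues x 1024 features.
embeddings = rng.standard_normal((100, 1024))

# Draw a single permutation of the 1024 feature indices once...
perm = rng.permutation(1024)

# ...and apply the SAME permutation to every row, so the shuffle is
# consistent across residues/proteins. A predictor trained on `shuffled`
# should match one trained on `embeddings` (up to RNG effects), because
# only the column order changes.
shuffled = embeddings[:, perm]

# The permutation is invertible, so the original order can be recovered.
assert np.array_equal(shuffled[:, np.argsort(perm)], embeddings)
```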

abelavit commented 6 months ago

Thank you so much.