Rostlab / SeqVec

Modelling the Language of Life - Deep Learning Protein Sequences
http://embed.protein.properties
MIT License

Randomness for a protein sequence embedding #24

Closed pzhang84 closed 2 years ago

pzhang84 commented 2 years ago

Awesome pre-trained models! I am using the provided pre-trained model for protein sequence embedding, and it seems that the model produces different embeddings each time for the same protein sequence (the contents of the embeddings are similar, though). Does that mean the model is still training even during the embedding process? I wonder if you could share any insight into that? Thanks!

konstin commented 2 years ago

The initial cell and hidden state are randomly initialized, so the first few embeddings are somewhat random until the state converges. You can avoid this by running some warmup sequences first (see e.g. https://github.com/sacdallago/bio_embeddings/blob/a9cb5eb90dd13814fe59ef9aeef797be0b99b4e6/bio_embeddings/embed/seqvec_embedder.py#L72-L76).
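To illustrate the effect, here is a minimal, self-contained sketch (not the SeqVec/ELMo model itself; `ToyStatefulEmbedder` and `warmup` are hypothetical names): a toy recurrent embedder whose hidden state persists across calls and starts out random, so two fresh instances first disagree on the same sequence, but after repeating a warmup sequence their states converge and they produce matching embeddings.

```python
import numpy as np

class ToyStatefulEmbedder:
    """Toy stateful recurrent embedder (illustrative only, not SeqVec).

    The hidden state carries over between embed() calls, mimicking a
    stateful LSTM whose initial state is randomly initialized.
    """

    def __init__(self, dim=8, weight_seed=0, state_seed=None):
        # Shared, fixed "trained" weights; small scale keeps the
        # recurrence contractive so repeated inputs converge.
        wrng = np.random.default_rng(weight_seed)
        self.W = 0.1 * wrng.standard_normal((dim, dim))
        # Random initial hidden state -> early embeddings differ per run.
        srng = np.random.default_rng(state_seed)
        self.h = srng.standard_normal(dim)

    def embed(self, seq):
        """Run the sequence through the recurrence; return final state."""
        for ch in seq:
            x = np.full(self.h.shape, (ord(ch) % 7) / 7.0)
            self.h = np.tanh(self.W @ self.h + x)
        return self.h.copy()

def warmup(embedder, seq="MKTAYIAKQR", rounds=20, tol=1e-8):
    """Repeat a warmup sequence until consecutive embeddings stabilize,
    analogous in spirit to the warmup loop linked above."""
    prev = embedder.embed(seq)
    cur = prev
    for _ in range(rounds):
        cur = embedder.embed(seq)
        if np.linalg.norm(cur - prev) < tol:
            break
        prev = cur
    return cur

# Two instances with different random initial states: their first
# embeddings of the same sequence differ, but after warming both up,
# embeddings of a new sequence agree to high precision.
e1 = ToyStatefulEmbedder(state_seed=1)
e2 = ToyStatefulEmbedder(state_seed=2)
first_diff = np.linalg.norm(e1.embed("MKT") - e2.embed("MKT"))
warmup(e1)
warmup(e2)
after_diff = np.linalg.norm(e1.embed("ACDEFGHIK") - e2.embed("ACDEFGHIK"))
print(first_diff > after_diff)
```

In the real embedder the state eventually converges on its own, which is why the question above observed embeddings that were "similar but not identical"; warming up just fast-forwards past that transient.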