ElArkk / jax-unirep

Reimplementation of the UniRep protein featurization model.
GNU General Public License v3.0

embed_matrix:0.npy in evotuned dumped weights #83

Closed mengqvist closed 3 years ago

mengqvist commented 4 years ago

I noted that the evotuned GFP weights from the original UniRep article contain an "embed_matrix:0.npy" file, whereas the weights dumped by your implementation using dump_params() do not. Is this a feature or a bug? It seems to me that it breaks compatibility between the two libraries.

Since this file seems to be static, it is simple enough to copy it over from the original publication into the folder with the evotuned weights, but I would find it desirable to have it there by default.

ElArkk commented 4 years ago

Hi @mengqvist ,

Good catch. embed_matrix:0.npy contains the vectors for the initial 10-dimensional embedding of each amino acid (the embedded sequences are then passed to the mLSTM).
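To illustrate the idea, here is a minimal sketch of such an embedding lookup. The vocabulary, the random initialisation, and the function name are assumptions for demonstration only; the real embed_matrix:0.npy holds weights learned on UniRef50, not random values:

```python
import numpy as np

# Assumed vocabulary: the 20 standard amino acids (the real UniRep
# vocabulary also includes special tokens).
AA_VOCAB = "ACDEFGHIKLMNPQRSTVWY"
aa_to_idx = {aa: i for i, aa in enumerate(AA_VOCAB)}

rng = np.random.default_rng(0)
# One 10-dimensional vector per token, analogous in shape to the
# learned embedding matrix stored in embed_matrix:0.npy.
embed_matrix = rng.normal(size=(len(AA_VOCAB), 10))

def embed_sequence(seq: str) -> np.ndarray:
    """Look up the 10-dim embedding for each residue in the sequence."""
    idxs = [aa_to_idx[aa] for aa in seq]
    return embed_matrix[idxs]  # shape: (len(seq), 10)

embedded = embed_sequence("MKTAY")
print(embedded.shape)  # (5, 10)
```

Each row of the result is then consumed by the mLSTM as the per-residue input vector.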

In the original UniRep implementation, this embedding matrix gets randomly initialised, and then learned together with the mLSTM weights during training.

So far, we have not implemented this embedding layer in jax-unirep, and so we always use the embedding matrix from the original publication, which was learned on the UniRef50 dataset. It has been in the back of my mind for a while now to "complete" the re-implementation by also implementing the embedding layer, so that custom embeddings can be generated during evotuning (or during a complete re-training of the model).

For now, I think it's a good idea to dump the embedding matrix together with the rest, to make sure dumped weights can be used by both libraries. Let me know if you'd like to submit a small PR yourself to change this behaviour.
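A rough sketch of what dumping the embedding matrix alongside the other weights could look like. Note that dump_params_with_embedding and the params layout here are assumptions for illustration, not jax-unirep's actual API; only the "embed_matrix:0.npy" filename comes from the original UniRep weight dumps:

```python
import os
import numpy as np

def dump_params_with_embedding(params, embed_matrix, out_dir):
    """Hypothetical helper: save each weight array plus the embedding
    matrix, so the dumped folder matches the original UniRep layout."""
    os.makedirs(out_dir, exist_ok=True)
    for name, arr in params.items():
        np.save(os.path.join(out_dir, f"{name}.npy"), arr)
    # Use the same filename the original implementation ships with,
    # so the dumped weights stay compatible with both libraries.
    np.save(os.path.join(out_dir, "embed_matrix:0.npy"), embed_matrix)
```

Since the matrix is currently static, this amounts to copying the same array into every dump, but it keeps the folder layout interchangeable with the original weights.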