Closed mengqvist closed 3 years ago
Hi @mengqvist,
Good catch. `embed_matrix:0.npy` contains the vectors for the initial 10-dimensional embedding of each amino acid (the embedded sequences then get passed to the mLSTM).
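To make the role of that file concrete, here is a minimal sketch of the embedding lookup: each token index selects one 10-dimensional row, and the resulting array is what the mLSTM consumes. A random stand-in matrix is used so the example is self-contained (the 26-token vocabulary size is an assumption, not taken from the weight file):

```python
import numpy as np

# Stand-in for the matrix stored in "embed_matrix:0.npy":
# one 10-dimensional row per token in the (assumed) 26-token vocabulary.
rng = np.random.default_rng(0)
embed_matrix = rng.normal(size=(26, 10))

# Hypothetical integer-encoded amino acid sequence (token indices, not real data).
seq_idx = np.array([3, 7, 1, 12])

# The embedding is a plain row lookup; this array is what gets passed to the mLSTM.
embedded = embed_matrix[seq_idx]
print(embedded.shape)  # (4, 10)
```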
In the original UniRep implementation, this embedding matrix gets randomly initialised, and then learned together with the mLSTM weights during training.
So far, we have not implemented this embedding layer in jax-unirep, so we always use the embedding matrix from the original publication, which was learned on the UniRef50 dataset. It has been in the back of my head for a while now to "complete" the re-implementation by also implementing the embedding layer, so that custom embeddings can be learned during evotuning (or during a complete re-training of the model).
For now, I think it's a good idea to dump the embedding matrix together with the rest, to make sure dumped weights can be used by both libraries. Let me know if you'd like to submit a small PR yourself to change this behaviour.
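As a hedged sketch of what "dumping the embedding matrix together with the rest" could look like: one `.npy` file per parameter, embedding matrix included. The parameter names and on-disk layout here are assumptions for illustration, not jax-unirep's actual `dump_params()` internals:

```python
import tempfile
from pathlib import Path

import numpy as np

def dump_params_with_embedding(params, folder):
    """Write every parameter, including the embedding matrix, as a .npy file."""
    out = Path(folder)
    out.mkdir(parents=True, exist_ok=True)
    for name, array in params.items():
        np.save(out / f"{name}.npy", np.asarray(array))

# Toy parameters standing in for real evotuned weights.
params = {
    "embed_matrix:0": np.zeros((26, 10)),  # 10-dim embedding, as discussed above
    "wmx:0": np.zeros((10, 1900)),         # hypothetical mLSTM weight name
}
folder = tempfile.mkdtemp()
dump_params_with_embedding(params, folder)
print(sorted(p.name for p in Path(folder).iterdir()))
```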
I noted that in the original UniRep article, their evotuned GFP weights contain an `embed_matrix:0.npy` file, whereas the weights dumped by your implementation using `dump_params()` do not. Is this a feature or a bug? It seems to me that it breaks compatibility between the two libraries. Since this file seems to be static, it is simple enough to copy it over from the original paper's weights into the folder with the evotuned weights, but I would find it desirable to have it there by default.
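The copy-over workaround described above can be sketched as follows; temporary directories stand in for the real weight folders, and the dummy file stands in for the actual embedding matrix:

```python
import shutil
import tempfile
from pathlib import Path

# Stand-ins for the real folders: the published UniRep weights and the
# folder of evotuned weights written by dump_params().
published_dir = Path(tempfile.mkdtemp())
evotuned_dir = Path(tempfile.mkdtemp())

# Dummy file standing in for the real embedding matrix.
(published_dir / "embed_matrix:0.npy").write_bytes(b"placeholder")

target = evotuned_dir / "embed_matrix:0.npy"
if not target.exists():
    # Copy only if the dump did not already include the embedding matrix.
    shutil.copy(published_dir / "embed_matrix:0.npy", target)
print(target.exists())  # True
```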