Closed konstin closed 3 years ago
Hi @konstin,
I'd say it's because we add a "start" amino acid (aa) when integer-encoding the protein sequence before repping. (Actually, true to the original implementation, a "start" and a "stop" aa get added to either end of the sequence, but when using `get_reps` the "stop" aa gets removed again.)
The exact function in jax-unirep where this happens is `get_embedding` in `utils.py`. In the original implementation it's in `get_rep` in `unirep.py`.
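To make the off-by-one concrete, here is a minimal toy sketch in plain Python. It does not call jax-unirep: the token names, `encode`, and `toy_hidden_states` are hypothetical stand-ins, and dropping the "stop" token before the recurrent pass is an assumption about ordering made only for illustration. The point is just that one state per input token, with a "start" token prepended, gives `len(seq) + 1` states.

```python
# Toy illustration only; none of these names come from jax-unirep.
START, STOP = "<start>", "<stop>"  # hypothetical marker tokens


def encode(seq):
    """Wrap the residues in start/stop markers, as described above."""
    return [START] + list(seq) + [STOP]


def toy_hidden_states(tokens):
    """Stand-in for the mLSTM: emits exactly one 'state' per input token."""
    states = []
    running = 0
    for tok in tokens:
        running += len(tok)  # arbitrary update; only the count matters here
        states.append(running)
    return states


seq = "MKV"
tokens = encode(seq)[:-1]            # assumed: the "stop" token is dropped again
states = toy_hidden_states(tokens)   # one state per remaining token
per_residue = states[1:]             # one possible alignment: skip the start-token state

print(len(seq), len(states), len(per_residue))  # prints: 3 4 3
```

Whether skipping the first state (rather than the last) is the right alignment for a given downstream task depends on how the hidden states are meant to be interpreted, so treat the slicing above as one option rather than the recommended one.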
Thanks!
I'm trying to use the per-residue embeddings of unirep and have observed that the hidden state sequence is one element longer than the number of amino acids in the sequence. Judging from the `test_mLSTM1900` test, this is intended. Could you tell me why there is one extra hidden state?

For reference, this is the minimized code I'm using (full source):