Open zyxue opened 6 years ago
Thank you for reaching out!
The boundaries are indeed a little artificial. However, many residues sit far from their supposed group for a reason. Take the histidine you mention: compared with the other charged AAs, it is in fact quite different.
Histidine's pKa is easily perturbed by its surroundings, e.g. by neighbouring residues in an enzyme active site. This makes it functionally and chemically versatile: it can behave as a polar/charged amino acid as well as a hydrophobic residue.
That might be why we see H closer to the uncharged residues. It is also why a representation in continuous space could be useful.
For your technical questions,
model = gensim.models.Word2Vec(sentences, size=1500, window=10, workers=8, sg=1)
Hope this answers your concerns.
Thanks for your reply!
I am confused by the size parameter. In your code, the comment says
# size: layer of neural net/ dimension, we set it to 20 because we only have 20 voca
but you call it with 1500
model = gensim.models.Word2Vec(sentences, size=1500, window=10, workers=8, sg=1)
I understand size as the length of the learned vector per amino acid, so it shouldn't be larger than 20 since there are only 20 AAs. Is that correct?
Ah, now I remember. The size of 1500 is from when I was building embeddings for k-mers instead of single AAs. When we generate the graph for single AAs, the model should output embeddings of size 20, per the comment.
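For anyone following along, here is a minimal sketch of how the two settings differ on the input side. It only shows tokenization. The helper names `kmer_tokens` and `max_vocab_size`, the `step` parameter, and the toy sequence are illustrative assumptions. The thread does not say whether the k-mers were overlapping, so `step=1` is a guess.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def kmer_tokens(seq, k=1, step=1):
    """Split a protein sequence into k-mer tokens.

    With step=1 the k-mers overlap; with step=k they do not.
    Which scheme the project used is an assumption here.
    """
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, step)]


def max_vocab_size(k):
    """Upper bound on the number of distinct k-mers over 20 amino acids."""
    return len(AMINO_ACIDS) ** k


# Hypothetical toy sequence, only for illustration.
seq = "MKVLHHAG"
single = kmer_tokens(seq, k=1)   # one token per residue
trimers = kmer_tokens(seq, k=3)  # overlapping 3-mers

print(single)
print(trimers)
print(max_vocab_size(1), max_vocab_size(3))
```

Each token list would be one "sentence" passed to Word2Vec. The vocabulary grows as 20^k, which is why a much larger vector size is plausible for k-mers than for single residues.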
Since you mentioned it, I am also very interested in k-mers. How did the k-mer experiment go, then?
Some experiment figures are here:
https://github.com/WesleyyC/Amino-Acid-Embedding/tree/master/Figure
You should be able to open them in MATLAB, but we didn't continue this project, so the context part is not well maintained.
Thanks! What's your main conclusion, then? I am doing something similar with the nr database (ftp://ftp.ncbi.nih.gov/blast/db/FASTA/nr.gz), and I don't see any particular pattern among AAs. My thought is that if there is no strong pattern (if any at all), it suggests that in nature any AA is likely to neighbour any other AA, so at the character level there isn't much difference among them. Would you agree with that?
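One rough way to check the "any AA neighbours any other AA" idea directly, without an embedding, is to compare observed adjacent-pair counts with what independent residue frequencies would predict. This is only a sketch: the function name `adjacent_pair_log_odds` and the toy input are assumptions, and it is not how either project in this thread was evaluated.

```python
from collections import Counter
from math import log2


def adjacent_pair_log_odds(sequences):
    """Return log2(observed / expected) for each observed ordered adjacent pair.

    The expected count assumes the left and right residues are independent.
    Values near 0 mean the pair occurs about as often as chance predicts.
    Pairs that never occur are not included in the result.
    """
    pair_counts = Counter()
    left_counts = Counter()
    right_counts = Counter()
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            pair_counts[(a, b)] += 1
            left_counts[a] += 1
            right_counts[b] += 1

    total = sum(pair_counts.values())
    scores = {}
    for (a, b), observed in pair_counts.items():
        expected = left_counts[a] * right_counts[b] / total
        scores[(a, b)] = log2(observed / expected)
    return scores


# Hypothetical toy input, only for illustration.
scores = adjacent_pair_log_odds(["AGAG"])
for pair, value in sorted(scores.items()):
    print(pair, round(value, 3))
```

If nearly all scores on a large database are close to 0, that would support the view that there is little neighbour preference at the single-residue level.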
Sorry for the late response. I missed the notification for some reason.
For single-AA embeddings, we do see a strong pattern related to their biochemical properties. In addition, we computed a distance matrix from the embeddings and compared it to the BLOSUM matrix. The two appear to be highly correlated.
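A comparison of this kind could be sketched as below. The assumptions are cosine distance and Pearson correlation; the thread does not state which distance or correlation measure was used. The embeddings and substitution scores are placeholder numbers, not real results and not real BLOSUM entries. BLOSUM scores measure similarity, so a good embedding should give a negative correlation with distance.

```python
from itertools import combinations
from math import sqrt


def cosine_distance(u, v):
    """1 - cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


def embedding_vs_substitution(embeddings, substitution):
    """Correlate pairwise embedding distances with substitution scores.

    `substitution` maps alphabetically ordered pairs (a, b) to a score.
    """
    pairs = list(combinations(sorted(embeddings), 2))
    distances = [cosine_distance(embeddings[a], embeddings[b]) for a, b in pairs]
    scores = [substitution[(a, b)] for a, b in pairs]
    return pearson(distances, scores)


# Placeholder numbers: NOT real embeddings and NOT real BLOSUM entries.
toy_embeddings = {"A": [1.0, 0.0], "G": [0.9, 0.1], "K": [0.0, 1.0]}
toy_scores = {("A", "G"): 2, ("A", "K"): -1, ("G", "K"): -2}

r = embedding_vs_substitution(toy_embeddings, toy_scores)
print(r)
```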
For k-mer AA embeddings, we do see a pattern in our graph, but we are not sure whether it is an artifact of the way we generate the k-mers or a true pattern.
The boundaries on your front page look quite artificial to me, e.g. H is much closer to the uncharged AAs than to the charged AAs. Do you have any comment on why, please? I also have a few technical questions:
Thank you.