WesleyyC / Amino-Acid-Embedding

:microscope: Train an Amino Acid Embedding (or a dragon?)
MIT License

Comments on the boundaries #2

Open zyxue opened 6 years ago

zyxue commented 6 years ago

The boundaries on your front page look quite artificial to me, e.g. H is much closer to the uncharged AAs than to the charged AAs. Do you have any comment on why that is? I also have a few technical questions:

  1. What's the dimension of the learned vectors?
  2. Did you try visualization with a more straightforward dimensionality reduction technique, e.g. PCA?
  3. Do you have any thoughts on the result in the context of AA neighbors? I realize a major difference between language and AAs is that the meaning of a word is well characterized by its neighbors, so an arbitrary concatenation of words won't make sense. But for AAs, any combination seems to be possible; however, the resulting peptide may not be very useful.

Thank you.

WesleyyC commented 6 years ago

Thank you for reaching out!

The boundaries are indeed a little bit artificial. However, many residues sit far from their supposed group for a reason. For the histidine you mention, if you compare it to the other charged AAs, it is in fact quite different. In fact,

Histidine's pKa can easily be perturbed by its surroundings, e.g. by the surrounding residues in an enzyme active site, which makes it highly functionally versatile, one of the manifestations of its functional and chemical versatility being its ability to behave both as a polar/charged amino acid, as well as a hydrophobic residue.

Therefore, that might be the reason why we see H closer to the uncharged residues. That's also why having a representation in a continuous space could be useful.

For your technical questions:

  1. I don't remember exactly what I used to generate the graph, but this is the line of code shown in the data folder (see the sketch after this list): model = gensim.models.Word2Vec(sentences, size=1500, window=10, workers=8, sg=1)
  2. Yes, but I don't think I have the result. I remember it wasn't impressive, which is why we moved on to t-SNE. I am happy to include your PCA result if you end up making one.
  3. Yes, but by the same token, you could put any words together and they won't make any sense (i.e. be useful). However, I agree with you that the AA's context doesn't really define the AA; instead it defines a context in which the AA you want to predict will be at the center.
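For what it's worth, here is a minimal, hedged sketch of what questions 1 and 2 involve. The toy sequences and most parameter choices are my own, not the repo's exact script, and it assumes gensim < 4.0, where the dimension argument is size (renamed to vector_size in gensim 4.x):

```python
import gensim
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Each "sentence" is a protein sequence tokenized into single-letter AA tokens.
sequences = ["MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", "GAVLIPFMWSTCYNQDEKRH" * 3]
sentences = [list(seq) for seq in sequences]

# Q1: the learned vector dimension is the `size` argument (20 here, per the
# repo's code comment; gensim < 4.0 is assumed).
model = gensim.models.Word2Vec(sentences, size=20, window=10, workers=8, sg=1,
                               min_count=1)

aas = [aa for aa in "ACDEFGHIKLMNPQRSTVWY" if aa in model.wv]
X = np.array([model.wv[aa] for aa in aas])

# Q2: project the 20-dim vectors to 2-D with both PCA and t-SNE for comparison.
pca_2d = PCA(n_components=2).fit_transform(X)
tsne_2d = TSNE(n_components=2, perplexity=5).fit_transform(X)

# And on the histidine question: inspect H's nearest neighbors directly.
print(model.wv.most_similar("H", topn=5))
```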

Hope these answer your concerns.

zyxue commented 6 years ago

Thanks for your reply!

I am confused by the size parameter. In your code, the comment says

# size: layer of neural net/ dimension, we set it to 20 because we only have 20 voca

but you call it with 1500:

model = gensim.models.Word2Vec(sentences, size=1500, window=10, workers=8, sg=1)

I understand it as the size of the learned vector per amino acid, in which case it shouldn't be larger than 20, since there are only 20 AAs. Is that correct?

WesleyyC commented 6 years ago

Ah, now I remember. The size of 1500 is from when I was building embeddings for k-mers instead of single AAs. When we generate the graph for single AAs, the model should output an embedding of size 20, per the comment.
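For context, a k-mer setup along these lines (my reconstruction of the idea, not necessarily the repo's exact tokenization) could look like:

```python
import gensim

def kmers(seq, k=3):
    """Tokenize a sequence into overlapping k-mers, e.g. 'MKTA' -> ['MKT', 'KTA']."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

sequences = ["MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"]
kmer_sentences = [kmers(seq) for seq in sequences]

# With k-mers the vocabulary can hold up to 20**k types (8000 for k=3), so a
# much larger `size` such as 1500 is plausible, unlike the 20 for single AAs.
model = gensim.models.Word2Vec(kmer_sentences, size=1500, window=10, workers=8,
                               sg=1, min_count=1)
```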

zyxue commented 6 years ago

Since you mentioned it: I am also very interested in k-mers. How did the k-mer experiments go, then?

WesleyyC commented 6 years ago

Some experiment figures are here:

https://github.com/WesleyyC/Amino-Acid-Embedding/tree/master/Figure

so you should be able to open them in MATLAB. We didn't continue this project, though, so the context part is not well maintained.

zyxue commented 6 years ago

Thanks! What's your main conclusion, then? I am doing something similar with the nr database (ftp://ftp.ncbi.nih.gov/blast/db/FASTA/nr.gz), and I don't see any particular pattern among AAs. My thought is that if there is no strong pattern (if any at all), it suggests that in nature any AA is likely to be a neighbour of any other AA, so at the character level there isn't much difference among them. Would you agree with that?

WesleyyC commented 6 years ago

Sorry for the late response. I missed the notification for some reason.

For the single-AA embedding, we do see a strong pattern with respect to the residues' biochemical properties. In addition, we computed a distance matrix from the embeddings and compared it to the BLOSUM matrix. The two appear to be highly correlated.
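A sketch of that kind of comparison (my own reconstruction, not the repo's code; it assumes Biopython >= 1.75 for BLOSUM62 and uses a toy model as a stand-in for the real one):

```python
from itertools import combinations

import gensim
import numpy as np
from scipy.stats import spearmanr
from Bio.Align import substitution_matrices

# Toy single-AA model as a stand-in for the trained one (gensim < 4.0 assumed).
sequences = ["MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", "GAVLIPFMWSTCYNQDEKRH" * 3]
model = gensim.models.Word2Vec([list(s) for s in sequences],
                               size=20, window=10, workers=8, sg=1, min_count=1)

blosum62 = substitution_matrices.load("BLOSUM62")
aas = [aa for aa in "ACDEFGHIKLMNPQRSTVWY" if aa in model.wv]

emb_dist, blosum_score = [], []
for a, b in combinations(aas, 2):
    va, vb = model.wv[a], model.wv[b]
    # Cosine distance between the two embedding vectors.
    emb_dist.append(1.0 - np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb)))
    blosum_score.append(blosum62[a, b])

# Similar AAs (high BLOSUM score) should sit close together (low distance),
# so a strong relationship shows up as a negative rank correlation.
rho, p = spearmanr(emb_dist, blosum_score)
print(rho, p)
```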

For the k-mer AA embedding, we do see a pattern in our graph, but we are not sure whether it's an artifact of the way we generate the k-mers or a genuine pattern.