JJAlmagro / subcellular_localization

Can you please publish your preprocessing of the data #2

Open chachalin opened 5 years ago

meichangsu1 commented 5 years ago

Could you please publish your preprocessing of the data?

JJAlmagro commented 5 years ago

Which part of the preprocessing are you referring to? Do you mean the encoding of the amino acid sequence into BLOSUM62 or profiles?

meichangsu1 commented 5 years ago

yes

chachalin commented 5 years ago

Thank you for your response! My problem was solved!

meichangsu1 commented 5 years ago

I would also like to know why you use BLOSUM62 or profiles instead of a one-hot encoding of the source protein. Thank you very much.

meichangsu1 commented 5 years ago

> Thank you for your response! My problem was solved!

How did you solve your problem? I am very interested in it.

chachalin commented 5 years ago

> How did you solve your problem? I am very interested in it.

As you know, by using BLOSUM62 and its profiles. But you can ask the author why he uses them.

meichangsu1 commented 5 years ago

> As you know, by using BLOSUM62 and its profiles. But you can ask the author why he uses them.

Then may I ask how you obtained the BLOSUM62 encoding and the protein profiles? I am a bioinformatics beginner and only know that BLOSUM62 is used for comparing sequence similarity.

JJAlmagro commented 5 years ago

To create the protein profiles you can use PROFILpro (http://download.igb.uci.edu). I can later add the function that I used to encode the amino acid sequence into a matrix (the input to the neural network), if that is what you want.
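For readers unfamiliar with the term, a protein profile is a per-position matrix that summarizes which amino acids occur at each position across related sequences. The sketch below is only an illustration of that idea, not PROFILpro's algorithm: it computes plain column frequencies from a toy alignment. The function name `profile_from_alignment` and the example alignment are hypothetical.

```python
from collections import Counter

# The 20 canonical amino acids; the column order is an arbitrary choice here.
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def profile_from_alignment(aligned_sequences):
    """Return an L x 20 matrix of per-column amino acid frequencies.

    Gap characters and any non-canonical symbols are ignored when counting.
    """
    if not aligned_sequences:
        raise ValueError("need at least one aligned sequence")
    length = len(aligned_sequences[0])
    if any(len(seq) != length for seq in aligned_sequences):
        raise ValueError("aligned sequences must all have the same length")

    profile = []
    for column in zip(*aligned_sequences):
        counts = Counter(aa for aa in column if aa in AMINO_ACIDS)
        total = sum(counts.values())
        # A column made only of gaps yields an all-zero row.
        row = [counts[aa] / total if total else 0.0 for aa in AMINO_ACIDS]
        profile.append(row)
    return profile


# Toy alignment of three hypothetical homologs ("-" marks a gap).
toy_alignment = ["MKVL", "MRIL", "MK-L"]
profile = profile_from_alignment(toy_alignment)
```

Each row of `profile` corresponds to one alignment column, so the result has the same length-by-20 shape as a BLOSUM62 encoding. Real profile tools typically add sequence weighting and pseudocounts, which this sketch omits.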

The disadvantage of one-hot encoding is that it assumes all amino acids are equally different from each other. This is not the case: some amino acids share properties, so substituting an amino acid with one that has similar properties will have a smaller effect on the protein's function or structure. We include this information by encoding the protein using BLOSUM62 or protein profiles, because similar amino acids have similar representations in these matrices.
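The point about one-hot encoding can be illustrated with a short sketch. This is not the function used for DeepLoc, only a minimal example that assumes each residue is replaced by its row of the standard BLOSUM62 matrix (20 canonical amino acids, no ambiguity codes). The names `blosum62_encode`, `one_hot_encode` and `squared_distance`, and the choice to map unknown residues to an all-zero row, are assumptions made for illustration.

```python
# Standard BLOSUM62 scores for the 20 canonical amino acids.
# Rows and columns both follow the order given in AMINO_ACIDS.
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"
_BLOSUM62_ROWS = """
 4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0
-1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3
-2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3
-2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3
 0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1
-1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2
-1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2
 0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3
-2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3
-1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3
-1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1
-1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2
-1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1
-2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1
-1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2
 1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2
 0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0
-3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3
-2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1
 0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4
"""
BLOSUM62 = {
    aa: [int(score) for score in row.split()]
    for aa, row in zip(AMINO_ACIDS, _BLOSUM62_ROWS.strip().splitlines())
}


def one_hot_encode(sequence):
    """Return a len(sequence) x 20 matrix with a single 1 per row."""
    return [[1 if aa == ref else 0 for ref in AMINO_ACIDS]
            for aa in sequence.upper()]


def blosum62_encode(sequence):
    """Return a len(sequence) x 20 matrix of BLOSUM62 rows.

    Assumption: residues outside the 20 canonical amino acids
    (for example "X") are encoded as an all-zero row.
    """
    zero_row = [0] * len(AMINO_ACIDS)
    return [list(BLOSUM62.get(aa, zero_row)) for aa in sequence.upper()]


def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


# Isoleucine vs. leucine (similar) and isoleucine vs. aspartate (dissimilar).
ile_hot, leu_hot, asp_hot = one_hot_encode("ILD")
print("one-hot  I-L:", squared_distance(ile_hot, leu_hot),
      " I-D:", squared_distance(ile_hot, asp_hot))
ile, leu, asp = blosum62_encode("ILD")
print("BLOSUM62 I-L:", squared_distance(ile, leu),
      " I-D:", squared_distance(ile, asp))
```

With one-hot vectors every pair of distinct residues is equally far apart, while the BLOSUM62 rows place isoleucine much closer to leucine than to aspartate.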

meichangsu1 commented 5 years ago

Yes, that is exactly what I want. If you can add it, thank you very much!

murakdar commented 4 years ago

@JJAlmagro I would also greatly appreciate seeing how protein sequences are encoded into a matrix form to be used as input for the neural network.

A-Alaa commented 4 years ago

> To create the protein profiles you can use PROFILpro (http://download.igb.uci.edu). I can later add the function that I used to encode the amino acid sequence into a matrix (the input to the neural network), if that is what you want.
>
> The disadvantage of one-hot encoding is that it assumes all amino acids are equally different from each other. This is not the case: some amino acids share properties, so substituting an amino acid with one that has similar properties will have a smaller effect on the protein's function or structure. We include this information by encoding the protein using BLOSUM62 or protein profiles, because similar amino acids have similar representations in these matrices.

@JJAlmagro I would appreciate it if you could publish the function that you used to convert each protein sequence into a 400x20 matrix. Thanks in advance!

A-Alaa commented 4 years ago

@JJAlmagro It is mentioned in the paper under a figure:

> Proteins shorter than 1000 amino acids are padded from the middle, so the N-terminus and C-terminus align. Proteins longer than 1000 amino acids have the middle part removed.

So this approach was used not only for visualization but also during training, to unify the protein lengths, whether you choose 1000 or 400 as the global length, right? If so, IMO the preprocessing section of the paper is missing a statement to that effect.
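The figure caption quoted above can be read as the following length-fixing step. This is a hedged sketch, not code from the repository: the function name `fit_to_length`, the `-` padding symbol, and the even split between the N-terminal and C-terminal halves are all assumptions.

```python
def fit_to_length(sequence, target_len=1000, pad_char="-"):
    """Pad or crop a sequence in the middle so both termini stay aligned.

    Longer sequences keep their first and last residues and lose the middle.
    Shorter sequences get padding inserted in the middle.
    """
    n = len(sequence)
    head = target_len // 2        # residues kept from the N-terminus
    tail = target_len - head      # residues kept from the C-terminus

    if n >= target_len:
        # Remove the middle part, keep both ends.
        return sequence[:head] + sequence[n - tail:]

    # Assumption: split a short sequence at its midpoint and pad in between.
    split = n // 2
    return sequence[:split] + pad_char * (target_len - n) + sequence[split:]


# Small target length so the effect is easy to see.
print(fit_to_length("ABCDEFGHIJ", target_len=4))
print(fit_to_length("ABCD", target_len=8))
```

The same indices could be applied to the rows of an encoded matrix instead of the raw string, with all-zero rows in place of the padding symbol. With `target_len=400` this would give the 400x20 shape mentioned earlier in the thread.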

budzakj commented 2 years ago

Did anyone ever get the function on converting the encoded amino acid sequences into a matrix? @JJAlmagro I am not sure if I can easily use DeepLoc for retraining with novel data without this function. Thank you!

BranchW commented 2 years ago

> Did anyone ever get the function on converting the encoded amino acid sequences into a matrix? @JJAlmagro I am not sure if I can easily use DeepLoc for retraining with novel data without this function. Thank you!

Have you solved this problem? I need help, too.

yuzhiguo07 commented 2 years ago

Could you please share the code showing how you crop protein sequences that are longer than 1000 amino acids? The longest protein in the DeepLoc dataset has 13100.