DeepRank / deeprank2

An open-source deep learning framework for data mining of protein-protein interfaces or single-residue variants.
https://deeprank2.readthedocs.io/en/latest/?badge=latest
Apache License 2.0

Allow for >20 types in `AminoAcid` class #198

Closed DTRademaker closed 5 months ago

DTRademaker commented 2 years ago

Hello,

I am working with the AminoAcid class (deeprankcore/models/amino_acid.py) because the Residue class (deeprankcore/models/structure.py) requires it, and I am running into an issue. The 20 common amino acids are nicely initialized in deeprankcore/domain/amino_acid.py, but I often also have non-standard or unknown amino acids. It is quite easy to create a new amino acid with new properties; however, deeprankcore/models/amino_acid.py is hardcoded to a one-hot encoding of length 20, which leaves no room for other amino acids. Can this one-hot length be changed somehow? Or could it at least be 21 long, so that we always have a spot for 'other'?

Daniel

DaniBodor commented 2 years ago

Hi Daniel, I can look into this. Having a 21st slot for 'other' should definitely be doable, but I will check whether it can be open-ended.

Just out of curiosity, how is it possible that you have an amino-acid that is not part of the standard 20?

DaniBodor commented 2 years ago

Also, is there a limited list of additional amino acids that you get (or at least main ones)? If so, maybe it makes sense to add them to the default list in addition to allowing for extra user-defined ones.

DaniBodor commented 2 years ago

A quick and dirty solution is to change `a = numpy.zeros(20)` to `a = numpy.zeros(len(amino_acids))` in the `one_hot` definition of the `AminoAcid` class (in deeprankcore/models/amino_acid.py). This assumes that you have added your new amino acids to the list at the end of the module.
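As a minimal sketch of the idea (not the actual deeprankcore code): the `amino_acids` list below is a shortened stand-in holding plain strings, and the `one_hot` helper and its parameters are hypothetical. The point is only that the vector length follows the list length instead of a hardcoded 20.

```python
import numpy

# Hypothetical, shortened stand-in for the module-level list.
# The real list holds AminoAcid objects, not strings.
amino_acids = ["ALA", "CYS", "ASP"]


def one_hot(index: int, n_types: int) -> numpy.ndarray:
    """Return a vector of length n_types with a single 1 at position index."""
    a = numpy.zeros(n_types)  # sized by the caller, not hardcoded to 20
    a[index] = 1.0
    return a


# A user appends a new amino acid to the end of the list ...
amino_acids.append("SEC")

# ... and the encoding grows with the list automatically.
encoding = one_hot(amino_acids.index("SEC"), len(amino_acids))
```

Note that every encoding created before the list was extended would have a different length, so the list has to be complete before any features are generated.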

Going forward, we might want to make this more sustainable by creating a class that contains each potential amino acid (similar to what we do for the AtomicResidue class).

DTRademaker commented 2 years ago

> Just out of curiosity, how is it possible that you have an amino-acid that is not part of the standard 20?

Well, there are many more non-canonical amino acids, for example Selenocysteine (also known as the "21st proteinogenic amino acid"). In some cases amino acids form covalent bonds with other molecules such as phosphates or sugar groups, and researchers might want to label them differently.

> A quick and dirty solution...

Yes, this is possible, and normally I would also do it like this :), but it is not desirable in the case I am working on. I want to publish my code and share it with people who have installed the 'standard' deeprankcore code.

After thinking about it, I think this function does not belong in the AminoAcid class at all, but should be incorporated in the features section, where researchers could add as many extra non-canonical amino acids as they want.
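A rough sketch of what such a feature-level encoding could look like. Everything here is hypothetical and not part of deeprankcore: the function name `residue_type_feature`, the `extra_types` parameter, and the use of three-letter residue names as keys. It combines the user-extensible list with a trailing slot for 'other'.

```python
import numpy

# Three-letter codes of the 20 canonical amino acids.
CANONICAL = [
    "ALA", "ARG", "ASN", "ASP", "CYS", "GLN", "GLU", "GLY", "HIS", "ILE",
    "LEU", "LYS", "MET", "PHE", "PRO", "SER", "THR", "TRP", "TYR", "VAL",
]


def residue_type_feature(residue_name, extra_types=()):
    """One-hot encode a residue name (hypothetical feature function).

    The vector has one slot per canonical amino acid, one per user-supplied
    extra type, and a final slot for anything that is not recognized.
    """
    vocabulary = list(CANONICAL) + list(extra_types)
    encoding = numpy.zeros(len(vocabulary) + 1)
    try:
        index = vocabulary.index(residue_name.upper())
    except ValueError:
        index = len(vocabulary)  # unknown residue: use the 'other' slot
    encoding[index] = 1.0
    return encoding


# Example: register selenocysteine and phosphoserine as extra types.
sep = residue_type_feature("SEP", extra_types=("SEC", "SEP"))
unknown = residue_type_feature("XYZ", extra_types=("SEC", "SEP"))
```

With this layout the `AminoAcid` class itself stays untouched, and each project chooses its own vocabulary when generating features.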

DaniBodor commented 2 years ago

> In some cases amino acids form covalent bonds with other molecules such as phosphates or sugar groups, and researchers might want to label them differently.

I would say modified amino acids are not different amino acids. I do agree it would be nice to have an entirely new feature module for PTMs; I had already thought about that. It is probably not a priority right now, but it would be a nice addition at some point in the future.

> Well, there are many more non-canonical amino acids, for example Selenocysteine (also known as the "21st proteinogenic amino acid").

As for the non-canonical amino acids, I am not aware of many. I think there is the one you mentioned and some lysine variant that only exists in bacteria. Both are quite rare (at least in humans, but I think across biology) and very similar to a canonical counterpart, so I doubt much information gets lost by labeling them as that counterpart. Very few projects consider more than the 20 core amino acids, unless the non-canonical ones are specifically being studied.

If you feel it's important to have these as part of the main code before publication, feel free to add them to the Amino Acid module (now in deeprankcore/models/amino_acid.py) and create a pull request for this.

github-actions[bot] commented 2 years ago

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] commented 2 years ago

This issue was closed because it has been inactive for 7 days since being marked as stale.

DaniBodor commented 1 year ago

@DTRademaker, I just realized that the 2 main (only?) non-canonical amino acids (selenocysteine and pyrrolysine) were in fact already defined in the aminoacidlist.py module, just not included in the amino_acid list itself. I have now added them in PR #272.

I noticed that these are currently indexed (one-hot encoded) as their canonical counterparts. Would it be better to give them their own one-hot encoding, or is it OK to keep them as is? (I guess the ideal would be to make this an option, but I am not sure it's worth the effort to program it in.)
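To make the two alternatives concrete, here is a small sketch of how such an option might behave. The `encode` function and the `separate_noncanonical` flag are hypothetical names, and the counterpart mapping (Sec to Cys, Pyl to Lys) is an assumption about which canonical residues are meant; none of this is the deeprankcore implementation.

```python
import numpy

# Three-letter codes of the 20 canonical amino acids.
CANONICAL = [
    "ALA", "ARG", "ASN", "ASP", "CYS", "GLN", "GLU", "GLY", "HIS", "ILE",
    "LEU", "LYS", "MET", "PHE", "PRO", "SER", "THR", "TRP", "TYR", "VAL",
]

# Assumed canonical counterparts of the two non-canonical amino acids.
COUNTERPART = {"SEC": "CYS", "PYL": "LYS"}


def encode(residue_name, separate_noncanonical=False):
    """One-hot encode a residue, optionally giving Sec/Pyl their own slots."""
    if separate_noncanonical:
        # 22 slots: the canonical 20 plus PYL and SEC (alphabetical order).
        vocabulary = CANONICAL + sorted(COUNTERPART)
    else:
        # 20 slots: Sec and Pyl share the index of their counterpart.
        vocabulary = CANONICAL
        residue_name = COUNTERPART.get(residue_name, residue_name)
    encoding = numpy.zeros(len(vocabulary))
    encoding[vocabulary.index(residue_name)] = 1.0
    return encoding


shared = encode("SEC")                               # same vector as CYS
separate = encode("SEC", separate_noncanonical=True)  # own slot, length 22
```

The trade-off is that the separate encoding changes the feature length from 20 to 22, so models trained with one setting cannot consume data generated with the other.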

DaniBodor commented 1 year ago

OK, so including Sec and Pyl actually leads to problems during parsing. For now I will not spend time trying to resolve this. Maybe in the future we can look into it more closely if there is a direct application for it.

github-actions[bot] commented 1 year ago

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.