Closed edikedik closed 1 year ago
In general I agree with @edikedik.
I would propose to use the _chem_comp.type
property from the PDB Chemical Component Dictionary as it is already done in filter_nucleotides()
. I would be happy to implement that as I already wrote the relevant scripts for filter_nucleotides()
.
Furthermore, as @edikedik suggested we could add a filter_canonical_amino_acids()
and for completeness filter_canonical_nucleotides()
.
This sounds like a good solution to me. Using the _chem_comp.type
should give a comprehensive list of amino acids. I am also in favor of the filter_canonical_xxx()
versions. @edikedik Would it be OK for you, if @tomtomhdx implements this functionality since, as he wrote, he already has experience in parsing the components.cif
?
3. Indeed, I suspect you planned it as such, as you rely on it in the
filter_backbone
@edikedik You are correct. This could be problematic, if e.g. dihedral angles are calculated for a structure containing an amino acid ligand. Do you know whether amino acid ligands use ATOM
or HETATM
records in PDB entries? Could you give an example PDB ID?
Would it be OK for you, if @tomtomhdx implements this functionality
Sure thing. To complete the list of supported standard (macromolecular) polymers, @tomtomhdx could also add filter_carbohydrates
implemented similarly.
As for the filter_backbone
, I believe the valid solution would be to leave it as it is but provide the function discriminating between polymer vs. non-polymer entities — something like filter_polymer_xxx
where xxx
is the polymer type ("peptide", for instance). Or one could allow providing the polymer type as the argument. This would allow the filtering of backbone atoms that actually belong to the protein chain and solve the issue of amino acid ligands.
@padix-key As for the latter, there are plenty. In general, typing any 3-letter code (ALA, for instance) results, among other things, in a list of entries where the molecule is used as a standalone ligand. Indeed, the ones I viewed used HETATM
records.
I'll throw in an example for the filter_polymer
function I suggested above.
import biotite.structure as bst
from functools import partial
def filter_polymer(array, min_size=2, pol_type='peptide'):
"""
Parameters
----------
array : AtomArray or AtomArrayStack
The array to filter.
min_size : int
The minimum number of monomers.
pol_type : str
The polymer type, either "peptide", "nucleotide", or "carbohydrate".
Abbreviations are supported: "p", "pep", "n", etc.
Returns
-------
filter : ndarray, dtype=bool
This array is `True` for all indices in `array`, where atoms belong to
consecutive polymer entity having at least `min_size` monomers.
"""
split_idx = bst.check_res_id_continuity(array)
check_pol = partial(_is_polymer, min_size=min_size, pol_type=pol_type)
bool_idx = map(
lambda a: np.full(len(a), check_pol(bst.array(a)), dtype=bool),
np.split(array, split_idx)
)
return np.concatenate(list(bool_idx))
def _is_polymer(array, min_size, pol_type):
if pol_type.startswith('p'):
filt_fn = bst.filter_amino_acids
elif pol_type.startswith('n'):
filt_fn = bst.filter_nucleotides
elif pol_type.startswith('c'):
# filt_fn = bst.filter_carbohydrates
pass
else:
raise ValueError(f'Unsupported polymer type {pol_type}')
mask = filt_fn(array)
return bst.get_residue_count(array[mask]) >= min_size
I think filter_polymer()
has a good use case, especially if a certain minimum length is required. Could you open a PR? Did you already test this function, especially whether np.split()
works for non-ndarray
objects?
I think
filter_polymer()
has a good use case, especially if a certain minimum length is required. Could you open a PR? Did you already test this function, especially whethernp.split()
works for non-ndarray
objects?
Sure. I tested it on protein molecules, it works as expected, although it would benefit from more rigorous testing. As it depends on other filtering functions discussed above, should I maybe wait until @tomtomhdx implements them? If you aren't planning to add filter_carbohydrates
, the waiting is unnecessary.
I think there is no need to hurry. @tomtomhdx Would you also like to
implement filter_carbohydrates()
?
Ivan Reveguk @.***> schrieb am Sa., 17. Sep. 2022, 14:40:
I think filter_polymer() has a good use case, especially if a certain minimum length is required. Could you open a PR? Did you already test this function, especially whether np.split() works for non-ndarray objects?
Sure. I tested it on protein molecules, it works as expected, although it would benefit from more rigorous testing. As it depends on other filtering functions discussed above, should I maybe wait until @tomtomhdx https://github.com/tomtomhdx implements them? If you aren't planning to add filter_carbohydrates, the waiting is unnecessary.
— Reply to this email directly, view it on GitHub https://github.com/biotite-dev/biotite/issues/424#issuecomment-1250064432, or unsubscribe https://github.com/notifications/unsubscribe-auth/AGWAS6MBWY5KLT4K3DJNQKLV6W35JANCNFSM6AAAAAAQOKGMPM . You are receiving this because you were mentioned.Message ID: @.***>
@tomtomhdx Would you also like to implement
filter_carbohydrates()
?
Yes will do!
I've tested new functions, they work as expected. Thank you, @tomtomhdx !
Hi,
The function
filter_amino_acids
utilizes a hardcoded list of residues corresponding to the canonical alphabet.Although it seems sane, it may lead to unexpected results when non-canonical residues reside within the protein sequence. Indeed, in this case, function wrongly makes the protein chain discontinuous likely causing downstream errors.
I see three possible solutions.
filter_canonical_amino_acids
clearly states the functionality.filter_backbone
. However, there are cases when amino acids are used as ligands. Thus, an extension/improvement, I believe, should rely on both protein continuity and maybe utilize the extended list I suggested above. For instance, an empirical workflow is to break the chain into continuous pieces and treat any residue bounded by residues from the hardcoded list (such as_ext_aa_list
) as an amino acid.Let me know what you think. If you agree that, at least, there is a problem, and we come to some consensus, I could implement it via PR. Thanks.