MannLabs / alphabase

Infrastructure of AlphaX ecosystem
https://alphabase.readthedocs.io
Apache License 2.0

dynamic generation of SMILES for PTMs #199

Closed boopthesnoot closed 1 month ago

boopthesnoot commented 2 months ago

You can find the description and examples in docs/nbs/tutorial_smiles.ipynb

GeorgWa commented 2 months ago

I would suggest integrating the data into modification.tsv and amino_acid.yaml: https://github.com/MannLabs/alphabase/blob/737f25c79c34de5182140543bf887ab61d7e53d5/alphabase/constants/const_files/amino_acid.yaml

That way we don't have to add a new constants folder or keep double bookkeeping of modification names. I think we can add two keys for each entry: sum/composition and smiles.

In modification.tsv you can just add a smiles column.

GeorgWa commented 2 months ago

@boopthesnoot Have a look at #200. I created an optional extra and a decorator for rdkit.
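For readers who have not opened #200, here is a minimal sketch of how a decorator can guard functions that need an optional dependency. It is an illustration only, not the code from #200: the name `requires_package`, the extra name `smiles`, and the helper `smiles_to_formula` are all hypothetical.

```python
import functools
import importlib.util


def requires_package(package_name, extra_name=None):
    """Hypothetical decorator: raise a clear ImportError if an optional package is missing."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # find_spec returns None when a top-level package is not installed.
            if importlib.util.find_spec(package_name) is None:
                hint = f" Install it via the '{extra_name}' extra." if extra_name else ""
                raise ImportError(
                    f"'{func.__name__}' requires the optional package '{package_name}'.{hint}"
                )
            return func(*args, **kwargs)

        return wrapper

    return decorator


@requires_package("rdkit", extra_name="smiles")  # extra name is an assumption
def smiles_to_formula(smiles):
    """Illustrative helper: derive a sum formula from a SMILES string."""
    # Imported lazily so the module itself loads without rdkit.
    from rdkit import Chem
    from rdkit.Chem.rdMolDescriptors import CalcMolFormula

    return CalcMolFormula(Chem.MolFromSmiles(smiles))
```

With this pattern the check runs at call time, so the rest of the package imports fine when rdkit is absent, and only the SMILES-specific functions fail, with an actionable message.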

boopthesnoot commented 2 months ago

@GeorgWa But the SMILES in modification.tsv will be a mess: some of them will be AAs with PTMs, some of them will be terminal modifications only, without the AA, and we would still have to store which is which somewhere. And adding a key for each of the AAs in amino_acid.yaml would still leave us with double bookkeeping of the atomic composition, because we can infer it from the SMILES. Of course, inferring it would mean having an rdkit dependency for the whole package x)

GeorgWa commented 2 months ago

We could resolve this by looking up the localizer @Any N-Term. Alternatively, we can also introduce a second column location = {'N','C','AA'}, which would use dynamic or fixed SMILES depending on the value.
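As a rough sketch of the first option, the location can be derived from the site part of the modification name (the text after `@`). The function name `mod_location` and the exact spelling of the terminal sites are assumptions for illustration, so the match is case-insensitive and only looks for `n-term` or `c-term`.

```python
def mod_location(mod_name: str) -> str:
    """Return 'N', 'C' or 'AA' based on the site part of a name like 'Dimethyl@K'."""
    # Everything after the last '@' is treated as the site (assumed naming scheme).
    _, _, site = mod_name.rpartition("@")
    site = site.lower()
    if "n-term" in site:
        return "N"
    if "c-term" in site:
        return "C"
    # Anything else is assumed to be a plain amino-acid site.
    return "AA"


for name in ("Dimethyl@K", "Acetyl@Any N-Term", "Amidated@Any C-term"):
    print(name, "->", mod_location(name))
```

Deriving the value keeps the name as the single source of truth; an explicit location column would instead make the choice visible in the table, at the cost of one more field to keep consistent.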

In alphabase, modification names like Dimethyl@K are the primary keys across all applications. I think this primary key should only be defined once. Furthermore, the master record in modification.tsv is updated automatically from Unimod when more modifications are added. This way everything will stay in sync.

jalew188 commented 2 months ago

Yes, I think we should use only one PTM and AA definition file to avoid ambiguity in the future.

jalew188 commented 2 months ago

We should use aa.tsv instead of aa.yaml for AAs, similar to modification.tsv

GeorgWa commented 1 month ago

I just noticed that the dtype of the unimod column in modification.tsv changed to float. Can we change it back?
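One common way an integer column turns into float in pandas is an empty cell, which is read as NaN. Whether that is what happened here is an assumption, and the column name `unimod_id` and the two-row excerpt below are made up for illustration. A minimal sketch of the effect and of reading the column with a nullable integer dtype instead:

```python
import io

import pandas as pd

# Hypothetical two-row excerpt; the second row has no Unimod ID.
tsv = "mod_name\tunimod_id\nAcetyl@K\t1\nCustom@K\t\n"

# Default parsing: the empty cell becomes NaN, which forces the column to float64.
df_float = pd.read_csv(io.StringIO(tsv), sep="\t")
print(df_float["unimod_id"].dtype)

# Nullable integer dtype keeps whole numbers and stores the gap as <NA>.
df_int = pd.read_csv(io.StringIO(tsv), sep="\t", dtype={"unimod_id": "Int64"})
print(df_int["unimod_id"].dtype)
```

Another option would be to fill missing IDs with a sentinel value before writing the file, so the column stays a plain integer.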