Consistent nomenclature

ns-rse commented 2 years ago

In seeking to shorten code in issue #113 some inconsistent nomenclature was identified. The following from @smesnage clarifies the changes that need making to make nomenclature consistent in output.

Damn, you opened Pandora's box! Your comment made me realise that there is a lot of inconsistency in the way we describe modifications in the output of the script and/or the choices offered for the Jupyter notebook search. I think the modifications below would be welcome. Please see the attached doc because some changes don't come across in this box!!

1) Sodium searches could be amended

it should be "Sodium adduct (Na+)" in the Jupyter notebook instead of "Sodium"
output is ok as Na+ gm-AEJAA; NO CHANGE REQUIRED

2) Potassium searches could be amended

it should be "Potassium adduct (K+)" in the Jupyter notebook instead of "Potassium"
output is ok as K+ gm-AEJAA; NO CHANGE REQUIRED

3) Anhydro searches could be amended

it should be "Anhydro MurNAc (Anh)" in the Jupyter notebook instead of "Anh";
output should be gm-AEJAA (Anh); NO CHANGE REQUIRED

4) de-acetylation searches could be amended

it should be "De-acetylation (-Ac)" in the Jupyter notebook instead of "DeAc";
output should be gm-AEJAA (-Ac) instead of gm-AEJAA (DeAc);

5) Combined de-acetylation/anhydro searches could be amended

it should be "De-acetylation and anhydroMurNAc (-Ac, Anh)" in the Jupyter notebook instead of "DeAc_Anh";
output should be gm-AEJAA (-Ac, Anh) instead of gm-AEJAA (DeAc_Anh);

6) Nude searches could be amended

it should be "Extra disaccharide (gm~)" in the Jupyter notebook instead of "Nude";
output should be gm~gm-AEJAA instead of gm-gm-AEJAA to distinguish between peptide bonds (-) and glycan bonds (~);

7) Decay searches could be amended

it should be "GlcNAc loss" in the Jupyter notebook instead of "Decay";
output should be m-AEJAA (-GlcNAc) instead of m-AEJAA;

8) Amidation searches could be amended

it should be "Amidation (NH2)" in the Jupyter notebook instead of "Amidated";
output should be gm-AEJAA (NH2) instead of gm-AEJAA (Amidated);

9) Amidase searches could be amended

it should be "Loss of gm (Amidase product)" in the Jupyter notebook instead of "Amidase";
output should be Lac-AEJAA (Amidase product) instead of gm-AEJAA (Amidase product)

10) Double Anhydro searches could be amended

it should be "Double anhydroMurNAc (Double Anh)" in the Jupyter notebook instead of "Double_Anh";
output should be gm-AEJAA (Double Anh) instead of gm-AEJAA (Double_Anh)

11) Multimers, multimers_Glyco and Multimers_Lac could be changed:

change "multimers_Glyco" to "Multimers_Glyco" (add a capital);
check that output is gm-AEJAA instead of GM-AEJAA (we changed the syntax to avoid confusion between g=GlcNAc and G=Glycine, and between m=MurNAc and M=methionine); a crosslink between 2 monomers should appear as "=", so gm-AEJAA-gm-AEJA should become gm-AEJAA=gm-AEJA; this is important to predict fragmentation products.

12) O-acetylation searches could be amended

it should be "O-acetylation (+Ac)" in the Jupyter notebook instead of "O-Acetylated";
output should be gm-AEJAA (+Ac)

This comment has been updated on 14 Oct 2022 to take into account discussions (and valid points!) made by Brooks recently.

Cheers

Amendments to PGFinder semantic.docx Originally posted by @smesnage in https://github.com/Mesnage-Org/pgfinder/issues/113#issuecomment-1172233295
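
The notebook renames in items 1 to 10 and 12 amount to a simple lookup. A minimal sketch of the list as proposed here, assuming the old option names are plain strings; `NOTEBOOK_LABELS` and `relabel` are hypothetical names and not part of PGFinder:

```python
# Hypothetical lookup: old notebook option -> proposed label (items 1-10 and 12).
NOTEBOOK_LABELS = {
    "Sodium": "Sodium adduct (Na+)",
    "Potassium": "Potassium adduct (K+)",
    "Anh": "Anhydro MurNAc (Anh)",
    "DeAc": "De-acetylation (-Ac)",
    "DeAc_Anh": "De-acetylation and anhydroMurNAc (-Ac, Anh)",
    "Nude": "Extra disaccharide (gm~)",
    "Decay": "GlcNAc loss",
    "Amidated": "Amidation (NH2)",
    "Amidase": "Loss of gm (Amidase product)",
    "Double_Anh": "Double anhydroMurNAc (Double Anh)",
    "O-Acetylated": "O-acetylation (+Ac)",
}


def relabel(old_name: str) -> str:
    """Return the proposed label, or the name unchanged if no rename is listed."""
    return NOTEBOOK_LABELS.get(old_name, old_name)
```

Names without an entry, such as the multimer options in item 11, pass through unchanged.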

TheLostLambda commented 2 years ago

Hello all!

I've been working a bit with Marcel on the notation for deacetylated structures! Do these permutations fit with what @ns-rse and @smesnage were thinking?

Having this notation would certainly make my life writing the graph generator easier!

Here are the 16 permutations for a dimer if both sugars can be deacetylated like this:

gm-AEJAA=gm-AEJA
g(-Ac)m-AEJAA=gm-AEJA
gm(-Ac)-AEJAA=gm-AEJA
g(-Ac)m(-Ac)-AEJAA=gm-AEJA
gm-AEJAA=g(-Ac)m-AEJA
g(-Ac)m-AEJAA=g(-Ac)m-AEJA
gm(-Ac)-AEJAA=g(-Ac)m-AEJA
g(-Ac)m(-Ac)-AEJAA=g(-Ac)m-AEJA
gm-AEJAA=gm(-Ac)-AEJA
g(-Ac)m-AEJAA=gm(-Ac)-AEJA
gm(-Ac)-AEJAA=gm(-Ac)-AEJA
g(-Ac)m(-Ac)-AEJAA=gm(-Ac)-AEJA
gm-AEJAA=g(-Ac)m(-Ac)-AEJA
g(-Ac)m-AEJAA=g(-Ac)m(-Ac)-AEJA
gm(-Ac)-AEJAA=g(-Ac)m(-Ac)-AEJA
g(-Ac)m(-Ac)-AEJAA=g(-Ac)m(-Ac)-AEJA
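
The 16 names above are the Cartesian product of four sugar variants on each monomer. A minimal sketch that reproduces the list in the same order; `SUGAR_VARIANTS` and `dimer_permutations` are hypothetical names, and the stems are simply the ones used in the list:

```python
from itertools import product

# The four ways the disaccharide can be written if either sugar may be deacetylated.
SUGAR_VARIANTS = ["gm", "g(-Ac)m", "gm(-Ac)", "g(-Ac)m(-Ac)"]


def dimer_permutations(first_stem="AEJAA", second_stem="AEJA"):
    """Build every cross-linked dimer name from the sugar variants."""
    # product() varies its last element fastest, so unpacking as (second, first)
    # makes the first monomer cycle fastest, matching the order of the list above.
    return [
        f"{first}-{first_stem}={second}-{second_stem}"
        for second, first in product(SUGAR_VARIANTS, repeat=2)
    ]


permutations = dimer_permutations()
```

Dropping the m(-Ac) variants from `SUGAR_VARIANTS` would shrink the output to the four dimers in which only GlcNAc is deacetylated.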

I'll also attach the dimer file with this updated notation that Marcel sent me here: modifiedDimers.txt

Let me know if I should keep rolling with this notation or if you're keen to tweak it still!

ns-rse commented 2 years ago

Thanks @TheLostLambda, unfortunately it's not an area I can comment on as I lack the domain knowledge.

smesnage commented 2 years ago

I'm afraid the nomenclature you used is the correct one!

This being said, I don't think all these structures can exist, so we need to take it easy on the systematic prediction of dimers. I don't think deacetylated MurNAc [m(-Ac)] exists, unless you have shown that gm(-Ac)-AEJA or g(-Ac)m(-Ac)-AEJA exist.

I hope this answers your question.

TheLostLambda commented 2 years ago

Just to document this here, once I start working on this I'll aim to implement @smesnage 's list of tweaks, but also move the modification-containing brackets to immediately follow the group being modified where possible.

I suppose the only thing that makes this a bit tricky is deciding which of these permutations to generate, so that might require a few more changes in the way PGFinder selectively enables modifications.

I suppose, particularly with the long format, we can just generate both possibilities for monomers with something like a deacetylation:

g(-Ac)m-AEJAA
gm(-Ac)-AEJAA

A sanity check for @smesnage: it's impossible to tell from MS1 whether it's the GlcNAc or the MurNAc that has been modified in this case (at least without additional biological insight)?

TheLostLambda commented 2 years ago

Additionally, for all involved: an assumption that my graph generator has operated on for a while is that multimers connected through the sugar chain are linked with a tilde ~ and chains that are cross-linked through their peptide stems are denoted with an equal sign =.

An example above, gm-gm-AEJAA would become gm~gm-AEJAA.

Would I be correct in saying these notation changes never made it out of my forked version of PGFinder? Are we happy in implementing those additional, disambiguating notation changes here?

TheLostLambda commented 2 years ago

As a final note, the ideal situation is that every named structure is as unambiguous as possible (perhaps some day we'll need to think about all of the different ways things can be cross-linked), and that where mass ambiguities / isomers exist, that PGFinder would output a row for each possibility. Let me know if I've got the right long-term vision there :)

smesnage commented 2 years ago

Just to document this here, once I start working on this I'll aim to implement @smesnage 's list of tweaks, but also move the modification-containing brackets to immediately follow the group being modified where possible. I suppose the only thing that makes this a bit tricky is deciding which of these permutations to generate, so that might require a few more changes in the way PGFinder selectively enables modifications. I suppose, particularly with the long format, we can just generate both possibilities for monomers with something like a deacetylation:
g(-Ac)m-AEJAA
gm(-Ac)-AEJAA
A sanity check for @smesnage: it's impossible to tell from MS1 whether it's the GlcNAc or the MurNAc that has been modified in this case (at least without additional biological insight)?

MS1 is your best option to check where the deacetylation is; there is always in-source decay, so you just have to look at whether you lose 203 (GlcNAc) or 161 (GlcN). It is super important to look at this to make sure you only check MS2 with ONE muropeptide candidate (half of the work!!)
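
The check described here can be sketched as a comparison of the mass difference against the two losses. This is only an illustration: the 203 and 161 values are the nominal masses quoted above, while `classify_sugar_loss` and the 0.5 Da tolerance are assumptions, not anything PGFinder does today:

```python
# Nominal neutral losses quoted in the comment above (Da).
GLCNAC_LOSS = 203.0
GLCN_LOSS = 161.0


def classify_sugar_loss(parent_mass, fragment_mass, tolerance=0.5):
    """Name the sugar lost by in-source decay, or return None if neither fits.

    The tolerance is an arbitrary illustrative value.
    """
    loss = parent_mass - fragment_mass
    if abs(loss - GLCNAC_LOSS) <= tolerance:
        return "GlcNAc"
    if abs(loss - GLCN_LOSS) <= tolerance:
        return "GlcN"
    return None
```

On this reading, a loss matching GlcN would point to the g(-Ac)m candidate, and a loss matching GlcNAc would leave the GlcNAc intact.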

smesnage commented 2 years ago

Additionally, for all involved: an assumption that my graph generator has operated on for a while is that multimers connected through the sugar chain are linked with a tilde ~ and chains that are cross-linked through their peptide stems are denoted with an equal sign =. An example above, gm-gm-AEJAA would become gm~gm-AEJAA. Would I be correct in saying these notation changes never made it out of my forked version of PGFinder? Are we happy in implementing those additional, disambiguating notation changes here?

This is an unnecessary complication because the "extra gm" can only be connected to another sugar residue. Please use a normal hyphen!!

smesnage commented 2 years ago

As a final note, the ideal situation is that every named structure is as unambiguous as possible (perhaps some day we'll need to think about all of the different ways things can be cross-linked), and that where mass ambiguities / isomers exist, that PGFinder would output a row for each possibility. Let me know if I've got the right long-term vision there :)

Definitely, but there are linkages that cannot exist. I'm tempted to cross the bridge when we get there: if we find MS1 matches that cannot agree with the simplest MS2 fragmentation list, then we will have to be more creative and think about non canonical linkages. In fact, just made me think that this could explain some incomplete MS2 annotations by Byos (with residues on the glu for example?). That's worth bearing in mind!

TheLostLambda commented 2 years ago

MS1 is your best option to check where the deacetylation is; there is always in-source decay, so you just have to look at whether you lose 203 (GlcNAc) or 161 (GlcN). It is super important to look at this to make sure you only check MS2 with ONE muropeptide candidate (half of the work!!)

This is quite good to know! Certainly makes the downstream analysis easier and sounds like one of the many improvements that could be introduced to PGFinder farther down the line :)

This is an unnecessary complication because the "extra gm" can only be connected to another sugar residue. Please use a normal hyphen!!

This is a good point! I needed the tilde before when the sugars and the amino acids were ambiguous — something like GM-GM-AEJAA was a bit confusing as to if the middle GM was a stem or the chain! Double checking you're still alright with - for chain bonds and = for stem bonds?

Definitely, but there are linkages that cannot exist. I'm tempted to cross the bridge when we get there: if we find MS1 matches that cannot agree with the simplest MS2 fragmentation list, then we will have to be more creative and think about non canonical linkages. In fact, just made me think that this could explain some incomplete MS2 annotations by Byos (with residues on the glu for example?). That's worth bearing in mind!

Could certainly be worth closing the loop between the MS1 and MS2 at some point, where data from the MS2 could be fed back into PGFinder to gather some more data that the MS2 program wants as and when it's needed. Another vision for the farther future though!

TheLostLambda commented 2 years ago

Additionally, for all involved: an assumption that my graph generator has operated on for a while is that multimers connected through the sugar chain are linked with a tilde ~ and chains that are cross-linked through their peptide stems are denoted with an equal sign =. An example above, gm-gm-AEJAA would become gm~gm-AEJAA. Would I be correct in saying these notation changes never made it out of my forked version of PGFinder? Are we happy in implementing those additional, disambiguating notation changes here?

This is an unnecessary complication because the "extra gm" can only be connected to another sugar residue. Please use a normal hyphen!!

@smesnage I've done some more thinking on this and I think I would certainly prefer using the ~ over the - if that's something you could be convinced of. While you're right that structures like gm-gm-AEJA are not chemically ambiguous, the - has two different meanings here: it represents a chain-stem linkage, but also a monomer-monomer linkage. To determine which role - is playing (as is needed to split this dimer into monomers), a closer inspection of the surrounding context is required. This means significant additional code and seems to be an unnecessary overloading of the - symbol.

If we use ~ and = for dimer separation, it's trivial to split up monomers for graph generation, but it's much more work if - has two distinct meanings, discernible only via context. Both the computer and (I think) many users might find identifying gm~gm-AEJA~gm-AEJAA as a trimer easier than gm-gm-AEJA-gm-AEJAA. Using these three symbols, ~ is always a sugar-sugar bond, = is always a peptide-peptide bond, and - is always a sugar-peptide bond.

Let me know if you'd be okay with swapping - for ~ when showing dimer chain linkages! If you are particularly keen on keeping - over ~ despite the multiple meanings, I'm happy to make that work as well :)
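
To illustrate why distinct symbols make splitting trivial, here is a minimal sketch using a regular expression; `split_monomers` and `linkages` are hypothetical helpers and not the graph generator's actual code:

```python
import re


def split_monomers(structure: str) -> list[str]:
    """Split a multimer name on sugar-sugar (~) and peptide-peptide (=) bonds."""
    return re.split(r"[~=]", structure)


def linkages(structure: str) -> list[str]:
    """Return the inter-monomer bond symbols in the order they appear."""
    return re.findall(r"[~=]", structure)


units = split_monomers("gm~gm-AEJA~gm-AEJAA")
```

With - reserved for the sugar-peptide bond, no inspection of the surrounding context is needed: the hyphenated form gm-gm-AEJA-gm-AEJAA cannot be split this way without knowing which residues are sugars.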

smesnage commented 2 years ago

Using these three symbols, ~ is always a sugar-sugar bond, = is always a peptide-peptide bond, and - is always a sugar-peptide bond.

This is an argument that makes sense! I saw the addition of the ~ sign as a complication but in hindsight it's not the case. So yes, you're right, it makes sense to use distinct symbols for distinct bonds.

bobturneruk commented 1 year ago

@ns-rse may (will) have questions given the long discussion.

TheLostLambda commented 1 year ago

Here is the language I developed for describing PG structures, which could be relevant for this issue: peptidoglycan-overview

bobturneruk commented 1 year ago

There have definitely been some developments on this @ns-rse .

TheLostLambda commented 1 year ago

Another note-to-self: Is the whole |0 / |1 / |2 something that we need to keep around?

If multimers are separated by ~ or =, then I don't think we really need those numbers at the end anymore?

smesnage commented 1 year ago

You're in charge! What you suggest seems reasonable to me.

On Thu, 22 Jun 2023, 10:47 Brooks Rady, @.***> wrote:

Another note-to-self: Is the whole |0 / |1 / |2 something that we need to keep around?

If multimers are separated by ~ or =, then I don't think we really need those numbers at the end anymore?


ns-rse commented 1 year ago

A naive comment here as this is based on my poor understanding of the discussion, but the retention of the numerical suffixes might serve a purpose in allowing users to quickly identify/distinguish the number of units in the multimers (if I've understood correctly what they represent!).

Otherwise there is a cognitive overhead in looking at the nomenclature and counting with scope for messing up between ~ and - (I say this having recently started wearing glasses as my eyes are aging!).

TheLostLambda commented 1 year ago

That's a good point! Either way I'm interested in removing them from the mass dictionaries (since those are manually put together), but maybe we add a column that contains the oligomer number!

As a proper column you could sort and filter things too! Essentially more moving to "long format" / tidy data I suppose!
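
A rough sketch of what that could look like, assuming multimers are delimited by ~ or = and that the existing suffix is a trailing `|` followed by digits. The helper names and sample rows are made up for illustration, and the suffix values in them carry no meaning:

```python
import re


def strip_suffix(name: str) -> str:
    """Drop a trailing |<digits> suffix if present (assumed format)."""
    return re.sub(r"\|\d+$", "", name)


def oligomer_number(structure: str) -> int:
    """Count monomer units: one more than the number of ~ and = linkages."""
    return len(re.findall(r"[~=]", structure)) + 1


# Made-up sample rows standing in for a mass dictionary or results table.
rows = [
    {"structure": "gm-AEJAA=gm-AEJA|2"},
    {"structure": "gm-AEJAA|1"},
    {"structure": "gm~gm-AEJA~gm-AEJAA|3"},
]
for row in rows:
    row["structure"] = strip_suffix(row["structure"])
    row["oligomer"] = oligomer_number(row["structure"])

# A proper column can then be used for sorting or filtering.
rows.sort(key=lambda row: row["oligomer"])
```

Because the number is derived from the name, the manually assembled mass dictionaries would not need to carry it.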

TheLostLambda commented 1 year ago

Maybe we should also update the allowed_modifications.csv file to contain a description of each setting (like the docs contain) that can be displayed in UI tooltips!
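
As a sketch of how that might work, assuming the file gained a `description` column: the column names, the sample contents and `load_tooltips` below are all hypothetical, and the real allowed_modifications.csv may be laid out differently.

```python
import csv
import io

# Hypothetical file contents with placeholder descriptions.
EXAMPLE_CSV = """modification,description
Anhydro MurNAc (Anh),Placeholder tooltip text for the anhydro search
Amidation (NH2),Placeholder tooltip text for the amidation search
"""


def load_tooltips(handle) -> dict[str, str]:
    """Map each modification name to its description for use in UI tooltips."""
    return {row["modification"]: row["description"] for row in csv.DictReader(handle)}


tooltips = load_tooltips(io.StringIO(EXAMPLE_CSV))
```

Keeping the descriptions next to the settings would let the docs and the UI draw on the same text.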

TheLostLambda commented 1 year ago

Adding another note before trying to tackle this today — there are more nuances to developing a "PG Structural Language" than are probably in scope here... A recent issue is that the lovely tilde (~) character is an escape character in Excel...

I think that settling on a proper fix to this is out of scope for now and something I can develop separately from PGFinder at PMI. For now, I'm going to try to slay the original dragon here: making modification names consistent!

I also eventually plan on making the PG language (Plang?!) expressive enough to communicate structures unambiguously (given MS2 data), or more ambiguously (like is done by PGFinder using MS1 data). The TL;DR here is I think keeping the modification list in brackets at the end of a structure is fine for now.

TheLostLambda commented 1 year ago

9) Amidase searches could be amended

it should be "Loss of gm (Amidase product)" in the Jupyter notebook instead of "Amidase";

output should be Lac-AEJAA (Amidase product) instead of gm-AEJAA (Amidase product)

Is the Lac here really correct? Most of the figures I can find only imply that the Lac stays on the MurNAc and that just AEJAA would be correct here. I'll check the code I suppose, but I know @smesnage said both were possible at some point?

TheLostLambda commented 1 year ago

Another question for @smesnage

10) Double Anhydro searches could be amended

it should be "Double anhydroMurNAc (Double Anh)" in the Jupyter notebook instead of "Double_Anh";

output should be gm-AEJAA (Double Anh) instead of gm-AEJAA (Double_Anh)

That example doesn't really make sense to me, since there is only one MurNAc in that structure... Is this modification only ever applied to dimers and beyond?

TheLostLambda commented 1 year ago

I'd also like to propose something more generic / reusable like (2Anh) or, at the moment, (Anh, Anh) instead of something like (Double Anh). Let me know what others think!
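
A small sketch of the two styles, assuming modifications are held as a list of short names; `format_modifications` is a hypothetical helper written only to compare the outputs:

```python
from collections import Counter


def format_modifications(mods, compact=True):
    """Render a bracketed modification list, optionally collapsing repeats."""
    if compact:
        # Counter keeps first-seen order, so the output order follows the input.
        counts = Counter(mods)
        parts = [f"{n}{name}" if n > 1 else name for name, n in counts.items()]
    else:
        parts = list(mods)
    return "(" + ", ".join(parts) + ")"
```

Here `format_modifications(["Anh", "Anh"])` gives `(2Anh)`, while passing `compact=False` gives `(Anh, Anh)`.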

TheLostLambda commented 1 year ago

And I think a final science question for @smesnage :

What does Multimers_Lac actually mean / look like? I assume Multimers_Glyco is building dimers and beyond by forming only glycosidic bonds between them?

But how is the Lac involved here?

TheLostLambda commented 1 year ago

This is the list I've got at the moment (added some personal tweaks from the list at the top of the issue):

Let me know @smesnage what to do about the Multimers_Lac and if you hate anything else I've done there!

smesnage commented 1 year ago

Hi,

Sorry I could not reply earlier. Here are all the answers:

AMIDASE SEARCHES You're right, the lactyl group stays on the MurNAc; we saw lactyl-peptide fragments in some searches, but these should be the product of enzymes that would be etherases, not described yet. Forget about them, stick to the description of amidase products WITHOUT the lactyl group. Well spotted!

DOUBLE ANHYDRO I vaguely remember that we searched for those based on the Pseudomonas paper in the literature. I believe they were enabling us to pick up dimers with 2 anhydro. The example I used is nonsensical. I would remove this modification, it's useless. The next comment is no longer an issue.

MULTIMER LAC Because the formation of dimers is hard coded, this is only used for the formation of dimers when you make a search with lactyl-peptides. This was for the eLife paper: I did beta-elimination of mutanolysin digestion products (you treat with ammonia for 5h at 37°C, then neutralise with acetic acid; this generates lactyl-peptides). If you're only interested in crosslinks and not bothered about sugars, it makes life simpler. Need to keep this option!

Final table looks good, just one inconsistency: "GlcNAc loss" sits next to "Loss of disaccharide". Should it be "Loss of GlcNAc"?

Let me know if you have any questions!

TheLostLambda commented 1 year ago

@smesnage Double-checking, the Amidation is only roughly a -1 mass because the NH2 is replacing an OH, right? Something like a glutamate to a glutamine?

smesnage commented 1 year ago

Correct. Can happen on Glu or mDAP St
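
The "roughly -1" figure can be checked with standard monoisotopic atomic masses. A short worked calculation, where the variable names are just for illustration:

```python
# Monoisotopic atomic masses in Da.
N = 14.0030740
H = 1.0078250
O = 15.9949146

nh2 = N + 2 * H  # group gained on amidation
oh = O + H       # group lost on amidation

amidation_delta = nh2 - oh
print(f"{amidation_delta:.4f}")  # prints -0.9840
```

So swapping an OH for an NH2 lowers the monoisotopic mass by about 0.984 Da.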

On Sun, 27 Aug 2023, 12:29 Brooks Rady, @.***> wrote:

@smesnage Double-checking, the Amidation is only roughly a -1 mass because the NH2 is replacing an OH, right? Something like a glutamate to a glutamine?


TheLostLambda commented 1 year ago

We've stomped Pandora's box shut for the moment, but the longer I look at this, the more inconsistencies crop up... It's certainly better than before, and I'd argue good enough for now!

Mesnage-Org / pgfinder

Consistent nomenclature #114