Call for feedback: New lipid naming convention

zakandrewking commented 4 years ago

There is a nice proposal from @michaelwitting at WormJam for a systematic naming convention for lipids. It's described here:

https://github.com/JakeHattwell/wormjam/issues/11

This convention would generate nice-looking BiGG IDs. A couple examples:

Phosphatidylcholines:
1,2-diacylglycerophosphocholine --> 1ac2acg3pc (old pchol)
1-acylglycerophosphocholine --> 1acg3pc
2-acylglycerophosphocholine --> 2acg3pc

1-alkyl-2-acylglycerophosphocholine --> 1alk2acg3pc
1-alkylglycerophosphocholine --> 1alkg3pc

1-alkenyl-2-acylglycerophosphocholine --> 1alken2acg3pc
1-alkenylglycerophosphocholine --> 1alkeng3pc

If we adopt this, we will probably do it through the standard BiGG process: we will not remove old IDs (so pchol will stay), and new IDs will come in as we add models to BiGG. However, we can provide an extra level of checking to help users adopt these new IDs.

Tagging some BiGG watchers who might have feedback on this: @nel3 @neemajamshidi @matthiaskoenig @draeger @rmtfleming @phantomas1234 @willigott @cdanielmachado @smoretti @djinnome @cnorsig @jmcconn @jtyurkovich

cdanielmachado commented 4 years ago

Sounds like a good plan to me.

draeger commented 4 years ago

It seems reasonable. A question remains: Will old and new IDs coexist in the future? Would BiGG list the old IDs then as legacy identifiers and store them in a separate table?

zakandrewking commented 4 years ago

@draeger In this case, we will not replace any old IDs. Just add new ones.

matthiaskoenig commented 4 years ago

I don't like that these are not valid SBML identifiers (they start with a digit). Also, it should be made clear that these are all triglycerols, i.e. they use a glycerol backbone with up to three possible connections. There are other backbones which allow a wide range of connections; by being too specific here, these other things cannot be encoded in a uniform manner.

I would highly recommend prefixing these to make the IDs clearer, valid (SBML), and usable:

1,2-diacylglycerophosphocholine --> tg1ac2acg3pc (old pchol)
1-acylglycerophosphocholine --> tg1acg3pc
2-acylglycerophosphocholine --> tg2acg3pc

1-alkyl-2-acylglycerophosphocholine --> tg1alk2acg3pc
1-alkylglycerophosphocholine --> tg1alkg3pc

1-alkenyl-2-acylglycerophosphocholine --> tg1alken2acg3pc
1-alkenylglycerophosphocholine --> tg1alkeng3pc
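
The SBML point can be checked mechanically. A minimal sketch, assuming the SBML `SId` syntax (a letter or underscore first, then letters, digits, or underscores); the helper name `is_valid_sid` is made up for illustration:

```python
import re

# SBML SId syntax: a letter or underscore, then letters, digits, or underscores.
SID_PATTERN = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def is_valid_sid(identifier: str) -> bool:
    """Return True if the string matches the SBML SId syntax."""
    return SID_PATTERN.fullmatch(identifier) is not None


# Unprefixed IDs start with a digit and fail; prefixed ones pass.
for candidate in ["1ac2acg3pc", "tg1ac2acg3pc", "pchol"]:
    print(candidate, is_valid_sid(candidate))
```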

michaelwitting commented 4 years ago

I think adding the triglycerol prefix here causes confusion with triacylglycerols. The name already contains g3pc, which is the ID for glycero-3-phosphocholine; the 1ac and 2ac state that it has additional acyl groups at positions 1 and 2.

matthiaskoenig commented 4 years ago

It will be much easier to work with all triacylglycerols if they have a common prefix, because you can just filter the subset based on the prefix and list the remaining 3 chains. Then even I understand the rules for creating all possibilities:

1. start with tg prefix
2. list the up to 3 chains, each preceded by the number of its connection; 1 is hereby the ... C atom (clarify this for chirality, i.e. which one is the 1). If there is no connection at one of the 3 C atoms, leave it out. Order from 1 to 3.
```
1,2-diacylglycerophosphocholine --> tg1ac2ac3pc (old pchol)
1-acylglycerophosphocholine --> tg1ac3pc
2-acylglycerophosphocholine --> tg2ac3pc
```

1-alkyl-2-acylglycerophosphocholine --> tg1alk2ac3pc
1-alkylglycerophosphocholine --> tg1alk3pc

1-alkenyl-2-acylglycerophosphocholine --> tg1alken2ac3pc
1-alkenylglycerophosphocholine --> tg1alken3pc
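
As a rough sketch of these two rules (the function name `build_id` and the dict-based input are assumptions for illustration, not part of the proposal):

```python
def build_id(substituents: dict[int, str], prefix: str = "tg") -> str:
    """Build an ID from a prefix and the substituents at positions 1 to 3.

    Positions without a substituent are simply left out of the dict.
    """
    for position in substituents:
        if position not in (1, 2, 3):
            raise ValueError(f"invalid glycerol position: {position}")
    # Rule 2: list the chains in order of their position, 1 to 3.
    chains = "".join(f"{pos}{substituents[pos]}" for pos in sorted(substituents))
    # Rule 1: start with the prefix.
    return prefix + chains


print(build_id({1: "ac", 2: "ac", 3: "pc"}))  # tg1ac2ac3pc
print(build_id({3: "pc", 1: "alken"}))        # tg1alken3pc

# A shared prefix makes it easy to select the whole subset.
ids = ["tg1ac3pc", "pchol", "tg2ac3pc"]
print([i for i in ids if i.startswith("tg")])
```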

matthiaskoenig commented 4 years ago

By the way, this is also super easy to parse, and it does not depend on whether the pc is on position 1 or 3.
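
For instance, a regular expression is enough to split such an ID back into its parts. This is only a sketch: it assumes every substituent is a position digit followed by a simple lowercase label, and `parse_id` is a hypothetical helper:

```python
import re

# One substituent: a position digit (1-3) followed by a lowercase label.
SUBSTITUENT = re.compile(r"([123])([a-z]+)")


def parse_id(identifier: str, prefix: str = "tg") -> dict[int, str]:
    """Split a prefixed ID into a {position: substituent} mapping."""
    if not identifier.startswith(prefix):
        raise ValueError(f"{identifier!r} does not start with {prefix!r}")
    body = identifier[len(prefix):]
    # The whole body must consist of digit-plus-label pairs.
    if not re.fullmatch(r"(?:[123][a-z]+)+", body):
        raise ValueError(f"cannot parse {identifier!r}")
    return {int(pos): label for pos, label in SUBSTITUENT.findall(body)}


print(parse_id("tg1alk2ac3pc"))  # {1: 'alk', 2: 'ac', 3: 'pc'}
print(parse_id("tg2ac3pc"))      # {2: 'ac', 3: 'pc'}
```

The position of each substituent is read from the ID itself, so the parser does not need to know in advance where the headgroup sits.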

michaelwitting commented 4 years ago

I see your points, but not all of them are tri-acyl-glycerols. Based on lipid biochemistry the glycerol-backbone is fixed to be sn-glycero-3-phosphate (coming from the synthesis). We could do it the other way round, having the lipid class in front and then the chain configuration, e.g. pc1ac2ac. Ideally the nomenclature should be consistent (at least in part) with the nomenclature used in the lipidomics field.

matthiaskoenig commented 4 years ago

Only listing the chains without a prefix will create problems if there is only one modification, which for instance would then be pc1 or 1pc, which is probably already used as an ID for other things. By having a clear prefix, the namespace becomes unique. Also, there are other backbones which would require a prefix, e.g. the sphingolipids, which then have to be something like sp2ac3pc to distinguish them from 2ac3pc. So why not name it something like g2ac3pc or tg2ac3pc? Then it is clear from the ID what the backbone is.

michaelwitting commented 4 years ago

Then I would go for the version with g instead of tg, which avoids confusion with real triacylglycerols. Sphingolipids will become a bit more tricky in that regard, because there are several backbones possible. Rules for encoding the backbone would be needed. For example, C. elegans uses C17iso sphingoid bases, which are not found in mammals. I will think about different ways for the phospholipids.

michaelwitting commented 4 years ago

Made up my mind. We should go for the prefix. It also works well with some IDs that are already in BiGG. An example for glycero- and glycerophospholipids: g3pc and g3pe are already in BiGG for sn-glycero-phosphocholine and sn-glycero-phosphoethanolamine. g1ac3pc would then represent a 1-acyl-sn-glycero-phosphocholine, etc. I will prepare a table with the "old" IDs and the new systematic ones. I would need this anyway for our WormJam model.

michaelwitting commented 4 years ago

Here is a table with examples from the WormJam model.

Class	Metabolite	Old / Wrong / Duplicated ID (WormJam)	Correct / New ID
MG	1-acyl-sn-glycerol	1magol	g1ac
MG	2-acyl-sn-glycerol	mag	g2ac
MG-O	1-alkyl-sn-glycerol	---	g1alk
MG-P	1-(Z)-alk-1-enyl-sn-glycerol	alkenglyc	g1alken
DG	1,2-diacyl-sn-glycerol	12dag	g1ac2ac
DG-O	1-alkyl-2-acyl-sn-glycerol	akac2g	g1alk2ac
DG-P	1-(Z)-alk-1-enyl-2-acyl-glycerol	alkenac2g	g1alken2ac
TG	Triacyl-glycerol	tag	g1ac2ac3ac
TG-O	1-alkyl-2,3-diacylglycerol	---	g1alk2ac3ac
TG-P	1-(Z)-alk-1-enyl-2,3-diacylglycerol	---	g1alken2ac3ac
DHAP	1-acylglycerone 3-phosphate	Adhap	dhap1ac
DHAP-O	1-alkylglycerone 3-phosphate	akdhap	dhap1alk
PA	1,2-diacyl-sn-glycero-3-phosphate	pa_pl	g1ac2ac3p
PA	1,2-diacyl-sn-glycero-3-phosphate	12dag3p	g1ac2ac3p
LPA	1-acyl-sn-glycero-3-phosphate	alpa	g1ac3p
LPA	1-acyl-sn-glycero-3-phosphate	alpa_tag	g1ac3p
LPA	1-acyl-sn-glycero-3-phosphate	1ag3p_SC	g1ac3p
LPA	2-acyl-sn-glycero-3-phosphate	---	g2ac3p
LPA-O	1-alkyl-sn-glycero-3-phosphate	alkgp	g1alk3p
PA-O	1-alkyl-2-acyl-sn-glycero-3-phosphate	akac2gp	g1alk2ac3p
LPA-P	1-(Z)-alk-1-enyl-sn-glycero-3-phosphate	---	g1alken3p
PA-P	1-(Z)-alk-1-enyl-2-acyl-sn-glycero-3-phosphate	---	g1alken2ac3p
PC	1,2-diacyl-sn-glycero-3-phosphocholine	pchol	g1ac2ac3pc
LPC	1-acyl-sn-glycero-3-phosphocholine	ag3pc	g1ac3pc
LPC	2-acyl-sn-glycero-3-phosphocholine	2agpc	g2ac3pc
PC-O	1-alkyl-2-acyl-sn-glycero-3-phosphocholine	akac2gchol	g1alk2ac3pc
LPC-O	1-alkyl-sn-glycero-3-phosphocholine	ak2lgchol	g1alk3pc
PC-P	1-(Z)-alk-1-enyl-2-acyl-sn-glycero-3-phosphocholine	---	g1alken2ac3pc
LPC-P	1-(Z)-alk-1-enyl-sn-glycero-3-phosphocholine	---	g1alken3pc
PE	1,2-diacyl-sn-glycero-3-phosphoethanolamine	pe	g1ac2ac3pe
PE	1,2-diacyl-sn-glycero-3-phosphoethanolamine	pe_BAC	g1ac2ac3pe
LPE	1-acyl-sn-glycero-3-phosphoethanolamine	acg3pe	g1ac3pe
LPE	2-acyl-sn-glycero-3-phosphoethanolamine	---	g2ac3pe
PE-O	1-alkyl-2-acyl-sn-glycero-3-phosphoethanolamine	akac2gpe	g1alk2ac3pe
LPE-O	1-alkyl-sn-glycero-3-phosphoethanolamine	---	g1alk3pe
PE-P	1-(Z)-alk-1-enyl-2-acyl-sn-glycero-3-phosphoethanolamine	alkenac2gpe	g1alken2ac3pe
LPE-P	1-(Z)-alk-1-enyl-sn-glycero-3-phosphoethanolamine	alken2gpe	g1alken3pe
PS	1,2-diacyl-sn-glycero-3-phospho-L-serine	ps	g1ac2ac3ps
LPS	1-acyl-sn-glycero-3-phospho-L-serine	acg3ps	g1ac3ps
LPS	2-acyl-sn-glycero-3-phospho-L-serine	---	g2ac3ps
PI	1,2-diacyl-sn-glycero-3-phospho(1)-D-myo-inositol	pail	g1ac2ac3pi
LPI	1-acyl-sn-glycero-3-phospho(1)-D-myo-inositol	---	g1ac3pi
LPI	2-acyl-sn-glycero-3-phospho(1)-D-myo-inositol	---	g2ac3pi
PIP	1,2-diacyl-sn-glycero-3-phospho(1)-D-myo-inositol-3-phosphate	pail3p	g1ac2ac3pi3p
PIP	1,2-diacyl-sn-glycero-3-phospho(1)-D-myo-inositol-4-phosphate	pail4p	g1ac2ac3pi4p
PIP	1,2-diacyl-sn-glycero-3-phospho(1)-D-myo-inositol-5-phosphate	pail5p	g1ac2ac3pi5p
PIP2	1,2-diacyl-sn-glycero-3-phospho(1)-D-myo-inositol-3,4-bisphosphate	pail34p	g1ac2ac3pi3p4p
PIP2	1,2-diacyl-sn-glycero-3-phospho(1)-D-myo-inositol-3,5-bisphosphate	pail35p	g1ac2ac3pi3p5p
PIP2	1,2-diacyl-sn-glycero-3-phospho(1)-D-myo-inositol-4,5-bisphosphate	pail45p	g1ac2ac3pi4p5p
PIP3	1,2-diacyl-sn-glycero-3-phospho(1)-D-myo-inositol-3,4,5-trisphosphate	pail345p	g1ac2ac3pi3p4p5p
PGP	1,2-diacyl-sn-glycero-3-phospho-(1'-sn-glycero-3'-phosphate)	pgp	g1ac2ac3pg3p
PG	1,2-diacyl-sn-glycero-3-phospho-(1'-sn-glycerol)	pg	g1ac2ac3pg
PG	1,2-diacyl-sn-glycero-3-phospho-(1'-sn-glycerol)	pg_BAC	g1ac2ac3pg

With multiply phosphorylated PI headgroups (PIP, PIP2 and PIP3) the back part of the ID gets a bit complicated, but I guess it is still fine. An alternative is to separate this part with a _, e.g. g1ac2ac3pi_3p4p. CDP-DGs are still open; they would be g1ac2ac3cdp.
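
Since the old IDs stay in place, the table can also be read as an alias map. A minimal sketch using a handful of rows copied from the table (the names `OLD_TO_NEW` and `duplicates` are illustrative only) that lists which new IDs collect several old ones:

```python
from collections import defaultdict

# A few (old WormJam ID, new systematic ID) pairs copied from the table.
OLD_TO_NEW = {
    "pchol": "g1ac2ac3pc",
    "pa_pl": "g1ac2ac3p",
    "12dag3p": "g1ac2ac3p",
    "alpa": "g1ac3p",
    "alpa_tag": "g1ac3p",
    "1ag3p_SC": "g1ac3p",
    "pe": "g1ac2ac3pe",
    "pe_BAC": "g1ac2ac3pe",
}


def duplicates(old_to_new: dict[str, str]) -> dict[str, list[str]]:
    """Group old IDs by new ID and keep new IDs with more than one old ID."""
    grouped = defaultdict(list)
    for old_id, new_id in old_to_new.items():
        grouped[new_id].append(old_id)
    return {new: olds for new, olds in grouped.items() if len(olds) > 1}


for new_id, old_ids in duplicates(OLD_TO_NEW).items():
    print(new_id, "<-", ", ".join(old_ids))
```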

matthiaskoenig commented 4 years ago

This looks great. Some comments below, not sure we have to solve all of this.

Perhaps we could add an example for a phosphatidylcholine?
How should cardiolipin be written? Or should we use another abbreviation for it?
Should we handle the sphingolipids analogously? Can we add some examples?
How should the different variants of ac side chains be encoded? It would be great to have some convention for this.

The following often occurs: ac=ac16 (palmitate by default; what does ac mean exactly?), ac18 (stearate), ac20, ac22, ac24. What about unsaturated variants (these need the position of the double bond and cis/trans)?

michaelwitting commented 4 years ago

For the moment, this covers only the generic versions. I have already thought about ways to encode specific acyl, alkyl or alkenyl chains. The position and stereochemistry of the double bond should be encoded. I developed something that is suitable for fatty acids, acyl-CoAs, etc. (everything that has a single acyl chain). I wrote my thoughts down in a manuscript-type document, and I think I will put it on a preprint server soonish. @matthiaskoenig if you want, I can send the current rough and preliminary version via email for you to check.

michaelwitting commented 4 years ago

I will think about the cardiolipins and sphingolipids; this might be a bit tricky.

matthiaskoenig commented 4 years ago

@michaelwitting Yes, please send the preprint. I will give you feedback on it (konigmatt[AT]googlemail.com).

michaelwitting commented 4 years ago

Hi all. I would like to revive the discussion here. I was thinking about how to encode the side chains and the sphingoid bases etc. The main question is which level of detail is required. It would be good to have enough details to be able to reconstruct the chemical structure. In lipidomics shorthand notations like PC(16:0/16:1(9Z)) are used. Maybe this can be adapted?
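
To get a feeling for how much detail such a shorthand carries, here is a small parsing sketch. It only handles the simple `Class(C:DB(positions)/...)` form shown above; other forms, such as ether-linked chains, would need further rules, and the function and field names are assumptions for illustration:

```python
import re

# Lipid class, then the chain list in parentheses, e.g. PC(16:0/16:1(9Z)).
SHORTHAND = re.compile(r"(?P<cls>[A-Za-z-]+)\((?P<chains>.+)\)")
# One chain: carbons:double_bonds, optionally followed by (positions).
CHAIN = re.compile(
    r"(?P<carbons>\d+):(?P<double_bonds>\d+)(?:\((?P<positions>[^)]*)\))?"
)


def parse_shorthand(name: str) -> tuple[str, list[dict]]:
    """Split a shorthand name into its lipid class and a list of chains."""
    match = SHORTHAND.fullmatch(name)
    if match is None:
        raise ValueError(f"cannot parse {name!r}")
    chains = []
    for part in match["chains"].split("/"):
        chain = CHAIN.fullmatch(part)
        if chain is None:
            raise ValueError(f"cannot parse chain {part!r}")
        positions = chain["positions"].split(",") if chain["positions"] else []
        chains.append({
            "carbons": int(chain["carbons"]),
            "double_bonds": int(chain["double_bonds"]),
            "positions": positions,  # e.g. ["9Z"]
        })
    return match["cls"], chains


print(parse_shorthand("PC(16:0/16:1(9Z))"))
```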

SBRG / bigg_models

Call for feedback: New lipid naming convention #354