Improve Sugar and Other Ligand Distinctions for Selections

BradyAJohnston commented 2 years ago

Referencing the discussion in #20 about better methods for distinguishing sugars and other modifications to the protein structure, as well as ligands.

In example 6WTH, the oligosaccharides are treated as separate chains, as they are stated in the PDB (H, I, J). The NAG ligands, however, are all categorised as a single chain in the selection node.

This is because the chains are allocated based on the matching to a dictionary of the unique molecules in the structure.

In [7]: import atomium as at

In [8]: pdb = at.fetch("6WTH")

In [9]: pdb.model.molecules()
Out[9]:
{<Chain A (381 residues)>,
 <Chain B (413 residues)>,
 <Chain C (389 residues)>,
 <Chain D (105 residues)>,
 <Chain E (118 residues)>,
 <Chain F (115 residues)>,
 <Chain G (73 residues)>,
 <Chain H (2 residues)>,
 <Chain I (2 residues)>,
 <Chain J (2 residues)>,
 <Ligand NA (A.700)>,
 <Ligand NAG (A.701)>,
 <Ligand NAG (A.702)>,
 <Ligand NAG (B.700)>,
 <Ligand NAG (B.705)>}

It seems the different NAG ligands are also assigned residue numbers inside atomium (e.g. A.701), which could potentially be used for selections. Currently this AA_sequence_number isn't passed on to Molecular Nodes and is likely lost somewhere in the parsing.
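As a rough illustration of how those identifiers could feed a selection, the sketch below splits an ID string of the form shown in the output above (`A.701`) into a chain ID and a residue number. The helper `split_ligand_id` is hypothetical and works on plain strings only; it is not part of atomium or Molecular Nodes, and it assumes IDs without insertion codes.

```python
def split_ligand_id(ligand_id):
    """Split an ID such as 'A.701' into (chain_id, residue_number).

    Hypothetical helper: assumes the 'CHAIN.NUMBER' form seen in the
    atomium output above, with a purely numeric residue number.
    """
    chain_id, _, number = ligand_id.partition(".")
    return chain_id, int(number)


# IDs copied from the 6WTH output above.
ligand_ids = ["A.700", "A.701", "A.702", "B.700", "B.705"]
parsed = [split_ligand_id(i) for i in ligand_ids]
```

With both parts available, a ligand could in principle be addressed by chain and residue number together rather than by name alone.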

Unsure whether each ligand should be treated as its own separate chain, or if there should be some other kind of selection field that is usable for ligands etc.

@PlethoraChutney would welcome thoughts / comments!

PlethoraChutney commented 2 years ago

Oh that's very interesting. That's how the boss builds sugars (separate chain), but mine are part of the chain to which they're attached. When loading my unreleased PDB into MolNodes there is also a chain each for NAG, MAN, and BMA, with two, one, and one sugar in the respective selections. All are part of the same tetrasaccharide. Chains A, B, and C (which is where I prefer to build sugars) still have the majority of the sugars; these are distinguishable only by turning off all other residues. Interestingly, the NAG, MAN, BMA sugars appear only in those chains --- they are not selected by chain A, B, or C.

What's weirder is that the residues atomium has selected as being separate molecules, and thus selectable in MolNodes, are as far as I can tell selected at random. They are not part of a separate chain in the PDB file, nor does PyMOL or ChimeraX treat the sugars as part of anything but the chain they're built on.

My first guess would be that perhaps atomium stores an example of each type of sugar it encounters as a ligand for some reason, but I'd expect it to store the first of each type of sugar in chain A if that were the case, and not stuff from chain C and two copies of NAG, as below:

In [1]: import atomium as at

In [2]: pdb = at.open('secret-sorry.pdb')

In [3]: pdb.model.molecules()
Out[3]:
{<Chain A (418 residues)>,
 <Chain B (493 residues)>,
 <Chain C (453 residues)>,
 <Ligand BMA (C.763)>,
 <Ligand MAN (C.764)>,
 <Ligand NAG (C.761)>,
 <Ligand NAG (C.762)>}

In [4]: pdb.model.chains()
Out[4]: {<Chain A (418 residues)>, <Chain B (493 residues)>, <Chain C (453 residues)>}

Mystery! I would be happy to send you this PDB if you pinky promise not to scoop me, or after my dissertation advisory committee meeting tomorrow (lol) I'll dig around the PDB for some other models with heavy glycosylation.

PyMOL showing everything but chains A, B, and C (i.e., nothing):

Blender showing chains NAG, BMA, MAN (only this single tetrasaccharide is shown):

PlethoraChutney commented 2 years ago

I realize I didn't add many of the useful thoughts you actually asked for...

For glycosylation, I'd most often want to show at a polysaccharide level specificity, i.e., "All the sugars on Asn 200" or whatever. For other ligands, like some kind of inhibitor, you'd probably only have one or two anyway, so all/none would most likely be fine.

In fact, the only reason I wanted to select the sugars was to change their representation. I think everything else could be accomplished by combining a "ligand"-level selector with a distance node. So probably being able to select a pseudochain of, e.g., every NAG would be sufficient for any uses I can think of, because you could just use the distance node to get more specific than that.
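To make the "ligand-level selector plus distance" idea concrete, here is a minimal sketch in plain Python. It is not how the Molecular Nodes distance node works internally; the function `select_near`, the atom dictionaries, and the coordinates are all made up for illustration.

```python
import math


def select_near(atoms, res_name, anchor_res_id, cutoff):
    """Return atoms named `res_name` within `cutoff` of any anchor-residue atom.

    Hypothetical helper: `atoms` is a list of dicts with 'res_name',
    'res_id' and 'xyz' keys. Distances are in the units of 'xyz'.
    """
    anchor_points = [a["xyz"] for a in atoms if a["res_id"] == anchor_res_id]
    return [
        a
        for a in atoms
        if a["res_name"] == res_name
        and any(math.dist(a["xyz"], p) <= cutoff for p in anchor_points)
    ]


# Toy data with invented coordinates: one Asn and two NAG residues.
atoms = [
    {"res_name": "ASN", "res_id": 200, "xyz": (0.0, 0.0, 0.0)},
    {"res_name": "NAG", "res_id": 701, "xyz": (1.5, 0.0, 0.0)},
    {"res_name": "NAG", "res_id": 702, "xyz": (30.0, 0.0, 0.0)},
]
near = select_near(atoms, "NAG", anchor_res_id=200, cutoff=5.0)
```

Here the broad "every NAG" selection is narrowed to the one sugar close to Asn 200, which is the kind of refinement described above.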

BradyAJohnston commented 2 years ago

Okay thanks for the details! Seems like if we have AA_sequence_number be associated with the appropriate sugars, then you can combine the AA number and the chain IDs for sugar / ligands.

Would making just a single 'oligosaccharides' chain make sense? Or one for each type? We could also utilise the AA_name and make it more general purpose so that it includes sugars. It's already being used for DNA / RNA, so expanding it, and maybe renaming it to cover any kind of repeating monomer, might be useful.

PlethoraChutney commented 2 years ago

I think if it's easy it'd be best to use AA_name, just because there's a world where one might want to style the different sugar types differently, for instance

BradyAJohnston commented 1 year ago

If you have the time, are you able to have a play around with the development version of Molecular Nodes? I haven't created a new release for it, but you can now download the repo .zip and install the addon that way and see how you go.

I've added specific attributes for is_carb etc, and Biotite, which now powers the .pdb parsing, might handle the chains / numbering etc a bit better.

From testing the 6WTH pdb, they'll all be part of the same chain, but I can potentially now have a separate selection that is some combination of res_name & res_id & chain_id that should be able to distinguish between most ligands.
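A small sketch of why the combination of chain_id, res_id and res_name is enough to tell most ligands apart: grouping atoms by that triple gives one group per ligand copy. The atom records below are toy data (the residue numbers are taken from the atomium output earlier in the thread, the atom names are only illustrative), and nothing here is Molecular Nodes or Biotite code.

```python
from collections import defaultdict

# Toy records: (chain_id, res_id, res_name, atom_name).
atoms = [
    ("A", 701, "NAG", "C1"),
    ("A", 701, "NAG", "C2"),
    ("A", 702, "NAG", "C1"),
    ("B", 700, "NAG", "C1"),
    ("B", 705, "NAG", "C1"),
]

# One group per distinct (chain_id, res_id, res_name) triple.
groups = defaultdict(list)
for chain_id, res_id, res_name, atom_name in atoms:
    groups[(chain_id, res_id, res_name)].append(atom_name)
```

Selecting on res_name alone would lump all four NAG copies together, while the triple keeps each one separate even though they share a name.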

In the meantime, you can also use the Mesh Island node and its Island Index to get the distinct ligands / chains for selections, which potentially allows you to select and style specific ligands.
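Conceptually, an island index labels each connected piece of the mesh. The sketch below shows that idea with a small union-find over a bond list; it illustrates connected-component labelling in general and is not Blender's implementation. The function name `island_indices` and the numbering order (by first appearance) are assumptions.

```python
def island_indices(n_atoms, bonds):
    """Label each atom with the index of its connected component.

    Hypothetical helper: `bonds` is a list of (i, j) atom-index pairs.
    Islands are numbered in order of first appearance.
    """
    parent = list(range(n_atoms))

    def find(i):
        # Follow parent links to the root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b in bonds:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    labels = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(n_atoms)]


# Toy example: atoms 0-2 bonded together, atoms 3-4 bonded, atom 5 alone.
islands = island_indices(6, [(0, 1), (1, 2), (3, 4)])
```

One consequence of this approach is that sugars bonded to each other, or to the protein, fall into the same island, so it separates free-standing ligands best.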

Any and all feedback welcomed!

BradyAJohnston commented 1 year ago

Additionally, with glycans in particular, from my understanding there should be a limited list of potential glycans that are possible to have in a structure, yes? If so, is there a dictionary / table of them somewhere? This could be hardcoded in and have its own separate node for selection based on particular glycans etc.

PlethoraChutney commented 1 year ago

Sorry I didn't have time before release, but it works!!

res_id is awesome. The two common ways of adding glycans are: same chain as protein but with a high index or distinct glycan chain. Either way, the new selector nodes will be able to grab them.

> Additionally with glycans in particular, from my understanding, there should be a limited list of potential glycans that are possible to have in a structure, yes?

caveat: I am not a glycobiologist;

Yep! There are loads of sugars in the PDB, but I'd bet the ones you'll see nine times out of ten are GlcNAc (NAG), $\beta$-D-mannopyranose (BMA), and $\alpha$-D-mannopyranose (MAN), since those compose the base chain of human N-glycosylations.
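A hardcoded table along those lines could start as small as the sketch below. It only contains the three residue names mentioned here, so it is deliberately incomplete, and the names `COMMON_GLYCANS` and `is_common_glycan` are hypothetical rather than anything in Molecular Nodes.

```python
# Deliberately tiny lookup: only the three sugars named in this thread.
COMMON_GLYCANS = {
    "NAG": "GlcNAc",
    "BMA": "beta-D-mannopyranose",
    "MAN": "alpha-D-mannopyranose",
}


def is_common_glycan(res_name):
    """Return True if `res_name` is one of the sugars in the table above."""
    return res_name.strip().upper() in COMMON_GLYCANS


res_names = ["ASN", "NAG", "HOH", "BMA", "MAN"]
glycans = [name for name in res_names if is_common_glycan(name)]
```

A real table would need many more entries, since the PDB contains far more sugar types than these three.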

Thanks so much Brady! So excited to play with the new version!

BradyAJohnston commented 1 year ago

This should now additionally be improved inside of 2.2.1. There is a new selection node 'Ligands' which should basically lump everything inside of it which isn't in the residues dictionary (amino acids / nucleic acids).

So HOH for water, sugars, haeme groups etc should all end up being assigned a unique res_name which starts at 100 and increments from there. Just like with the chain selection nodes, the structure-specific node is then built when you create the node, and should allow for selections that you are after.
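The numbering scheme described above can be sketched as follows. This is a guess at the general idea, not the actual Molecular Nodes code: the function `assign_ligand_codes`, the stand-in `known` set, and the choice to number by order of first appearance are all assumptions.

```python
def assign_ligand_codes(res_names, known, start=100):
    """Give each residue name not in `known` a unique integer from `start`.

    Hypothetical helper: codes are handed out in order of first appearance.
    """
    codes = {}
    for name in res_names:
        if name not in known and name not in codes:
            codes[name] = start + len(codes)
    return codes


# Stand-in for the residues dictionary (amino acids / nucleic acids).
known = {"ALA", "ASN", "GLY"}
res_names = ["ALA", "NAG", "ASN", "HOH", "NAG", "BMA"]
codes = assign_ligand_codes(res_names, known)
```

Because the mapping depends on which residues a given structure contains, it has to be built per structure, which matches the structure-specific node being generated when the node is created.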

The way I've done it should hopefully consider each sugar to be separate, regardless of whether they are numbered or chained separately. I've tested with 6WTH and each sugar can be selected individually, even those which are bonded together.

BradyAJohnston / MolecularNodes

Improve Sugar and Other Ligand Distinctions for Selections #79