API2: chain names / residue numbering

sillitoe commented 6 years ago

Should the client control chain name(s) and residue numbering of the resulting model?

In the simplest case, the client would get simply named chains (A, B, C, ...) for each item in the input list and a residue numbering based on the provided target sequence (i.e. first letter = 1). Otherwise the model can have its chains renamed and renumbered as the client requests.

sillitoe commented 6 years ago

Hmmm...

I think it makes sense to use a common coordinate system to which we can all map our own data. I was going to suggest that was PDB, but if we're talking chain ids, then I guess it should be mmCIF.

We would probably use the PDBe SIFTS API for mapping.

https://www.ebi.ac.uk/pdbe/api/doc/sifts.html

It probably makes sense to use these lookups / fields as standard? (especially since PDBe is also on this grant).

https://www.ebi.ac.uk/pdbe/api/mappings/cath/1cbs

uses: struct_asym_id

gtauriello commented 6 years ago

Just to be clear: for the chain names and residue numbers I was referring to the ones for the produced model, which I assume should be related to the target sequence and not to the template sequence/structure. So in my opinion there is no PDB/SIFTS frame for the produced model; it's either the target sequence or something provided by the user (which may be based on the UniProt sequence from which the target sequence comes).

That being said: we could use the chain names from the template, yes. We cannot possibly use the template's residue numbering though, due to insertions/deletions.
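To illustrate the insertion/deletion problem, here is a minimal sketch that walks a pairwise alignment and records, for each target residue, its 1-n number and the template residue number it aligns to (if any). The function name, the toy sequences and the template start number are assumptions made up for this example, not part of any existing API.

```python
from typing import List, Optional, Tuple


def map_numbering(
    target_aln: str, template_aln: str, template_start: int = 1
) -> List[Tuple[int, str, Optional[int]]]:
    """Return (target_resnum, target_residue, template_resnum or None)
    for every residue of the target, given two gapped, aligned strings."""
    if len(target_aln) != len(template_aln):
        raise ValueError("aligned sequences must have equal length")
    mapping = []
    target_num = 0
    template_num = template_start - 1
    for t, s in zip(target_aln, template_aln):
        if s != "-":
            template_num += 1  # template residue consumed in this column
        if t == "-":
            continue  # deletion: template residue has no counterpart in the model
        target_num += 1
        # insertion: target residue has no template number to inherit
        mapping.append((target_num, t, template_num if s != "-" else None))
    return mapping


# Hypothetical alignment: the target has two residues inserted after "MK".
mapping = map_numbering("MKTAYIAK", "MK--YLAK", template_start=10)
for row in mapping:
    print(row)
```

The inserted residues (T and A here) have no template number, so the model could only reuse template numbering by inventing numbers or insertion codes, whereas numbering along the target sequence is always well defined.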

sillitoe commented 6 years ago

Thanks for the clarification. You're right - this mapping doesn't play any part in the resulting model.

I think it probably makes sense to just number the residues in the resulting model from 1-n?

It would be useful to make sure everything we know about the template, alignment, numbering, etc is included in the REMARK section of the model. That way, all the information on mapping back to the structure is available to the client without having to refer back to the original submission data.

gtauriello commented 6 years ago

Ok. The numbering 1-n is what we usually do anyways, so it sounds good to me.

We already put some information in the REMARK section of our models (see e.g. here) and we can extend this as needed. Our plan, though, is to move away from non-standard REMARK sections to a properly defined mmCIF file with data on how the modelling was performed. We have been working on that standardization for a while now and Bienchen is currently writing a prototype implementation to have this in SWISS-MODEL.

awaterho commented 6 years ago

I see the chain naming as very confusing to anyone using the API who bases their input on PDB chains, or even mmCIF names. It doesn't matter if it's PDB or mmCIF, as we rename everything in SMTL.

If you ask for chain G of 4xbg, you will get a model which is named chain B. If you ask for chain A of 4xbg, you will get a model named chain B. Ask for chain J, you will get chain B.

https://swissmodel.expasy.org/templates/4xbg.6

I wouldn't want anyone to have to further parse remarks out of a model to find out what the chain is.

gtauriello commented 6 years ago

@awaterho what you point out is very specific to how SWISS-MODEL currently names chains and it's really easy to change this. So in this discussion we should focus on what the user wants and not on what is currently implemented.

My original proposal here was to ignore the chain name in SMTL/PDB for the produced model and rename them according to the order of the user input. So the first entry in the list provided by the user results in a chain named 'A', the second entry is a chain named 'B' and so forth...

The proposed alternative is that the user suggests chain names for the resulting model and we name the model chains accordingly. We could have that as optional fields anyways and fall back to the A,B,C,... naming if chain names are not provided?
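A minimal sketch of that fallback, assuming the client sends one optional chain name per input entry (`None` meaning "not provided"). The function name and the choice to skip default letters already taken by the client are assumptions for illustration only.

```python
import string
from typing import List, Optional, Sequence


def assign_chain_names(requested: Sequence[Optional[str]]) -> List[str]:
    """Name model chains in input order: use the client's name if given,
    otherwise fall back to the next unused letter A, B, C, ..."""
    given = [name for name in requested if name is not None]
    if len(set(given)) != len(given):
        raise ValueError("requested chain names must be unique")
    # Default letters, skipping any the client has already claimed.
    defaults = (c for c in string.ascii_uppercase if c not in set(given))
    names = []
    for name in requested:
        if name is None:
            try:
                name = next(defaults)
            except StopIteration:
                raise ValueError("ran out of single-letter chain names") from None
        names.append(name)
    return names


print(assign_chain_names([None, None, None]))
print(assign_chain_names([None, "A", None]))
```

With no names provided this reduces to the plain A, B, C, ... scheme; mixing provided and missing names is the only case where a policy decision (here: skip taken letters) is needed.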

awaterho commented 6 years ago

If a user asks for 4xbg chain G, I would expect to receive chain G, that's all.

gtauriello commented 6 years ago

Ah so you would expect that the model chain names are the same as the chain names in the template? I didn't expect any naming in the template to be relevant for what the user expects in the model. Also we would run into problems if two parts of the same template chain are used as disconnected components (i.e. for disconnected domains). Then you would end up with two chains with the same name unless we treat this case somehow differently, but this might be even more confusing...

@sillitoe what makes most sense to you?

sillitoe commented 6 years ago

If API2 produces multiple ~~models~~ discrete structures (eg domains) and those ~~models~~ domains are going to be concatenated in one ~~file~~ model (PDB or mmCIF), then it probably does make sense to give each ~~model~~ structure a unique chain name ('A', 'B', ...).

If we only allow API2 to process a model with just one predicted domain structure at a time (ie we send each discrete component in separate requests rather than sending a list), then it might make more sense to reuse the chain name from the template.

I guess the latter might provide a simpler interface, but there might be more overhead.

gtauriello commented 6 years ago

Here I was referring to the #8 option where each entry in the input list translates to a disconnected component. My assumption is that all those components together make up one model, in which, for practical reasons, I would treat each disconnected component as a separate chain. I do not consider these chains to be multiple models since I assume that they belong together and are in contact with each other. I also renamed the other issues to clarify this (hopefully).

Some background info:

- If you have disconnected components that are in contact with each other (be it a disconnected domain or an oligomer), you wouldn't want to model the chains separately, since this would lead to suboptimal models which may clash with each other once you plug them together.
- If our resulting model produces multiple chains for disconnected domains, we cannot use the template chain name, as then we would have multiple chains with the same name (or at least I don't see a way around it).
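The name clash can be shown with a small sketch. The `Component` structure and the residue ranges are invented for illustration; only the template ID and chain name come from the example discussed earlier in this thread.

```python
from collections import Counter
from typing import List, NamedTuple


class Component(NamedTuple):
    """One disconnected component of the model (hypothetical structure)."""
    template_id: str
    template_chain: str
    start: int  # first template residue used (made-up values below)
    end: int    # last template residue used


def duplicated_template_chains(components: List[Component]) -> List[str]:
    """Template chain names that would appear more than once in the model
    if model chains simply inherited the template chain name."""
    counts = Counter(c.template_chain for c in components)
    return sorted(name for name, n in counts.items() if n > 1)


# Two segments of the same template chain modelled as separate components.
components = [
    Component("4xbg", "G", 1, 80),
    Component("4xbg", "G", 150, 230),
]
print(duplicated_template_chains(components))

# Naming by input order instead is unique by construction.
order_based = [chr(ord("A") + i) for i in range(len(components))]
print(order_based)
```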

sillitoe commented 6 years ago

Thanks for the clarification, I think I was already on the same page for once - though my terminology was confusing (corrected above).

Apologies in advance - I think you've already answered the following question when I visited. When you produce a model with different chains, how confidently are you able to predict how those components interact / pack together?

I can see that it would be very useful to produce a final model that ensures these components do not clash/overlap in 3D space. However, I assumed that predicting binding interfaces accurately was a really difficult problem - ie there would be much less confidence in getting this bit right, compared to how accurately you are able to predict the 3D structure of each component. Do you calculate multiple possible solutions for the packing arrangements? If so, does each solution have an associated confidence?

sillitoe commented 6 years ago

Regarding the original question of chain names. I agree that if the produced model contains multiple structures then it probably makes sense to name these 'A', 'B', ... and use numbering (1-n) by default.

It might be useful to give the client the option to override the default chain name and default numbering in the produced model (ie start from a number other than 1). If the incoming data is a list of FASTA sequences, then we would have to decide how to add metadata (possibly worth getting the client to send this as a data structure rather than FASTA).
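As a rough sketch of what "a data structure rather than FASTA" could look like: one JSON object per input entry, with the sequence required and the overrides optional. All field names (`sequence`, `chain_name`, `first_residue_number`) are hypothetical and not an agreed API.

```python
import json
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TargetEntry:
    """One entry of the input list (hypothetical field names)."""
    sequence: str
    chain_name: Optional[str] = None   # None: server picks A, B, C, ...
    first_residue_number: int = 1      # default numbering is 1-n

    def residue_numbers(self) -> List[int]:
        """Residue numbers the model chain would carry."""
        start = self.first_residue_number
        return list(range(start, start + len(self.sequence)))


def parse_request(payload: str) -> List[TargetEntry]:
    """Parse a JSON list of entries; unknown keys raise a TypeError."""
    return [TargetEntry(**item) for item in json.loads(payload)]


payload = """[
  {"sequence": "MKTAYIAK"},
  {"sequence": "GSHM", "chain_name": "X", "first_residue_number": 101}
]"""
entries = parse_request(payload)
for entry in entries:
    print(entry.chain_name, entry.residue_numbers())
```

Entries that omit the optional fields get the default behaviour, so a client that only cares about sequences does not need to know the overrides exist.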

gtauriello commented 6 years ago

> When you produce a model with different chains, how confidently are you able to predict how those components interact / pack together? Do you calculate multiple possible solutions for the packing arrangements? If so, does each solution have an associated confidence?

Once a template structure is chosen (which is up to the API-user in this case), we simply use the packing of the template structure. We do not try to repack it in any different arrangement. I think that for your case of domain homologues, this should be exactly what you need. I would have expected that the relative orientation of the single parts of a discontinuous domain should be preserved within a FunFam, right?

For the more general case of predicting the relative orientation of independent domains, I don't know if people have actively looked at it, but I would expect to see the same as what people have observed for oligomers. Basically what the CAPRI benchmarking results showed was that if you have template information, you perform best by using that information (instead of repacking) and you get rather good results. It all depends on the availability of templates and (if you find many templates) on the ability to choose the best one (we do a fairly good job there, as Bertoni-2017 has shown). Either way, this is out of the scope of this API, since here we will assume that the API-user chooses the desired relative orientation by choosing the appropriate template structure.

For the original question: all good then. Let's have some default naming/numbering and provide optional fields in the API input to override it.

CATH-SWISSMODEL / cath-swissmodel-api

API2: chain names / residue numbering #9