Chain id issue when using cath-superpose to orient whole PDBs

nataliedawson commented 6 years ago

Hi Tony,

I hope all's well. As suggested, I've been working on using cath-superpose to consistently orient whole, chain, and domain PDB files. The orientation of a few examples I've used for these 3 data types looks great.

One issue I've noticed is with the cath-superpose output for whole PDBs that have multiple protein chains. It appears that the output PDB file assigns every atom to chain A, whatever its original chain id.

For example:

PDB 3ccu is a very large structure with numerous protein chains and nucleic acid chains. Running this command, I can see that the resulting temppdb file now has only a single chain id (A), with a single 'TER' record after all the protein and nucleic acid chains (and before all the HETATM lines):

cat <cath_current_dir>/wholepdb/3ccu | cath-superpose --pdbs-from-stdin --sup-to-stdout > /tmp/3ccu.temppdb

Please find the PDB files attached for before and after running cath-superpose. 3ccu.zip
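One quick way to confirm the collapse is to count coordinate records per chain id in the input and output files. The following is a minimal sketch and not part of cath-tools: the function name chain_id_counts is made up for illustration, and it assumes standard fixed-column PDB records, where the chain id sits in column 22 of each ATOM/HETATM line.

```python
import sys
from collections import Counter


def chain_id_counts(pdb_lines):
    """Count ATOM/HETATM records per chain id.

    Assumes fixed-column PDB format: the chain id is column 22,
    i.e. index 21 of the line.
    """
    counts = Counter()
    for line in pdb_lines:
        if line.startswith(("ATOM  ", "HETATM")) and len(line) > 21:
            counts[line[21]] += 1
    return counts


if __name__ == "__main__":
    # Report the chain ids found in each file named on the command line.
    for path in sys.argv[1:]:
        with open(path) as pdb_file:
            counts = chain_id_counts(pdb_file)
        print(path, dict(sorted(counts.items())))
```

Running the script on the original 3ccu file and on /tmp/3ccu.temppdb should list many chain ids for the first and only A for the second, if the behaviour is as described above.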

I believe I'm using the latest version of cath-superpose:

cath-superpose -v
============
cath-superpose v0.16.2-0-ga9f860c [2018-01-09]
============

Superpose protein structures using an existing alignment

Build
-----
   Jan  9 2018 21:00:36
   GNU C++ version 4.9.2 20150212 (Red Hat 4.9.2-6)
   GNU libstdc++ version 20150212
   Boost 1_60

Thanks!

tonyelewis commented 6 years ago

Thanks for this. At present, this chain-code-munging is actually the expected behaviour for the --sup-to-pdb-file and --sup-to-stdout options:

--sup-to-pdb-file arg      Write the superposed structures to a single PDB file arg, separated using faked chain codes
--sup-to-stdout            Print the superposed structures to stdout, separated using faked chain codes

The motivation is to separate out the multiple PDBs in multiple-PDB superpositions because otherwise chains from different PDBs get treated as the same chain and things go bad.

As things stand, the way to avoid this is to use --sup-to-pdb-files-dir to write the file(s) to a directory instead:

--sup-to-pdb-files-dir arg Write the superposed structures to separate PDB files in directory arg

That should work but is slightly more fiddly.
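For scripted use, the directory-based route could be wrapped along these lines. This is a sketch under assumptions, not a tested recipe: it uses only the --pdbs-from-stdin and --sup-to-pdb-files-dir options mentioned in this thread, assumes they can be combined in the same way as in the original command, and makes no assumption about the name of the file written into the directory (it just lists whatever appears). The function names superpose_command and orient_pdb are made up for illustration.

```python
import subprocess
from pathlib import Path


def superpose_command(out_dir):
    """Build the cath-superpose argument list (options as quoted in this thread)."""
    return [
        "cath-superpose",
        "--pdbs-from-stdin",
        "--sup-to-pdb-files-dir",
        str(out_dir),
    ]


def orient_pdb(in_pdb, out_dir):
    """Feed in_pdb to cath-superpose on stdin and return the files found in out_dir.

    The output file name is not stated in the thread, so the directory
    contents are returned rather than a guessed path.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    with open(in_pdb, "rb") as pdb_in:
        subprocess.run(superpose_command(out_dir), stdin=pdb_in, check=True)
    return sorted(p for p in out_dir.iterdir() if p.is_file())
```

Using a fresh directory per input keeps it unambiguous which output file belongs to which structure.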

But should the options be changed?

One option would be to turn off the chain-code-munging when there's only one input. But I think that might be really unhelpfully inconsistent for anyone using --sup-to-pdb-file / --sup-to-stdout on datasets that include both single-entry and multiple-entry examples.

Alternative possible options I could add:

don't munge chain codes
don't munge chain codes because I know there'll only be one entry; barf if there's more than one
write the (non-chain-code-munged) single entry to file X; barf if there's more than one

...but I'm always conscious of making the usage too complicated through lots of options for lots of special cases.

Any thoughts @nataliedawson, @sillitoe ?

nataliedawson commented 6 years ago

@tonyelewis Thanks for this and sorry for missing that expected behaviour.

I've tried using --sup-to-pdb-files-dir and that has indeed provided a file with all the different chains, thanks!

Now that I know about this option, I don't think you need to make any changes as this works well for what I need to do.

tonyelewis commented 6 years ago

> @tonyelewis Thanks for this and sorry for missing that expected behaviour.

NP. Please don't apologise - this is a wrinkle in the interface and it's very helpful when users take the time to let me know about the sorts of issues they're encountering.

I'll leave this issue alone for now (unless I hear more).

For future reference, another alternative I could consider is a (multiply-specifiable) option like:

write the structure for ID X to file Y

This might look something like:

cath-superpose [...] --pdb-for-id-to-file 1cukA01:~/1cukA01.pdb --pdb-for-id-to-file 1bvsA01:~/1bvsA01.pdb

This has the benefits that:

it allows users with single-structure cases to get their single PDB file without having to generate a directory containing the one file
it can be consistent between single-structure and multiple-structure cases

(Maybe it could allow a special syntax like --pdb-for-id-to-file _only_id_:~/1cukA01.pdb that means use the one and only ID or barf if there isn't exactly one.)
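To make the idea concrete, here is a minimal Python sketch of how such ID:FILE values might be parsed and resolved. It is purely illustrative: --pdb-for-id-to-file is only a proposal in this thread, not an existing cath-superpose option, and the names parse_id_to_file and resolve_outputs are hypothetical. The sketch splits on the first colon only, which assumes IDs never contain a colon.

```python
ONLY_ID = "_only_id_"  # special ID from the proposal above


def parse_id_to_file(spec):
    """Split 'ID:PATH' on the first colon; reject malformed values."""
    id_part, sep, path = spec.partition(":")
    if not sep or not id_part or not path:
        raise ValueError(f"expected ID:FILE, got {spec!r}")
    return id_part, path


def resolve_outputs(specs, ids):
    """Map each requested structure ID to its output path.

    ONLY_ID stands for the single loaded ID and is an error
    unless exactly one structure is present.
    """
    outputs = {}
    for spec in specs:
        wanted, path = parse_id_to_file(spec)
        if wanted == ONLY_ID:
            if len(ids) != 1:
                raise ValueError(f"{ONLY_ID} needs exactly one structure, got {len(ids)}")
            wanted = ids[0]
        elif wanted not in ids:
            raise ValueError(f"unknown structure ID {wanted!r}")
        outputs[wanted] = path
    return outputs
```

Because the same function handles one or many specs, single-structure and multiple-structure runs would behave consistently, which is the main attraction of the proposal.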

sillitoe commented 6 years ago

Sounds like using --sup-to-pdb-files-dir is the path of least resistance.

In which case, this possibly comes down to documentation.

How about a loud warning if it's obvious that munging chain ids is going to remove structures from the output (possibly pointing towards the --sup-to-pdb-files-dir option)?

I guess a "proper" solution would probably be to use mmCIF files ... :)

tonyelewis commented 6 years ago

> How about a loud warning if it's obvious that munging chain ids is going to remove structures from the output (possibly pointing towards the --sup-to-pdb-files-dir option)?

Yes. The chain-code-munging is a hack (inherited from the old rasmol-script-generating Perl scripts of yore) that made a lot of sense for single-chain structures but that is now particularly ill-suited to the multi-chain structures we're increasingly bothering to deal with. I agree: if it encounters any multi-chain structures when it's chain-code-munging, it should at least warn, but probably barf (with refs to other options).

I'll open a new issue and reference this one.

> I guess a "proper" solution would probably be to use mmCIF files ... :)

Sorry? What was that dear? I can't hear you? :hear_no_evil: LA LA LA LA LA LA LA LA

tonyelewis commented 6 years ago

I've just pushed 2e25046ceae26132fbc030399f62b3da8adf0aef, which should ensure you now see a sensible warning if you try to use these options with a multi-chain PDB.

I've gone for warning rather than barfing, in part because I'm not 100% sure whether some of the Genome3D superpositions might be affected. @sillitoe, does the Genome3D call to cath-superpose use --sup-to-pdb-file or --sup-to-stdout?

UCLOrengoGroup / cath-tools

Chain id issue when using cath-superpose to orient whole PDBs #68