Erroneous merging of contact and mmcif data

stuartmac commented 7 years ago

I noticed that TableMerger.merge() was giving waters and other HETATMs UniProt IDs and ResNums.

Eventually I tracked the problem to contacts_mmcif_table_merger:

cmtm_test = merger.contacts_mmcif_table_merger(contacts_table, pdbx_table, 'A')
cmtm_test.query('auth_seq_id_full_A == "181"')[['ATOM_A',
                                                'auth_seq_id_full_A',
                                                'label_asym_id_A',
                                                'label_comp_id_A']].drop_duplicates()

ATOM_A	auth_seq_id_full_A	label_asym_id_A	label_comp_id_A
HOH	181	A	THR
GLU	181	C	THR

The mmcif table is merged to contacts via 'new_seq_id' and 'new_asym_id'. These keys look wrong:

cmtm_mmcif_keys = ['auth_seq_id_full', 'label_asym_id', 'new_seq_id', 'new_asym_id']
pdbx_table.query('auth_seq_id_full == "181"')[cmtm_mmcif_keys]

auth_seq_id_full	label_asym_id	new_seq_id	new_asym_id
181	C	241	A
181	A	1	A

biomadeira commented 7 years ago

@stuartmac I think I understand the problem!

contacts_mmcif_table_merger is set to merge on new_seq_id and new_asym_id if these columns are available. This approach would have worked well if arpeggio had been run on a reformatted PDB file, i.e. one written with PDBXwriter.run(pro_format=True) from the original mmCIF/PDB.

Basically, the idea was that it would be possible to merge contacts to mmCIF on these new_seq_id and new_asym_id labels (numbered from res 1 to 9999 in chain A, then starting over again in chain B, and so on).
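To make the renumbering idea concrete, here is a minimal sketch in plain Python. It is not the PDBXwriter implementation: the function name `renumber` and the input format (tuples of label_asym_id and auth_seq_id_full) are assumptions made for illustration only.

```python
import string

MAX_RES = 9999  # highest residue number per chain, as described above


def renumber(residues, max_res=MAX_RES):
    """Map each unique residue to a sequential (new_asym_id, new_seq_id).

    `residues` is an iterable of hashable residue identifiers in file
    order, e.g. (label_asym_id, auth_seq_id_full) tuples. Hypothetical
    helper, not part of ProIntVar.
    """
    chains = string.ascii_uppercase  # sketch only: supports at most 26 chains
    mapping = {}
    for res in residues:
        if res in mapping:
            continue  # atoms of an already-seen residue keep its new ids
        index = len(mapping)
        new_asym_id = chains[index // max_res]
        new_seq_id = index % max_res + 1
        mapping[res] = (new_asym_id, new_seq_id)
    return mapping


# Two residues share the number "181" but sit on different chains;
# after renumbering they get distinct keys.
example = renumber([("A", "181"), ("A", "181"), ("C", "181")])
```

Because every residue gets a unique (new_asym_id, new_seq_id) pair, these keys are only safe to merge on when the contacts were computed from a file written with the same numbering.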

After this merging, we can still use the real auth/label seq and asym ids to work out contact types, domain-domain, intra/inter-chain, etc. This assumes interactions in arpeggio are agnostic of inter/intra-chain contacts, which I think they are (i.e. it does not make a difference whether a res-res contact occurs within the same chain or not).

I need to add more docs on how to use this and other approaches...

stuartmac commented 7 years ago

@biomadeira OK, so for now I'm using it by forcing it to always use auth_seq_id_full and label_asym_id.

Next problem is that I'm losing most ligands. I have tracked this down to the fact that 1) they don't have a SIFTS entry and 2) their label_asym_id and auth_asym_id are different...

table_merger does left joins in order of mmCIF > SIFTS > contacts so we end up with a merged table where only the mmCIF record is preserved, i.e. we lose the contacts.
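A toy pandas example of how a left join on mismatched chain ids drops contacts. All values and the contacts column names `RES_A` and `CHAIN_A` are invented for illustration; the only assumption carried over from the discussion is that the contacts were computed with auth chain ids while the merge uses label_asym_id.

```python
import pandas as pd

# Toy mmCIF table: the ligand has its own label chain ("D")
# but shares the author chain ("A") with the protein.
mmcif = pd.DataFrame({
    "label_comp_id": ["THR", "LIG"],
    "auth_seq_id_full": ["181", "301"],
    "label_asym_id": ["A", "D"],
    "auth_asym_id": ["A", "A"],
})

# Toy contacts table, assumed to carry the author chain id.
contacts = pd.DataFrame({
    "RES_A": ["181", "301"],
    "CHAIN_A": ["A", "A"],
    "contact_type": ["hbond", "vdw"],
})


def left_join(chain_col):
    """Left-join contacts onto mmCIF using the given chain id column."""
    return mmcif.merge(
        contacts,
        how="left",
        left_on=["auth_seq_id_full", chain_col],
        right_on=["RES_A", "CHAIN_A"],
        indicator=True,  # adds a _merge column: 'both' or 'left_only'
    )


by_label = left_join("label_asym_id")  # ligand row finds no contact
by_auth = left_join("auth_asym_id")    # ligand row keeps its contact

print(by_label[["label_comp_id", "contact_type", "_merge"]])
print(by_auth[["label_comp_id", "contact_type", "_merge"]])
```

The `_merge` column makes the loss visible: rows marked `left_only` kept their mmCIF record but have no contact data attached.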

I proposed switching the default merge key for contacts to auth_asym_id, but your own investigation tracked it down to write_pdb_from_table using different category defaults depending on where it was invoked: sometimes label, sometimes auth.

Maybe we need more consistent defaults, and should also allow the merge keys to be set via arguments, using the currently hard-coded keys as defaults...
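One possible shape for this, as a rough sketch only. The function `merge_contacts`, its parameters, and the contacts column names `RES_A` and `CHAIN_A` are hypothetical and not the current ProIntVar API; the default keys mirror the new_seq_id / new_asym_id behaviour described earlier in the thread.

```python
import pandas as pd

# Defaults mirror the keys discussed in this issue.
DEFAULT_MMCIF_KEYS = ("new_seq_id", "new_asym_id")


def merge_contacts(contacts, mmcif, contact_keys,
                   mmcif_keys=DEFAULT_MMCIF_KEYS, how="left"):
    """Merge contacts with mmCIF on caller-selectable key columns.

    Hypothetical helper: raises early if a requested key is missing,
    instead of silently falling back to another pair of columns.
    """
    missing = [k for k in mmcif_keys if k not in mmcif.columns]
    if missing:
        raise KeyError(f"mmCIF table lacks merge keys: {missing}")
    return contacts.merge(
        mmcif,
        how=how,
        left_on=list(contact_keys),
        right_on=list(mmcif_keys),
    )


# Usage with explicit keys, overriding the defaults (toy data).
mmcif = pd.DataFrame({
    "auth_seq_id_full": ["181", "181"],
    "label_asym_id": ["A", "C"],
    "label_comp_id": ["HOH", "GLU"],
})
contacts = pd.DataFrame({"RES_A": ["181"], "CHAIN_A": ["C"]})

merged = merge_contacts(
    contacts, mmcif,
    contact_keys=("RES_A", "CHAIN_A"),
    mmcif_keys=("auth_seq_id_full", "label_asym_id"),
)
```

Passing the keys explicitly keeps the choice of label versus auth ids visible at the call site, while callers who rely on the renumbered ids can omit `mmcif_keys`.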

bartongroup / ProIntVar

Erroneous merging of contact and mmcif data #12