Dealing with large structures (mmCIF only): chain identifiers with >1 character and/or >100,000 atoms

josemduarte commented 9 years ago

In December 2014 (http://www.wwpdb.org/news/news_2014.html#18-September-2014), the split entries will be unified into single mmCIF files. For those entries, the chain identifiers will have more than 1 character.

This is the mapping to mmCIF dictionary terms of the chain identifiers used by owl:

pdbChainCode = auth_asym_id or pdb_strand_id (strictly 1 character, used in PDB files)
chainCode = asym_id (can have > 1 character, not present in PDB files, ligands get separate codes)

In eppic we use pdbChainCode, since it is the one identifier recognised by the structural biology community.

For the large structures that will be released as mmCIF only (see ftp://ftp.wwpdb.org/pub/pdb/data/large_structures/mmCIF/), BOTH identifiers will be breaking the 1 character limit. Thus in eppic we will need to handle pdbChainCodes of more than 1 character: we can easily change the database schema to allow this, but the PDB files will not be correctly written (chain identifiers will be truncated to 1 character).
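To make the truncation problem concrete, here is a minimal sketch (not eppic code; the helper names `pdb_chain_field` and `truncation_collisions` and the sample identifiers are made up for illustration). The PDB format reserves a single column (column 22) for the chain identifier, so a writer that truncates will merge every multi-character identifier sharing a first character into one chain:

```python
from collections import defaultdict


def pdb_chain_field(chain_id: str) -> str:
    """Return what fits in the single chain-ID column (column 22) of a PDB record.

    Hypothetical helper: it mimics a writer that simply truncates.
    """
    return chain_id[:1] if chain_id else " "


def truncation_collisions(chain_ids):
    """Group original identifiers by their truncated form and keep only the clashes."""
    groups = defaultdict(list)
    for cid in chain_ids:
        groups[pdb_chain_field(cid)].append(cid)
    return {short: ids for short, ids in groups.items() if len(ids) > 1}


if __name__ == "__main__":
    # Made-up identifiers of the kind found in large mmCIF-only entries.
    print(truncation_collisions(["A", "AA", "AB", "B", "CA"]))
```

Any non-empty result means the written PDB file would no longer distinguish those chains.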

Note a fundamental difference between the two identifiers: with pdbChainCode, ligands are included in the same chain as the proteins, while with chainCode, ligands have separate identifiers.
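The difference can be sketched with a toy example (the residue list below is invented, and `group_by` is a hypothetical helper, not part of owl or eppic). Grouping the same residues by each identifier shows ligands folded into the protein chains in one case and split out in the other:

```python
from collections import defaultdict

# Made-up residues: (residue name, asym_id, auth_asym_id)
RESIDUES = [
    ("ALA", "A", "A"),
    ("GLY", "A", "A"),
    ("HEM", "C", "A"),  # ligand: own asym_id, but shares auth_asym_id with the protein
    ("SER", "B", "B"),
    ("ZN", "D", "B"),   # ligand: own asym_id, but shares auth_asym_id with the protein
]


def group_by(residues, use_auth):
    """Group residue names by auth_asym_id (pdbChainCode) or asym_id (chainCode)."""
    groups = defaultdict(list)
    for name, asym_id, auth_asym_id in residues:
        groups[auth_asym_id if use_auth else asym_id].append(name)
    return dict(groups)


if __name__ == "__main__":
    print("pdbChainCode:", group_by(RESIDUES, use_auth=True))
    print("chainCode:   ", group_by(RESIDUES, use_auth=False))
```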

This relates to some BioJava issues: https://github.com/biojava/biojava/issues/173, https://github.com/biojava/biojava/issues/156

Fixing this should go together with the move to BioJava.

josemduarte commented 9 years ago

As a first fix for the data model, we should simply make sure that all fields storing chain identifiers in the db model are defined as varchar(4). This is done with the following setting in orm.xml:

<column length="4" />

josemduarte commented 9 years ago

There are two issues with writing PDB files for these structures:

chain ids with more than 1 character
more than 100,000 atoms, thus atom ids >100,000, which do not fit in the 5-digit atom serial field of the PDB format (see #29)
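Both limits can be checked before attempting to write a PDB file. The following is only a sketch under stated assumptions: `pdb_format_problems` is a hypothetical helper, and the limits come from the fixed-column PDB format (chain ID in column 22, atom serial in columns 7-11, so the largest serial is 99,999).

```python
MAX_PDB_CHAIN_ID_LENGTH = 1   # chain ID occupies column 22 only
MAX_PDB_ATOM_SERIAL = 99_999  # serial occupies columns 7-11 (5 digits)


def pdb_format_problems(chain_ids, n_atoms):
    """Return human-readable reasons why a structure cannot be written as PDB."""
    problems = []
    long_ids = sorted(c for c in set(chain_ids) if len(c) > MAX_PDB_CHAIN_ID_LENGTH)
    if long_ids:
        problems.append(
            f"chain ids longer than 1 character: {', '.join(long_ids)}"
        )
    if n_atoms > MAX_PDB_ATOM_SERIAL:
        problems.append(
            f"{n_atoms} atoms exceed the maximum serial {MAX_PDB_ATOM_SERIAL}"
        )
    return problems


if __name__ == "__main__":
    # Made-up numbers for a large, mmCIF-only style entry.
    for problem in pdb_format_problems(["A", "AA", "AB"], 150_000):
        print(problem)
```

An empty list means neither limit is hit; anything else suggests falling back to mmCIF output.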

josemduarte commented 7 years ago

I'm going to look into this. After the move to BioJava, some of the points above no longer apply, e.g. we now use mmCIF files as the output format.

A good test case is 4v9e
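For context on why mmCIF output avoids both problems: mmCIF values are whitespace-separated tokens rather than fixed-width columns, so multi-character chain ids and atom ids above 99,999 survive a write/read round trip. The sketch below is a deliberately simplified illustration, not a real mmCIF parser (it ignores quoting, multi-line values and every other category), and `write_atom_site` / `read_atom_site` are hypothetical helpers, not BioJava or eppic API:

```python
ATOM_SITE_COLUMNS = ["id", "label_atom_id", "label_asym_id", "auth_asym_id"]


def write_atom_site(rows):
    """Write a minimal _atom_site loop; values must not contain whitespace or quotes."""
    lines = ["loop_"]
    lines += [f"_atom_site.{col}" for col in ATOM_SITE_COLUMNS]
    lines += [" ".join(str(row[col]) for col in ATOM_SITE_COLUMNS) for row in rows]
    return "\n".join(lines) + "\n"


def read_atom_site(text):
    """Parse the loop written above back into dicts (all values come back as strings)."""
    columns, rows = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line or line == "loop_":
            continue
        if line.startswith("_atom_site."):
            columns.append(line[len("_atom_site."):])
        else:
            rows.append(dict(zip(columns, line.split())))
    return rows


if __name__ == "__main__":
    # Made-up atom with a 2-character chain id and a 6-digit atom id.
    atoms = [{"id": 100001, "label_atom_id": "CA",
              "label_asym_id": "AA", "auth_asym_id": "AA"}]
    print(read_atom_site(write_atom_site(atoms)))
```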

josemduarte commented 7 years ago

This should be solved now with commit 01ec45f. I added an integration test for it. What I haven't tested is uploading the entry to the db and checking that everything is OK in the web UI. But the data should now be correctly represented in the db.

josemduarte commented 7 years ago

I've now tested uploading to a new database and it seems that all data is correctly represented in the db. It also displays fine in the web UI. The only remaining problems I see are some minor issues with the handling of multi-letter chain ids by the molecular viewers PyMOL and NGL. I'll open a new issue for that.

eppic-team / eppic, issue #23