Closed josemduarte closed 7 years ago
As a first fix for the data model, we should simply make sure that all fields that store chain identifiers in the db model are defined as varchar(4). Done with this setting in `orm.xml`: `<column length="4" />`
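A minimal sketch, in Java, of a guard that could be run before persisting a chain identifier, assuming the 4-character limit from the varchar(4) setting above. The class and method names (`ChainIdColumn`, `fitsColumn`) are hypothetical and are not part of eppic:

```java
/** Hypothetical helper, not part of eppic: checks a chain identifier against the varchar(4) column. */
final class ChainIdColumn {

    // Assumption: mirrors the length="4" column setting in orm.xml.
    static final int MAX_LENGTH = 4;

    /** True if the identifier is non-empty and fits into the column without truncation. */
    static boolean fitsColumn(String chainId) {
        return chainId != null && !chainId.isEmpty() && chainId.length() <= MAX_LENGTH;
    }

    public static void main(String[] args) {
        for (String id : new String[] {"A", "AA", "ABCD", "ABCDE"}) {
            System.out.println(id + " fits: " + fitsColumn(id));
        }
    }
}
```

Whether such a check belongs in the model classes or in the upload step is a design choice this sketch leaves open.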
There are 2 issues with writing PDB files for these structures:
I'm going to have a look into this. After the move to BioJava some of the stuff above doesn't really apply anymore, e.g. we now use mmCIF files as the output format.
A good test case is 4v9e
This should be solved now with commit 01ec45f. I added an integration test for it. What I haven't tested is uploading the entry to db and checking if all is ok in web ui. But the data should be correctly represented in db now.
I've now tested uploading to a new database and it seems that all data is correctly represented in the db. It also shows fine in web ui. The only problems I see now are some minor issues with the handling of multi-letter chains by molecular viewers: PyMOL and NGL. I'll add a new issue for that.
In December 2014 (http://www.wwpdb.org/news/news_2014.html#18-September-2014), the split entries will be unified into single mmCIF files. For those entries, the chain identifiers will have more than 1 character.
This is the mapping to mmCIF dictionary terms of the chain identifiers used by owl:

- `pdbChainCode` = `auth_asym_id` or `pdb_strand_id` (strictly 1 character, used in PDB files)
- `chainCode` = `asym_id` (can have > 1 character, not present in PDB files, ligands get separate codes)

In eppic we use `pdbChainCode`, since it is the one identifier recognised by the structural biology community.

For the large structures that will be released as mmCIF only (see ftp://ftp.wwpdb.org/pub/pdb/data/large_structures/mmCIF/), BOTH identifiers will be breaking the 1 character limit. Thus in eppic we will need to handle `pdbChainCode`s of more than 1 character: we can easily change the database schema to allow this, but the PDB files will not be correctly written (chain identifiers will be truncated to 1 character).

Note a fundamental difference between the two identifiers: in `pdbChainCode` ligands are included in the same chain as proteins, while in `chainCode` ligands have separate identifiers.

This relates to some BioJava issues: https://github.com/biojava/biojava/issues/173, https://github.com/biojava/biojava/issues/156
Fixing this should go together with the move to BioJava.
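
The PDB-file truncation mentioned above comes from the fixed-column PDB format, which reserves a single column for the chain identifier in ATOM/HETATM records. A minimal sketch in Java of detecting identifiers that could not be written without truncation; `PdbChainIdCheck` and its methods are hypothetical names for illustration, not eppic or BioJava API:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: flags chain identifiers that do not fit the PDB format. */
final class PdbChainIdCheck {

    /** True if the identifier fits the single chain-ID column of a PDB record. */
    static boolean writableToPdb(String chainId) {
        return chainId != null && chainId.length() == 1;
    }

    /** Returns the identifiers that would be truncated if written to a PDB file. */
    static List<String> truncatedInPdb(List<String> chainIds) {
        List<String> result = new ArrayList<>();
        for (String id : chainIds) {
            if (!writableToPdb(id)) {
                result.add(id);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Example identifiers only; not taken from a real entry.
        System.out.println(truncatedInPdb(Arrays.asList("A", "B", "AA", "AB")));
    }
}
```

Note that truncating to the first character would also make distinct identifiers such as `AA` and `AB` collide, which is one more reason to prefer mmCIF output for these entries.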