douweschulte / pdbtbx

A library to open/edit/save (crystallographic) Protein Data Bank (PDB) and mmCIF files in Rust.
https://crates.io/crates/pdbtbx
MIT License

mmCIF parser has column requirements not in line with mmCIF specification #93

Closed: tzok closed this issue 2 years ago

tzok commented 2 years ago

The list of columns required by the parser is the following:

The mmCIF specification for atom_site mentions only these as required:


In particular, I have a problem with atom_site.pdbx_formal_charge. According to the docs it is used in only about 7.4% of the entries in the PDB. Making it a strict requirement in pdbtbx is incorrect IMHO.

Fixing pdbx_formal_charge is essential to me. Making the parser more robust in general by complying with the mmCIF specification is a good thing in the long term anyway.
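To illustrate what treating the column as optional could look like, here is a minimal sketch. It does not use pdbtbx's actual code: `column_index` and `formal_charge` are hypothetical helpers, the header and row are assumed to be plain slices of strings, and falling back to a charge of 0 is an assumption about what a sensible default would be. The mmCIF placeholders `?` (unknown) and `.` (not applicable) are treated the same way as a missing column.

```rust
/// Hypothetical helper (not part of pdbtbx): find a column by name in a loop header.
fn column_index(header: &[&str], name: &str) -> Option<usize> {
    header.iter().position(|c| *c == name)
}

/// Read the formal charge for one row.
/// Assumption: a missing column and the mmCIF placeholders `?` and `.`
/// all mean "no charge given" and map to 0.
fn formal_charge(header: &[&str], row: &[&str]) -> Result<i8, String> {
    let Some(idx) = column_index(header, "_atom_site.pdbx_formal_charge") else {
        return Ok(0); // the column is absent, so fall back to the default
    };
    match row.get(idx).copied() {
        // The header promises a value that the row does not have.
        None => Err(format!("row has no value for column {idx}")),
        Some("?") | Some(".") => Ok(0),
        Some(text) => text
            .parse::<i8>()
            .map_err(|e| format!("invalid formal charge '{text}': {e}")),
    }
}
```

With this shape, only a value that is present but malformed produces an error; a file that omits the column entirely still parses.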

douweschulte commented 2 years ago

I do agree. I will work on removing the requirements and figuring out sensible ways of dealing with missing data.

douweschulte commented 2 years ago

I removed the requirements on the following columns:

The other columns are (according to your link) always present, although I will think about relaxing the requirements further while refactoring the code some more.

douweschulte commented 2 years ago

I additionally removed the requirement for the following columns:

This leaves the Cartn columns as the only columns that are required on top of the mmCIF requirements. Those are (according to your link) present in 100% of the files, and it is very sensible to require them if you want to use this library. In the future this requirement could be replaced by requiring any position, fract or Cartn, but that can wait.
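The possible future relaxation to "any position, fract or Cartn" could be sketched roughly as follows. This is only an illustration under stated assumptions, not pdbtbx code: the `Position` enum and the `triple` and `position` functions are hypothetical, a row is assumed to be a map from mmCIF data names to raw string values, and the conversion from fractional to Cartesian coordinates (which needs the unit cell) is left out.

```rust
use std::collections::HashMap;

/// Hypothetical type (not part of pdbtbx): records which kind of position was found.
#[derive(Debug, PartialEq)]
enum Position {
    Cartesian([f64; 3]),
    Fractional([f64; 3]),
}

/// Parse the columns `<prefix>x`, `<prefix>y`, `<prefix>z` from one row.
/// Returns None if any of them is missing or is not a number (e.g. `?`).
fn triple(row: &HashMap<&str, &str>, prefix: &str) -> Option<[f64; 3]> {
    let mut out = [0.0; 3];
    for (slot, axis) in out.iter_mut().zip(["x", "y", "z"]) {
        let key = format!("{prefix}{axis}");
        *slot = row.get(key.as_str())?.parse::<f64>().ok()?;
    }
    Some(out)
}

/// Prefer Cartn coordinates, fall back to fract, and fail only if both are unusable.
fn position(row: &HashMap<&str, &str>) -> Result<Position, String> {
    if let Some(xyz) = triple(row, "_atom_site.Cartn_") {
        Ok(Position::Cartesian(xyz))
    } else if let Some(abc) = triple(row, "_atom_site.fract_") {
        Ok(Position::Fractional(abc))
    } else {
        Err("atom has neither Cartn nor fract coordinates".to_string())
    }
}
```

Keeping the two cases apart in an enum leaves the decision of when to convert fractional coordinates to the caller, who has the unit cell at hand.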

Thanks for the issue. It gave me a push to do some nice refactoring in the parse_atoms function, and you pointed me to a nice docs site that I had never found before.