MMCIF reader parsing mistakes, some keys are missing / corrupted

marcom commented 4 years ago

While running the MMCIF reader over the whole PDB archive, my run aborted on PDB entry 1JUF. The file appears to be parsed incorrectly: some keys are missing and some data values end up being interpreted as keys. There may be more PDB entries with parsing problems; I'll try to collect all the failures I can find over the next few days.

Perhaps it would be good to have a more detailed test that also parses the whole PDB with the Biopython MMCIF reader (or a reference mmCIF parser, if there is one) and checks that the generated dictionaries are exactly the same.

Steps to Reproduce

using BioStructures
cif = MMCIFDict("1JUF.cif")
cif["_entry.id"]            # works
cif["_exptl.method"]        # fails with missing key, but grep shows it's there
keys(cif)                   # some keys look like data values

I tried deleting sections and copying them into a new CIF file, but haven't been able to isolate the problem yet.
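
One way to narrow this down is to list the keys that cannot be mmCIF data names. The following is a rough sketch on a toy dictionary, not real parser output. `suspicious_keys` is a hypothetical helper (not part of BioStructures), and exempting `data_` assumes the block name is stored under that key, as Biopython does.

```julia
# Hypothetical helper: mmCIF data names start with "_", so any other key
# (apart from the assumed "data_" block-name key) is probably a value
# that was read as a key.
function suspicious_keys(ks)
    return sort([k for k in ks if !startswith(k, "_") && k != "data_"])
end

# Toy stand-in for a garbled dictionary; the entries are made up.
toy = Dict(
    "data_" => ["1JUF"],
    "_entry.id" => ["1JUF"],
    "jimmunol.168.1.283" => ["placeholder"],
)

println(suspicious_keys(keys(toy)))
```

On a real file the same call would presumably be `suspicious_keys(keys(cif))`.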

Your Environment

BioStructures v0.11.0
Julia 1.5.1
Linux, openSUSE 15.0

jgreener64 commented 4 years ago

Thanks for reporting. The problem is line 65 of 1JUF:

_citation.pdbx_database_id_DOI      10.4049/&#8203;jimmunol.168.1.283

The value contains a #, but that is allowed in unquoted strings provided the # is not preceded by whitespace.

We stop reading the line as soon as # is reached (https://github.com/BioJulia/BioStructures.jl/blob/master/src/mmcif.jl#L107) so that token is not read and all following parsing is garbled.
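
The difference between the two rules can be sketched as follows. This is an illustration only, not the code in mmcif.jl: both function names are hypothetical, and quoted strings and multi-line values are ignored for brevity.

```julia
# Naive rule: cut the line at the first '#', wherever it appears.
strip_comment_naive(line) = first(split(line, '#'; limit=2))

# Stricter rule: '#' opens a comment only at the start of the line or
# directly after whitespace, so a '#' inside an unquoted value is kept.
function strip_comment(line::AbstractString)
    prev = ' '  # treat the start of the line like preceding whitespace
    for (i, c) in pairs(line)
        if c == '#' && isspace(prev)
            return line[1:prevind(line, i)]
        end
        prev = c
    end
    return line
end

doi_line = "_citation.pdbx_database_id_DOI      10.4049/&#8203;jimmunol.168.1.283"
println(strip_comment_naive(doi_line))  # value is cut off after "&"
println(strip_comment(doi_line))        # line is kept whole
println(strip_comment("_entry.id 1JUF # trailing comment"))
```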

I think it's a one-line fix, which I can push tomorrow and tag a release. This problem may affect other mmCIF dictionaries, though it shouldn't affect any structures read in, since everything except the atom entries is ignored in that case.

jgreener64 commented 4 years ago

This is now fixed in v0.11.1.

I thought the issue looked familiar, and in fact dug out where it was fixed in Biopython: https://github.com/biopython/biopython/pull/1615.

I just made a script to check BioStructures mmCIF dictionaries against Biopython's: https://github.com/BioJulia/BioStructures.jl/blob/master/test/mmcif_biopython.jl. It's running now so I'll investigate any more discrepancies.
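
The core of such a comparison can be sketched in plain Julia. This is not the linked script: `dict_discrepancies` is a hypothetical helper, and the two toy dictionaries stand in for the output of the two parsers.

```julia
# Hypothetical helper: report keys present on only one side and keys
# whose values differ. Works on anything supporting keys() and indexing.
function dict_discrepancies(a, b)
    ka, kb = Set(keys(a)), Set(keys(b))
    only_a = sort(collect(setdiff(ka, kb)))
    only_b = sort(collect(setdiff(kb, ka)))
    differing = sort([k for k in intersect(ka, kb) if a[k] != b[k]])
    return (only_a = only_a, only_b = only_b, differing = differing)
end

# Toy dictionaries with made-up values, not real file contents.
reference = Dict("_entry.id" => ["1JUF"], "_exptl.method" => ["A"], "_cell.length_a" => ["1.0"])
candidate = Dict("_entry.id" => ["1JUF"], "jimmunol.168.1.283" => ["B"], "_cell.length_a" => ["2.0"])

report = dict_discrepancies(reference, candidate)
println(report)
```

An empty report on all three fields would mean the two dictionaries agree exactly for that entry.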

marcom commented 4 years ago

Many thanks for the quick fix!

BioJulia / BioStructures.jl, issue #23
