douweschulte / pdbtbx

A library to open/edit/save (crystallographic) Protein Data Bank (PDB) and mmCIF files in Rust.
https://crates.io/crates/pdbtbx
MIT License

Unable to load poor-quality PDBs #97

Closed OWissett closed 1 year ago

OWissett commented 1 year ago

I am attempting to load a PDB file (5XH3) to do some testing.

On loading, I get:

thread 'structure::tests::from_pdbtbx' panicked at 'called `Result::unwrap()` on an `Err` value: [StrictWarning: Sequence Difference Database not found
    ╷
347 │ SEQADV 5XH3 GLY A  103  UNP  A0A0K8P6T ARG   132 ENGINEERED MUTATION
    ╵
For this sequence difference (chain: A) the corresponding database definition (DBREF) was not found, make sure the DBREF is located before the SEQADV
, StrictWarning: Sequence Difference Database not found
    ╷
348 │ SEQADV 5XH3 ALA A  131  UNP  A0A0K8P6T SER   160 ENGINEERED MUTATION
    ╵
For this sequence difference (chain: A) the corresponding database definition (DBREF) was not found, make sure the DBREF is located before the SEQADV
, StrictWarning: MASTER checksum failed
     ╷
4863 │ MASTER      313    0    4   10    9    0    8    6 2187    1   36   21
     ╵
The number of Atoms (2207) is different then posed in the MASTER Record (2187)
, LooseWarning: SEQRES inconsistent residues
    ╷
349 │ SEQRES   1 A  261  ASN PRO TYR ALA ARG GLY PRO ASN PRO THR ALA ALA SER
    ·                        ─── ─── ─── ─── ─── ─── ─── ─── ─── ───     ───
350 │ SEQRES   2 A  261  LEU GLU ALA SER ALA GLY PRO PHE THR VAL ARG SER PHE
    ·                    ─── ─── ─── ─── ─── ─── ─── ─── ─── ─── ─── ─── ───
351 │ SEQRES   3 A  261  THR VAL SER ARG PRO SER GLY TYR GLY ALA GLY THR VAL
    ·                    ─── ─── ─── ─── ─── ─── ─── ─── ─── ─── ─── ─── ───
352 │ SEQRES   4 A  261  TYR TYR PRO THR ASN ALA GLY GLY THR VAL GLY ALA ILE
    ·                    ───     ─── ─── ─── ─── ───     ─── ─── ─── ─── ───
353 │ SEQRES   5 A  261  ALA ILE VAL PRO GLY TYR THR ALA ARG GLN SER SER ILE
    ·                    ─── ─── ─── ─── ─── ─── ─── ─── ─── ─── ───     ───
354 │ SEQRES   6 A  261  LYS TRP TRP GLY PRO ARG LEU ALA SER HIS GLY PHE VAL
    ·                    ─── ───     ─── ─── ─── ─── ─── ─── ─── ─── ─── ───
355 │ SEQRES   7 A  261  VAL ILE THR ILE ASP THR ASN SER THR LEU ASP GLN PRO
    ·                        ─── ─── ─── ─── ─── ─── ─── ─── ─── ─── ─── ───
356 │ SEQRES   8 A  261  SER SER ARG SER SER GLN GLN MET ALA ALA LEU GLY GLN
    ·                    ───     ─── ───     ───     ─── ───     ─── ─── ───
357 │ SEQRES   9 A  261  VAL ALA SER LEU ASN GLY THR SER SER SER PRO ILE TYR
    ·                    ─── ─── ─── ─── ─── ─── ─── ───         ─── ─── ───
358 │ SEQRES  10 A  261  GLY LYS VAL ASP THR ALA ARG MET GLY VAL MET GLY TRP
    ·                    ─── ─── ─── ─── ─── ─── ─── ─── ─── ─── ─── ─── ───
359 │ SEQRES  11 A  261  ALA MET GLY GLY GLY GLY SER LEU ILE SER ALA ALA ASN
    ·                    ─── ─── ───             ─── ─── ─── ─── ───     ───
360 │ SEQRES  12 A  261  ASN PRO SER LEU LYS ALA ALA ALA PRO GLN ALA PRO TRP
    ·                        ─── ─── ─── ─── ───         ─── ─── ─── ─── ───
361 │ SEQRES  13 A  261  ASP SER SER THR ASN PHE SER SER VAL THR VAL PRO THR
    ·                    ─── ───     ─── ─── ─── ───     ─── ─── ─── ─── ───
362 │ SEQRES  14 A  261  LEU ILE PHE ALA CYS GLU ASN ASP SER ILE ALA PRO VAL
    ·                    ─── ─── ─── ─── ─── ─── ─── ─── ─── ─── ─── ─── ───
363 │ SEQRES  15 A  261  ASN SER SER ALA LEU PRO ILE TYR ASP SER MET SER ARG
    ·                    ─── ───     ─── ─── ─── ─── ─── ─── ─── ─── ─── ───
364 │ SEQRES  16 A  261  ASN ALA LYS GLN PHE LEU GLU ILE ASN GLY GLY SER HIS
    ·                    ─── ─── ─── ─── ─── ─── ─── ─── ─── ───     ─── ───
365 │ SEQRES  17 A  261  SER CYS ALA ASN SER GLY ASN SER ASN GLN ALA LEU ILE
    ·                    ─── ─── ─── ─── ─── ─── ─── ─── ─── ─── ─── ─── ───
366 │ SEQRES  18 A  261  GLY LYS LYS GLY VAL ALA TRP MET LYS ARG PHE MET ASP
    ·                    ─── ───     ─── ─── ─── ─── ─── ─── ─── ─── ─── ───
367 │ SEQRES  19 A  261  ASN ASP THR ARG TYR SER THR PHE ALA CYS GLU ASN PRO
    ·                    ─── ─── ─── ─── ─── ─── ─── ─── ─── ─── ─── ─── ───
368 │ SEQRES  20 A  261  ASN SER THR ARG VAL SER ASP PHE ARG THR ALA ASN CYS
    ·                    ─── ─── ─── ─── ─── ─── ─── ─── ─── ─── ─── ─── ───
369 │ SEQRES  21 A  261  SER
    ·                    ───
    ╰SEQRES definition
   ╷
1  │  ASN PRO TYR ALA ARG GLY PRO ASN PRO THR ... ALA
2  │  SER LEU GLU ALA SER ALA GLY PRO PHE THR VAL ARG SER
3  │  PHE THR VAL SER ARG PRO SER GLY TYR GLY ALA GLY THR
4  │  VAL ... TYR PRO THR ASN ALA ... GLY THR VAL GLY ALA
5  │  ILE ALA ILE VAL PRO GLY TYR THR ALA ARG GLN ... SER
6  │  ILE LYS ... TRP GLY PRO ARG LEU ALA SER HIS GLY PHE
7  │  VAL ILE THR ILE ASP THR ASN SER THR LEU ASP GLN
8  │  PRO ... SER ARG ... SER ... GLN MET ... ALA LEU GLY
9  │  GLN VAL ALA SER LEU ASN GLY THR ... SER PRO ILE
10 │  TYR GLY LYS VAL ASP THR ALA ARG MET GLY VAL MET GLY
11 │  TRP ALA MET ... GLY SER LEU ILE SER ... ALA
12 │  ASN PRO SER LEU LYS ... ALA PRO GLN ALA PRO
13 │  TRP ASP ... SER THR ASN PHE ... SER VAL THR VAL PRO
14 │  THR LEU ILE PHE ALA CYS GLU ASN ASP SER ILE ALA PRO
15 │  VAL ASN ... SER ALA LEU PRO ILE TYR ASP SER MET SER
16 │  ARG ASN ALA LYS GLN PHE LEU GLU ILE ASN ... GLY SER
17 │  HIS SER CYS ALA ASN SER GLY ASN SER ASN GLN ALA LEU
18 │  ILE GLY ... LYS GLY VAL ALA TRP MET LYS ARG PHE MET
19 │  ASP ASN ASP THR ARG TYR SER THR PHE ALA CYS GLU ASN
20 │  PRO ASN SER THR ARG VAL SER ASP PHE ARG THR ALA ASN
21 │  CYS
   ╰Residues found in ATOM definitions
The residues as defined in the SEQRES records do not match with the found residues, see above for details.
, LooseWarning: SEQRES residue total invalid
   ╷
   │ ./data/single_chain.pdb
   ╵
The residue total (261) for SEQRES chain "A" does not match the total residues found in the chain (262).
]', src/structure.rs:343:10
stack backtrace:
   0: rust_begin_unwind
             at /rustc/90743e7298aca107ddaa0c202a4d3604e29bfeb6/library/std/src/panicking.rs:575:5
   1: core::panicking::panic_fmt
             at /rustc/90743e7298aca107ddaa0c202a4d3604e29bfeb6/library/core/src/panicking.rs:65:14
   2: core::result::unwrap_failed
             at /rustc/90743e7298aca107ddaa0c202a4d3604e29bfeb6/library/core/src/result.rs:1791:5
   3: core::result::Result<T,E>::unwrap
             at /rustc/90743e7298aca107ddaa0c202a4d3604e29bfeb6/library/core/src/result.rs:1113:23
   4: rust_sasa::structure::tests::from_pdbtbx
             at ./src/structure.rs:339:25
   5: rust_sasa::structure::tests::from_pdbtbx::{{closure}}
             at ./src/structure.rs:338:5
   6: core::ops::function::FnOnce::call_once
             at /rustc/90743e7298aca107ddaa0c202a4d3604e29bfeb6/library/core/src/ops/function.rs:251:5
   7: core::ops::function::FnOnce::call_once
             at /rustc/90743e7298aca107ddaa0c202a4d3604e29bfeb6/library/core/src/ops/function.rs:251:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
test structure::tests::from_pdbtbx ... FAILED

I have tried deleting the SEQRES sections, but I then get other errors saying that the number of atoms does not match the number listed in the MASTER record.

Surely this is a common state for PDB files, particularly ones from X-ray crystallography (XRC), where densities are often hard to resolve and some residues therefore have no ATOM records.

The code I am running is:

let (pdb, _e) = pdbtbx::open(
    "./data/single_chain.pdb",
    pdbtbx::StrictnessLevel::Loose, // Even with loose it panics!
)
.unwrap();

OWissett commented 1 year ago

I also get this error:

thread 'structure::tests::from_pdbtbx' panicked at 'index out of bounds: the len is 67 but the index is 67', /home/sasa/.cargo/registry/src/github.com-1ecc6299db9ec823/pdbtbx-0.10.1/src/read/pdb/lexer.rs:723:31

After investigating the source code, the DBREF sections appear to be causing this problem.

Please let me know: am I doing something wrong, or is this an issue with the crate?

To me it seems like an issue with the crate, since I am using regular PDB files and following the documentation on how to load them.

OWissett commented 1 year ago

Interim fix: delete the entire header from the PDB file.

douweschulte commented 1 year ago

Any crash is an issue with the crate. I will look into what is causing the issues and see if I can get a fix in. Thanks for opening the issue!

douweschulte commented 1 year ago

It is a bit of an ugly fix, but I lowered the level of the MASTER record mismatch so that StrictnessLevel::Loose will not panic anymore on these errors. That mismatch was ultimately all that kept the file from loading when I took the original PDB file. I can see you had more reported issues in your first crash, but I could not replicate those. If these issues persist, please send me the file you ran so I can debug what is happening.

A bit of backstory: there are 2207 atoms in the file (1931 ATOM + 276 HETATM, of which 244 are HOH), but the MASTER record claims there are 2187. I have not yet found the definition that determines the MASTER record number.
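For anyone who wants to reproduce this count on their own file, here is a minimal sketch of the check using only the standard library. It is not how pdbtbx implements it. It assumes a single-model file and that the coordinate count (numCoord) sits in columns 51-55 of the MASTER record, as laid out in the wwPDB format description. The function names are made up for illustration.

```rust
/// Count ATOM and HETATM records in PDB-formatted text.
fn count_coordinate_records(pdb_text: &str) -> usize {
    pdb_text
        .lines()
        .filter(|line| line.starts_with("ATOM  ") || line.starts_with("HETATM"))
        .count()
}

/// Read the numCoord field (columns 51-55, i.e. bytes 50..55) of the MASTER
/// record. Returns None if there is no MASTER record, the line is too short,
/// or the field is not a number.
fn master_num_coord(pdb_text: &str) -> Option<usize> {
    pdb_text
        .lines()
        .find(|line| line.starts_with("MASTER"))
        .and_then(|line| line.get(50..55))
        .and_then(|field| field.trim().parse::<usize>().ok())
}

/// Some((claimed, found)) when the MASTER count disagrees with the records
/// actually present; None when they agree or no usable MASTER record exists.
fn master_mismatch(pdb_text: &str) -> Option<(usize, usize)> {
    let claimed = master_num_coord(pdb_text)?;
    let found = count_coordinate_records(pdb_text);
    (claimed != found).then_some((claimed, found))
}

fn main() {
    // Hypothetical path, taken from the snippet earlier in this thread.
    let path = "./data/single_chain.pdb";
    match std::fs::read_to_string(path) {
        Ok(text) => match master_mismatch(&text) {
            Some((claimed, found)) => {
                println!("MASTER claims {claimed} coordinate records, found {found}")
            }
            None => println!("no MASTER mismatch detected"),
        },
        Err(err) => eprintln!("could not read {path}: {err}"),
    }
}
```

Because the field is read with `str::get`, a truncated MASTER line yields `None` instead of a panic.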

As a side project I restructured the whole lexing code so that errors like the second crash are pretty much impossible (there are a handful of cases left out of the hundreds before). So thanks for the push I needed to clean up this code!
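To illustrate the general idea (this is a sketch, not the actual pdbtbx lexer code): the second crash came from indexing past the end of a 67-byte line. Fixed-column parsing can avoid that class of panic by going through `str::get`, which returns `None` for an out-of-range slice. The helpers `field` and `parse_field` below are hypothetical names, and the column range used in `main` assumes the SEQADV layout from the wwPDB format description (seqNum in columns 19-22).

```rust
use std::str::FromStr;

/// Return the trimmed text in the 0-based, half-open byte range `start..end`,
/// or None if the line is too short (or the range is otherwise invalid).
fn field(line: &str, start: usize, end: usize) -> Option<&str> {
    line.get(start..end).map(str::trim)
}

/// Parse a fixed-column field as `T`, reporting a message instead of panicking.
fn parse_field<T: FromStr>(line: &str, start: usize, end: usize) -> Result<T, String> {
    let text = field(line, start, end).ok_or_else(|| {
        format!(
            "line has {} bytes, but columns {}..{} were requested",
            line.len(),
            start,
            end
        )
    })?;
    text.parse::<T>()
        .map_err(|_| format!("could not parse {:?} in columns {}..{}", text, start, end))
}

fn main() {
    // A SEQADV line from the error output above, and a truncated copy of it.
    let full = "SEQADV 5XH3 GLY A  103  UNP  A0A0K8P6T ARG   132 ENGINEERED MUTATION";
    let truncated = "SEQADV 5XH3 GLY A";

    // seqNum occupies columns 19-22, i.e. bytes 18..22.
    println!("{:?}", parse_field::<i32>(full, 18, 22));
    // The short line produces an Err value rather than an index panic.
    println!("{:?}", parse_field::<i32>(truncated, 18, 22));
}
```

The caller then decides how severe a too-short line is, for example by turning the `Err` into a warning rather than aborting the whole parse.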

Side note: if you want to run the version of pdbtbx with this patch you will have to run it directly from git until a new release is made: pdbtbx = {git = "https://github.com/douweschulte/pdbtbx"}.

OWissett commented 1 year ago

I have just had a look at the updated version of the lexer compared to the one on crates.io. It looks much better, since we should no longer get index errors causing panics :)

douweschulte commented 1 year ago

Perfect! If any of the issues pop up again or you find any new ones, please open an issue again. Thanks again for sending it in!