douweschulte / pdbtbx

A library to open/edit/save (crystallographic) Protein Data Bank (PDB) and mmCIF files in Rust.
https://crates.io/crates/pdbtbx
MIT License

Possible bug parsing H atoms on nucleic structure #92

brianjimenez closed this issue 2 years ago

brianjimenez commented 2 years ago

When parsing a nucleic acid structure in PDB format, the number of atoms found is larger than the actual number of atoms in the PDB file. From a few tests, it seems that hydrogen atoms are added multiple times to the internal PDB object.

Here is code that shows the problem:

use std::env;

fn main() {
    // Fall back to the current directory when not run through cargo.
    let cargo_path = env::var("CARGO_MANIFEST_DIR").unwrap_or_else(|_| String::from("."));

    let test_path: String = format!("{}/tests", cargo_path);

    let structure_filename: String = format!("{}/nucleic.pdb", test_path);
    println!("Reading input structure: {}", structure_filename);
    let (structure, _errors) = pdbtbx::open(&structure_filename, pdbtbx::StrictnessLevel::Medium).unwrap();

    // Atom count as reported by the library (larger than the number of atoms in the file).
    println!("{}", structure.atom_count());

    // Print every atom; the duplicated hydrogen shows up repeatedly.
    for atom in structure.atoms() {
        println!("{}", atom);
    }
}

And these are the last 10 lines of the output:

ATOM ID: H41, Number: 58, Element: H, X: 14.552, Y: 16.481, Z: 2.862, OCC: 1, B: 0, ANISOU: false
ATOM ID: H5'2, Number: 63, Element: H, X: 16.762, Y: 8.967, Z: -2.135, OCC: 0.03333333333333333, B: 0, ANISOU: false
ATOM ID: H42, Number: 59, Element: H, X: 14.094, Y: 15.743, Z: 4.072, OCC: 1, B: 0, ANISOU: false
ATOM ID: H5'2, Number: 63, Element: H, X: 16.762, Y: 8.967, Z: -2.135, OCC: 0.03333333333333333, B: 0, ANISOU: false
ATOM ID: H2'1, Number: 60, Element: H, X: 13.942, Y: 11.912, Z: -1.539, OCC: 1, B: 0, ANISOU: false
ATOM ID: H5'2, Number: 63, Element: H, X: 16.762, Y: 8.967, Z: -2.135, OCC: 0.03333333333333333, B: 0, ANISOU: false
ATOM ID: H2'2, Number: 61, Element: H, X: 12.539, Y: 11.338, Z: -1.169, OCC: 1, B: 0, ANISOU: false
ATOM ID: H5'2, Number: 63, Element: H, X: 16.762, Y: 8.967, Z: -2.135, OCC: 0.03333333333333333, B: 0, ANISOU: false
ATOM ID: H5'1, Number: 62, Element: H, X: 17.258, Y: 10.053, Z: -1.131, OCC: 1, B: 0, ANISOU: false
ATOM ID: H5'2, Number: 63, Element: H, X: 16.762, Y: 8.967, Z: -2.135, OCC: 0.03333333333333333, B: 0, ANISOU: false
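The repetition in the lines above can be made explicit by tallying the serial numbers. The following is only a sketch using the standard library: `duplicated_serials` is a hypothetical helper that is not part of pdbtbx, and the serial numbers are hard-coded from the ten output lines rather than read from the parsed structure.

```rust
use std::collections::HashMap;

/// Hypothetical helper (not part of pdbtbx): returns every serial number
/// that occurs more than once, with its count, sorted by serial number.
fn duplicated_serials(serials: &[usize]) -> Vec<(usize, usize)> {
    let mut counts: HashMap<usize, usize> = HashMap::new();
    for &serial in serials {
        *counts.entry(serial).or_insert(0) += 1;
    }
    let mut duplicates: Vec<(usize, usize)> =
        counts.into_iter().filter(|&(_, count)| count > 1).collect();
    duplicates.sort();
    duplicates
}

fn main() {
    // Serial numbers copied from the ten output lines quoted above.
    let serials = [58, 63, 59, 63, 60, 63, 61, 63, 62, 63];
    for (serial, count) in duplicated_serials(&serials) {
        // prints "serial 63 appears 5 times"
        println!("serial {} appears {} times", serial, count);
    }
}
```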

I've prepared the full test ready to be executed: test_pdbtbx.tar.gz

Thank you in advance for your support, and congratulations on the great work on this library!

brianjimenez commented 2 years ago

This also happens when using pdbtbx::StrictnessLevel::Strict.

douweschulte commented 2 years ago

That is very interesting, I will take a look. Thanks for raising the issue!

douweschulte commented 2 years ago

Okay, I found it: in the parsing code the conformer ID (residue name) was not properly trimmed. As a result, each atom in the residue ended up in its own separate conformer. The library then duplicated the last atom so that it would be present in all conformers.
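To illustrate the general failure mode (this is not the actual pdbtbx parsing code), here is a minimal sketch of how untrimmed name fields can split one residue into several conformers. The helper `distinct_conformers` and the padded input strings are assumptions made up for the example; they are not taken from the library or from the PDB file in this issue.

```rust
use std::collections::BTreeSet;

/// Hypothetical helper: counts how many distinct conformer names appear
/// among raw name fields, optionally trimming whitespace first.
fn distinct_conformers(names: &[&str], trim: bool) -> usize {
    names
        .iter()
        .map(|name| if trim { name.trim() } else { *name })
        .collect::<BTreeSet<&str>>()
        .len()
}

fn main() {
    // The same residue name with different padding (illustrative input only).
    let raw = [" DC", "DC ", "DC"];
    // Without trimming, every padding variant counts as its own conformer.
    println!("untrimmed: {}", distinct_conformers(&raw, false)); // prints "untrimmed: 3"
    // With trimming, all three collapse into a single conformer.
    println!("trimmed: {}", distinct_conformers(&raw, true)); // prints "trimmed: 1"
}
```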

I added your PDB and example code to the tests to make sure this bug cannot ever surface again. Thanks again for raising the issue.

brianjimenez commented 2 years ago

That was fast! Thank you @douweschulte for your quick reply and fix. Are there plans to release a new version on crates.io soon?

douweschulte commented 2 years ago

I will create a new patch version later today. If you need the fix in a Rust project before then, it is also possible to depend on the git repository directly instead of the package on crates.io.
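For reference, a git dependency in Cargo.toml looks roughly like the following. The repository URL is an assumption based on the repository name at the top of this page, and no branch, tag, or revision is pinned here.

```toml
[dependencies]
# Assumed repository URL; pin with `rev`, `tag`, or `branch` if a specific commit is needed.
pdbtbx = { git = "https://github.com/douweschulte/pdbtbx" }
```

Once a fixed version is published, switching back to a normal crates.io version requirement is usually the simpler choice.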

brianjimenez commented 2 years ago

Awesome, I just saw it on crates.io 📦 Thanks a lot! 🍻