douweschulte / pdbtbx

A library to open/edit/save (crystallographic) Protein Data Bank (PDB) and mmCIF files in Rust.
https://crates.io/crates/pdbtbx
MIT License
48 stars 12 forks source link

Spatial query support #57

Closed douweschulte closed 3 years ago

douweschulte commented 3 years ago

It would be great if the library supported ways to get close atoms to specified points. Proposed is to do this by making an r*tree of all atoms upon request from the user fn create_rtree(self: &PDB). For this Atom should implement Point. Caution should be taken to avoid mutability problems as the rtree will be disentangled from the PDB hierarchy. https://docs.rs/rstar/0.8.2/rstar/struct.RTree.html

douweschulte commented 3 years ago

@DocKDE If you are still working on the example code from #52 you are maybe interested in this change where I built in support for Rstart trees into the library, that should make your code a little bit more readable. If possible having a bit of feedback on the exact names and the ergonomics of using the rtree in this way would be greatly appreciated! As a sidenote I submitted a PR on rstar to include tuples as Points Stoeoef/rstar/pull/57 if that is merged that should clean up the code a little bit more.

// This code can also be found as test 'spatial_lookup' in pdb.rs
let mut model = Model::new(0);
model.add_atom(Atom::new(false, 0, "", 0.0, 0.0, 0.0, 0.0, 0.0, "", 0).unwrap(),"A",(0, None),("MET", None));
model.add_atom(Atom::new(false, 1, "", 1.0, 1.0, 1.0, 0.0, 0.0, "", 0).unwrap(),"A",(0, None),("MET", None));
model.add_atom(Atom::new(false, 2, "", 0.0, 1.0, 1.0, 0.0, 0.0, "", 0).unwrap(),"A",(0, None),("MET", None));
let mut pdb = PDB::new();
pdb.add_model(model);

let tree = pdb.create_rtree();

assert_eq!(tree.size(), 3);
assert_eq!(
    tree.nearest_neighbor(&[1.0, 1.0, 1.0]).unwrap().serial_number(),
    1
);
assert_eq!(
    tree.locate_within_distance([1.0, 1.0, 1.0], 1.0).fold(0, |acc, _| acc + 1),
    2
);
let mut neighbors = tree.nearest_neighbor_iter(&pdb.atom(0).unwrap().pos_array());
assert_eq!(neighbors.next().unwrap().serial_number(), 0);
assert_eq!(neighbors.next().unwrap().serial_number(), 2);
assert_eq!(neighbors.next().unwrap().serial_number(), 1);
douweschulte commented 3 years ago

Thanks for comments! I will think about it a little more to try to include Residue information, and make some examples on how it could be done.

douweschulte commented 3 years ago

I think I have a solution for the missing PDB hierarchy information. I created a new struct FullHierarchy which contains references to the chain, residue, conformer which contain a single atom. In this way this struct gives all information you need (ie if the atom is from the backbone) while still being minimal in overhead and mutability problems. I also created some other functionalities which I saw fit for the FullHierarchy, like an iterator creating these for all atoms in a chain.

I created an extra rtree function create_full_hierarchy_rtree which creates an rtree filled with these FullHierarchys but behaving in exactly the same way as the other rtree in regard to the spatial queries.

In response to your question: I tend to use tuples for positions because tuples are defined to be a specific dimensionality and because they force the writer of the software to be specific which dimension they mean. There is no real reason why I could not switch to arrays except backwards compatibility. But as I think my PR will be merged (in a long time, or in the next fork of the rtree lib) I think sticking with pos_array for now is okay.

If you have some time can you help me with the following questions: Do you think this solves your problem? Would you want to use this struct more often in your code (if yes where)? Do you think this name is reasonable?

douweschulte commented 3 years ago

Thanks a lot for your reply, I really like the name you suggested and will certainly include an atoms_with_hierarchy function on the PDB.