douweschulte / pdbtbx

A library to open/edit/save (crystallographic) Protein Data Bank (PDB) and mmCIF files in Rust.
https://crates.io/crates/pdbtbx
MIT License
49 stars 12 forks source link

hacky workaround for using set membership check when adding new atoms #86

Closed DocKDE closed 2 years ago

DocKDE commented 2 years ago

Fixes #81 I hacked together a very dirty workaround for the performance issue with traversing the Residues. This should not be merged obviously but I wanted to provide a proof of concept. It does not work for mmCIFs and breaks the tests and examples. The main idea is to circumvent the issues that arise when using some sort of Set for storing Atoms/Residues etc. and instead create a HashSet on the side that holds the Residue ID that were already parsed into the PDB struct. Whenever a new Atom is added a membership check is first performed so the existing Residues are only traversed (in reverse order) if this returns nothing. I did a quick profiling and this seems to remove the bottleneck of add_atom pretty thoroughly. Thoughts on the concept?

Edit: I checked with that ghastly 1HTQ PDB file and with the change I made, it takes ~3 seconds to parse it.

douweschulte commented 2 years ago

Yes a very nice concept. If no single collection can have the behaviour we want, constant time lookup and mutable access and iterators, then combing two collections to get this behaviour is the best option. The downside is obviously that there will need to be internal bookkeeping to keep both collections in agreement with each other, especially in a thread safe way.

DocKDE commented 2 years ago

Turns out I introduced a bug. My "workaround" didn't parse anything because I mixed up booleans. With the fix I just pushed it still seems faster, though.

DocKDE commented 2 years ago

Yes a very nice concept. If no single collection can have the behaviour we want, constant time lookup and mutable access and iterators, then combing two collections to get this behaviour is the best option. The downside is obviously that there will need to be internal bookkeeping to keep both collections in agreement with each other, especially in a thread safe way.

Yes, we're gonna have to live with some drawbacks here, it seems. The most annoying thing with what I did is that I had to pass the HashSet containing all the Residues down the function call stack so the Chain::add_atom function has access to it. This doesn't seem particularly elegant to me and makes the function less usable for end users. Do you have a suggestion of where else to store the HashSet? Also, the same approach may be necessary for the Model::add_atom function which will take a long time if there are many chains in the PDB structure.

douweschulte commented 2 years ago

I would suggest to store the set in the struct itself, in that way the end user would not know anything about it and the set can be used to speed up other things as well. But that opens up more things to consider in respect to keeping both collections in the same state.