PDB-REDO / libcifpp

Library containing code to manipulate mmCIF and PDB files
https://pdb-redo.github.io/libcifpp/
BSD 2-Clause "Simplified" License

cif::pdb::reconstruct_pdbx is very slow #64

Open Augustin-Zidek opened 2 months ago

Augustin-Zidek commented 2 months ago

Hello, many thanks for the development and maintenance of libcifpp!

I've noticed that cif::pdb::reconstruct_pdbx is very slow. For example, the 7soy mmCIF file from the PDB takes < 0.2 seconds to parse, but running cif::pdb::reconstruct_pdbx on it takes roughly 4.5 seconds, i.e. a slow-down of more than 20x if one wants to perform the correctness check/autofix.

The vast majority of the time is spent in cif::compound_factory::create:

image

Could that time be reduced? Also, cif::compound_factory::create seems to be called from multiple places. Would it make sense to cache that load?

I think this could also be sped up if the CCD were compressed with zstd instead of gzip, as zstd decompresses much faster.
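To make the caching suggestion concrete, here is a minimal memoization sketch. It is not libcifpp code: `compound_stub`, `compound_cache`, and the loader callback are hypothetical names standing in for whatever cif::compound_factory::create actually does internally. It only illustrates the idea of paying the load cost once per compound ID.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for a parsed CCD compound; not a libcifpp type.
struct compound_stub
{
	std::string id;
};

// Memoizing wrapper: calls the expensive loader at most once per compound ID.
// Not thread-safe; a real implementation would need locking.
class compound_cache
{
  public:
	using loader_fn = std::function<std::shared_ptr<const compound_stub>(const std::string &)>;

	explicit compound_cache(loader_fn loader)
		: m_loader(std::move(loader))
	{
		assert(m_loader); // an empty loader would make every lookup throw
	}

	std::shared_ptr<const compound_stub> get(const std::string &id)
	{
		auto it = m_cache.find(id);
		if (it != m_cache.end())
			return it->second; // cache hit: no reload

		auto result = m_loader(id); // cache miss: the expensive path
		m_cache.emplace(id, result); // null results are cached too
		return result;
	}

	std::size_t size() const { return m_cache.size(); }

  private:
	loader_fn m_loader;
	std::unordered_map<std::string, std::shared_ptr<const compound_stub>> m_cache;
};
```

With something like this in place, repeated lookups of the same residue name from different call sites would reach the loader only the first time.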

mhekkel commented 2 months ago

Could it be that your components.cif file is compressed? Does it help if you extract that file (the one in /var/cache/libcifpp)?
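One quick way to check whether a file is gzip-compressed is to look at its first two bytes, which are 0x1f 0x8b for gzip streams. The helper below is a sketch with a hypothetical name (`looks_gzip_compressed`); it is not part of libcifpp.

```cpp
#include <fstream>
#include <string>

// Returns true if the file starts with the gzip magic bytes 0x1f 0x8b.
// Returns false for unreadable files or files shorter than two bytes.
bool looks_gzip_compressed(const std::string &path)
{
	std::ifstream in(path, std::ios::binary);
	unsigned char magic[2] = {0, 0};
	in.read(reinterpret_cast<char *>(magic), 2);
	return in.gcount() == 2 && magic[0] == 0x1f && magic[1] == 0x8b;
}
```

Calling it on the components.cif file in /var/cache/libcifpp would show whether decompression is part of the cost.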

mhekkel commented 1 month ago

You mentioned using zstd. That's a good suggestion, but the point is that when you use the bundled script to update components.cif, it writes the file out uncompressed, removing the need for decompression entirely.

mhekkel commented 1 month ago

As a reference, cif-validate on 7soy takes 0.2 seconds on my laptop:

$ time build/cif-validate /tmp/7soy.cif.gz

real    0m0,246s
user    0m0,239s
sys     0m0,007s