choderalab / perses

Experiments with expanded ensembles to explore chemical space
http://perses.readthedocs.io
MIT License

Serialized XML objects in small molecule pipeline have the wrong extension #1157

Closed by ijpulidos 1 year ago

ijpulidos commented 1 year ago

When we serialize objects in https://github.com/choderalab/perses/blob/4e36a6b1f9f5588dc06c5eda2f3f037d1183f007/perses/app/setup_relative_calculation.py#L733-L735, the resulting files are supposed to be gzipped, but inspecting them shows that they are plain, uncompressed XML:

❯ file *
complex-hybrid-system.gz: XML 1.0 document, ASCII text
complex-new-system.gz:    XML 1.0 document, ASCII text
complex-old-system.gz:    XML 1.0 document, ASCII text
solvent-hybrid-system.gz: XML 1.0 document, ASCII text
solvent-new-system.gz:    XML 1.0 document, ASCII text
solvent-old-system.gz:    XML 1.0 document, ASCII text

The expected result for each file is "gzip compressed data".
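
For anyone who wants to run the same check from Python rather than with the file command, here is a minimal sketch. It relies only on the fact that gzip streams start with the two magic bytes 0x1f 0x8b; the helper name is_gzip and the temporary file names are made up for illustration and are not part of perses.

```python
import gzip
import tempfile
from pathlib import Path

# Every gzip stream starts with these two bytes (RFC 1952).
GZIP_MAGIC = b"\x1f\x8b"


def is_gzip(path):
    """Return True if the file at `path` starts with the gzip magic bytes."""
    with open(path, "rb") as fh:
        return fh.read(2) == GZIP_MAGIC


XML_TEXT = '<?xml version="1.0" ?><System/>'

with tempfile.TemporaryDirectory() as tmp:
    # A plain XML file that merely carries a .gz extension (the reported symptom).
    plain = Path(tmp) / "plain-system.gz"
    plain.write_text(XML_TEXT)

    # A file that really is gzip-compressed.
    packed = Path(tmp) / "packed-system.gz"
    with gzip.open(packed, "wt") as fh:
        fh.write(XML_TEXT)

    plain_is_gzip = is_gzip(plain)
    packed_is_gzip = is_gzip(packed)

print(plain_is_gzip, packed_is_gzip)
```

The check looks at the file contents only, so a misleading extension does not affect the result.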

mikemhenry commented 1 year ago

Do we want to have them zipped, or do we want to keep them uncompressed?

ijpulidos commented 1 year ago

We do want to compress them.

mikemhenry commented 1 year ago

Okay this was a fun one: https://github.com/choderalab/perses/blob/main/perses/utils/data.py#L114-L127

We do save the XML with gzip, but the bz2 check that follows uses if instead of elif, so it starts a separate if/else. A .gz filename fails the bz2 test, falls into the else block, and we overwrite the gzipped file with an uncompressed version. I've got a PR incoming with some extra debug output that will help troubleshoot issues like this.
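
To make the control-flow problem concrete, here is a small sketch that reproduces the behavior described above. It is a reconstruction from this description, not the actual code in perses/utils/data.py; the function names serialize_buggy and serialize_fixed and the extension checks are assumptions for illustration.

```python
import bz2
import gzip
import os
import tempfile


def serialize_buggy(text, filename):
    """Hypothetical reconstruction of the bug: `if` where `elif` was intended."""
    if filename.endswith(".gz"):
        with gzip.open(filename, "wt") as fh:
            fh.write(text)
    # This starts a new if/else, so a .gz filename also reaches the else block.
    if filename.endswith(".bz2"):
        with bz2.open(filename, "wt") as fh:
            fh.write(text)
    else:
        with open(filename, "w") as fh:  # overwrites the gzipped file
            fh.write(text)


def serialize_fixed(text, filename):
    """Same logic with a single if/elif/else chain, so only one branch runs."""
    if filename.endswith(".gz"):
        with gzip.open(filename, "wt") as fh:
            fh.write(text)
    elif filename.endswith(".bz2"):
        with bz2.open(filename, "wt") as fh:
            fh.write(text)
    else:
        with open(filename, "w") as fh:
            fh.write(text)


def starts_with_gzip_magic(path):
    """Return True if the file begins with the gzip magic bytes 0x1f 0x8b."""
    with open(path, "rb") as fh:
        return fh.read(2) == b"\x1f\x8b"


xml = '<?xml version="1.0" ?><System/>'

with tempfile.TemporaryDirectory() as tmp:
    buggy_path = os.path.join(tmp, "buggy-system.gz")
    fixed_path = os.path.join(tmp, "fixed-system.gz")
    serialize_buggy(xml, buggy_path)
    serialize_fixed(xml, fixed_path)

    buggy_is_gzip = starts_with_gzip_magic(buggy_path)
    fixed_is_gzip = starts_with_gzip_magic(fixed_path)

    # The fixed file should decompress back to the original text.
    with gzip.open(fixed_path, "rt") as fh:
        fixed_roundtrip = fh.read()

print(buggy_is_gzip, fixed_is_gzip)
```

With the buggy version the .gz file ends up as plain text, which matches the file output reported at the top of this issue; with the elif chain the gzip branch is the only one that runs.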