alchemistry / alchemtest

the simple alchemistry test set
https://alchemtest.readthedocs.io
BSD 3-Clause "New" or "Revised" License
8 stars 12 forks source link

AMBER - in "simplesolvated" directory all files are not compressed #68

Closed DrDomenicoMarson closed 1 year ago

DrDomenicoMarson commented 2 years ago

All the files neded for testing are compressed as .bz2 files, except the ones inside the simplesolvated directory. Is there a reason behind this? We could spare some space compressing also these files.

xiki-tempula commented 2 years ago

@orbeckst Maybe you could comment on this as well.

I'm leaning towards having all of them as uncompressed files. I mean yes bz2 will make the file smaller so save some time in downloading the test files but these time is then spent on unzipping the files and writing it to a local file before reading it. At the end of the day, they are not going to drastically decrease the CI time.

Considering that we are pulling the files from alchemtest/archive/master.zip, file > bz2 > zip might not really shrink much space compared with file > zip.

On the other hand, saving the files as bz2 would mean that the user has to do the unzipping, which seems like extra effort to me to be honest.

DrDomenicoMarson commented 2 years ago

I LOVE this suggestion :) I was afraid you preferred to have all compressed files, hence the issue, but I completely agree that's much easier to create and check tests with uncompressed files!

orbeckst commented 2 years ago

I'd like to keep them compressed (or compresse them if not done so already) so that the footprint of the test package is small.

We never uncompress files to disk. We decompress on the fly with anyopen() https://github.com/alchemistry/alchemlyb/blob/cb965ad049556dc97445acc16cbb21a2bf3f7f38/src/alchemlyb/parsing/util.py#L20

Do you have timing information that shows a marked difference between compressed and uncompressed test files?

DrDomenicoMarson commented 2 years ago

Ok, I'll compress the files in current and future PRs then!

xiki-tempula commented 2 years ago

@orbeckst So I did a test with bz2

import time
from alchemlyb.parsing.amber import extract_u_nk
from alchemtest.amber import load_bace_example
file = load_bace_example().data['solvated']['decharge'][0]
a = time.time()
for i in range(1000):
    extract_u_nk(file, 298)
b = time.time()
print(b-a)

gives 43.86949706077576

with bz2.open(file, "rt") as bz_file:
    text = bz_file.read()
    with open('test.out', 'w') as f:
        f.write(text)
a = time.time()
for i in range(1000):
    extract_u_nk('test.out', 298)
b = time.time()
print(b-a)

gives 24.963331937789917 So running bz2 will make it 75% slower.

With regard to the zip file size, I have created a branch where all bz2 files have been unzipped (https://github.com/xiki-tempula/alchemtest/tree/feat_unzip), the zip file has a size of 82.9 MB which is slightly larger than the original zip file of 73.9 MB but I don't think it is too bad?

Plus, it will avoid https://github.com/alchemistry/alchemtest/pull/71#issuecomment-1273639362

orbeckst commented 2 years ago

I agree that the difference in download size is not an issue for alchemlyb where we take the zip. Maybe that's even true of whl etc (not sure if they are compressed). However, when installed, the difference is sizable

 81M    0.6.0.tar.gz
 94M    alchemtest-0.6.0             # as dowloaded
304M    alchemtest-0.6.0-unpacked    # manually unpacked all gz and bz2 files

so in order to keep the installed footprint down, I'd really like to keep datafiles compressed. (This is not important for CI but for users/developers.)

The data that you showed https://github.com/alchemistry/alchemtest/issues/68#issuecomment-1264606804 indicate overhead and that's interesting. I don't know if loading the data is the bottleneck for most tests — it might well be — and if this is the case then unpacking the data files might cut CI time by 25% - 50%. However, I think that the GROMACS reader will not be affected as much as it uses pandas.read_csv(), which is already much faster than our manually written ASCII readers (but I don't know how much it is affected by compression).

I ran pytest -v --durations=20 (on my laptop with SSD) to see where we spend time (we can add --durations to the alchemlyb CI for more information): here are the 20 slowest tests:

============================================================ slowest 20 durations =============================================================
34.28s setup    src/alchemlyb/tests/test_workflow_ABFE.py::Test_automatic_ABFE::test_read
32.75s setup    src/alchemlyb/tests/test_workflow_ABFE.py::Test_manual_ABFE::test_read
9.33s call     src/alchemlyb/tests/test_workflow_ABFE.py::Test_manual_ABFE::test_convergence_method
9.30s call     src/alchemlyb/tests/test_workflow_ABFE.py::Test_automatic_ABFE::test_convergence_method
7.18s call     src/alchemlyb/tests/test_workflow_ABFE.py::Test_manual_ABFE::test_convergence_nosample_u_nk
5.16s setup    src/alchemlyb/tests/test_workflow_ABFE.py::Test_automatic_benzene::test_read
4.61s call     src/alchemlyb/tests/test_visualisation.py::test_plot_dF_state
2.83s setup    src/alchemlyb/tests/test_workflow_ABFE.py::Test_automatic_amber::test_summary
2.74s call     src/alchemlyb/tests/test_workflow_ABFE.py::Test_manual_ABFE::test_estimator_method
2.72s call     src/alchemlyb/tests/test_workflow_ABFE.py::Test_automatic_ABFE::test_estimator_method
1.99s setup    src/alchemlyb/tests/test_fep_estimators.py::TestMBAR::test_mbar[X_delta_f5]
1.96s call     src/alchemlyb/tests/parsing/test_gmx.py::test_u_nk_with_potential_energy
1.88s call     src/alchemlyb/tests/parsing/test_gmx.py::test_u_nk_with_total_energy
1.85s call     src/alchemlyb/tests/test_visualisation.py::test_plot_ti_dhdl
1.81s setup    src/alchemlyb/tests/test_fep_estimators.py::TestBAR::test_bar[X_delta_f1]
1.79s call     src/alchemlyb/tests/parsing/test_gmx.py::test_u_nk_without_energy
1.78s setup    src/alchemlyb/tests/test_fep_estimators.py::TestAutoMBAR::test_mbar[X_delta_f5]
1.73s setup    src/alchemlyb/tests/test_fep_estimators.py::TestAutoMBAR::test_mbar[X_delta_f8]
1.72s setup    src/alchemlyb/tests/test_fep_estimators.py::TestMBAR::test_mbar[X_delta_f8]
1.72s call     src/alchemlyb/tests/test_fep_estimators.py::TestMBAR::test_mbar[X_delta_f4]
=============================================== 383 passed, 3141 warnings in 238.44s (0:03:58) ================================================

Probably anything labelled setup is related to reading files (possibly also related to https://github.com/alchemistry/alchemlyb/issues/206 ) but I cannot discern from the output if we're reading GMX or AMBER in these tests.

@xiki-tempula is your main argument for uncompressing that we want to make CI run faster?

If so, then the question is by how much it would really speed up things. And then one has to decide what tradeoffs one wants to make.

orbeckst commented 1 year ago

I'll close this issue as "won't fix" for now. Please chime in if you want to re-open the discussion.