a-renzini / pygwb

MIT License
3 stars 3 forks source link

[JOSS Review] Large git repository #2

Closed Sbozzolo closed 9 months ago

Sbozzolo commented 12 months ago

This repository is more than 800 MB. I think that the reason for this is that the pickle files are stored in the repo, and git cannot properly store the diff, so it has to store the entire file every time. A closer look shows that the file test/test_data/H1L1_1247644138-1247645038.pickle is the worst offender.

This is not sustainable, every time you regenerate the pickle, the repository grows by 30-80 MB.

Stack Exchange might help mitigate the problem, but probably the long-term solution is to do something with the pickle files (e.g., is all the data stored needed for the test?).

All sizes are in MB. The pack column is the size of the object, compressed, inside the pack file.
size  pack  SHA                                       location
79    72    ab702a64c2df098cf81aa782d7bd459442ced621  test/test_data/H1L1_1247644138-1247645038.pickle
79    72    218c857a2a6939b892e12d6b24d06ab5ed009601  test/test_data/H1L1_1247644138-1247645038.pickle
79    73    b525ce5cdfee0c2eb027d00a063bda2329054dfe  test/test_data/H1L1_1247644138-1247645038.pickle
79    73    3a39971a9ed5e8310cd5efe9931f3133dbfd679f  test/test_data/H1L1_1247644138-1247645038.pickle
67    61    2f8c4a9163fa7bee28c5022dbe8e149b45eb932e  tutorials/tutorial_data/H1L1_1267911343-1267912935.pickle
63    38    fbd7084349ecf246cec29eb199b848a2b2858700  test/test_data/H1L1_1247644138-1247645038.pickle
63    38    c027d346b00462dfa8947cb4d0ad5ebaaa4daae7  test/test_data/H1L1_1247644138-1247645038.pickle
63    38    0e00cfb8f7cc51244aed2f2551a9d49953d17712  test/test_data/H1L1_1247644138-1247645038.pickle
63    38    b05ec0f5437004ea0f2fa5a50d256affef1eb357  test/test_data/H1L1_1247644138-1247645038.pickle
55    53    00a5a3c9ccd8143b6e6e3c08649a1dfe0a8b28b3  test/test_data/detector_testdata_H1.pickle
a-renzini commented 10 months ago

Indeed - the data stored is all required for the tests, however I agree the format and size as is is not sustainable. I have significantly reduced the size of test/test_data/H1L1_1247644138-1247645038.pickle by simplifying the tests and storing smaller spectrograms/data products. The repo is 50MB and the LFS 370MB. However, when cloning the repo it is still large as the contents of py_die/.git/objects/pack contain refs to the larger file. Do you have suggestions on how to reduce these? I've tried to look for ways to "purge" the repo of old files but haven't found an obvious solution. In any case, this is not an issue for users as the tests (and test data) are not a part of the release. Please let me know if you have suggestions!

Sbozzolo commented 10 months ago

To properly solve this issue you'd probably have to rethink how you do tests. I agree that most users won't be affected directly by this, so this is not a blocking issue.

The older files will stay with git to give you the ability to roll back to previous commits without losing information. If you wanted to reduce the size of the previously stored objects, what you could do is backport your change for the test with test/test_data/H1L1_1247644138-1247645038.pickle to the first commit where that test was introduced, and rebase every following commit to use the smaller data. This is tedious and probably not worth it, so I'd advise against it.

I think that the repo is good as is for JOSS, but maybe you can leave this issue open and try to fix the root of the issue in the future.

a-renzini commented 9 months ago

I have a few ideas of how to rethink the tests but would take some effort - I had initially tried to have the tests generate the test data on the fly but this would create circular tests that are always going to be self consistent; I need to design a simpler way of checking each step of the pygwb calculations (e.g., instead of checking that entire arrays are identical, just checking that their sum is, and things like that...). I'll add an issue for now as you suggest and then I can address this in a future release.

a-renzini commented 9 months ago

(the pickle-related issue was already open and is #48)