Closed: vanlankveldthijs closed this 4 months ago
Thanks @vanlankveldthijs for the implementation. Following our discussion, the next steps can be: profiling the `subset` operation, for which `%%timeit` should be enough for a robust profiling.

Attaching here two example notebooks I made for comparison: notebooks.zip
Consolidated demo notebooks into one.
Fixed `stm.copy` issue (should have been `stm.copy()`).
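For context, this is the classic missing-call pitfall: `stm.copy` is just a bound method object, while `stm.copy()` actually makes the copy. A small stand-alone illustration, using a plain dict rather than the real STM object:

```python
data = {"space": [1, 2, 3], "time": [10, 20]}

clone = data.copy         # a bound method, NOT a copy
real_clone = data.copy()  # an actual shallow copy

print(callable(clone))     # the "copy" is just a method reference
print(real_clone == data)  # True: same contents
print(real_clone is data)  # False: a separate object
```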
Changed demo notebook to only test subset operation and to test this on large and small chunks (relative to dataset) to see impact.
Fixed pytests by checking for the existence of `time` in `chunksizes`.
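The fix presumably guards against datasets that are not chunked along `time`; a sketch of such a check, with a plain mapping standing in for xarray's `Dataset.chunksizes` (the variable and function names are illustrative):

```python
# Stand-in for ds.chunksizes, which maps dimension names to chunk sizes.
chunksizes = {"space": (1000, 1000, 500), "time": (100,)}

def time_chunks_or_none(chunksizes):
    """Return the chunk sizes along 'time', or None if that dimension
    is unchunked or absent - mirroring the guard added for the pytests."""
    if "time" in chunksizes:
        return chunksizes["time"]
    return None

print(time_chunks_or_none(chunksizes))        # (100,)
print(time_chunks_or_none({"space": (10,)}))  # None
```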
Please check whether the implementation could be nicer (more idiomatic Python).
Hi @vanlankveldthijs, thanks for the nice work and sorry for taking so long to review. I quite like the new notebooks. I think it is good for merging now.
As I have mentioned in the previous comment, the ruff workflow should have been fixed by Sarah in #66.
Two warnings in the notebook are documented in #74 and #73. They are not relevant for this PR and can be fixed later.
To resolve issue #57
There are a few remaining issues to resolve.
1) The pull request has two example notebooks: one with and one without storing and reloading the ordered data before performing the timing tests. To be determined which is better. The chosen one should be called `demo_order_stm.ipynb` and the other removed.
2) It seems that `stmat.copy` does not have the desired result: when I try to run STM operations on the cloned data, some parts seem to be missing (I think it was the whole `.stm` part). Solved by loading the whole dataset twice, but that is a bit of an ugly fix.
3) It seems that storing to zarr fails when using small chunk sizes. Solved by storing with 1000-size chunks and rechunking after loading. This is a bit ugly.
4) Possible issue with storing the ordered data to zarr when this needs to overwrite existing data. I think `shutil.rmtree` also sometimes gives issues (maybe when there is no directory).
5) Timing tests using the prerequisite small chunk size (500 in this case) become very slow. I don't know whether we can really fix this, though. Currently running the final timing tests on both notebooks; to be committed when done.
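Regarding point 4: `shutil.rmtree` raises `FileNotFoundError` on a missing path by default, so one way to make the overwrite step robust is to pass `ignore_errors=True`. A small sketch (the store name is hypothetical, and a throwaway temp directory stands in for the zarr store):

```python
import shutil
import tempfile
from pathlib import Path

def clear_store(path):
    """Remove an existing zarr store directory so it can be rewritten.
    ignore_errors=True makes this a no-op when the path does not exist."""
    shutil.rmtree(path, ignore_errors=True)

# Demo on a throwaway directory standing in for the zarr store.
store = Path(tempfile.mkdtemp()) / "ordered_stm.zarr"
store.mkdir()
clear_store(store)     # removes the existing store
clear_store(store)     # safe to call again: no error on a missing path
print(store.exists())  # False
```

Alternatively, if the data is written with xarray, `to_zarr(..., mode="w")` overwrites an existing store without manual directory removal.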