It seems like the Blosc issue also occurs even when we append the Sv data to the Sv store. The above example is small and might not run into this problem when storing Sv.
Some related discussions found:
- https://discourse.pangeo.io/t/problem-with-blosc-decompression-and-zarr-files/2648
- intermittent errors during blosc decompression of zarr chunks on pangeo.pydata.org (pangeo-data/pangeo#196)
Also found this discussion in echopype. @leewujung I know it was a long time ago, but do you remember if there was a fix applied in echopype for this issue?
PR: Address memory issues in combine
At the request of @leewujung, here is a list of TODOs for this PR (some may not be included, but I will try to update this list as I remember them):
TODOs
- [ ] Figure out `RuntimeError: error during blosc decompression: -1`
  - This seems related to intermittent errors during blosc decompression of zarr chunks on pangeo.pydata.org pangeo-data/pangeo#196
  - I think the reason for this error is that two processes are attempting to access the same chunk file.
- [x] Run a selection of files from `noaa-wcsd-pds/data/raw/Bell_M._Shimada/SH1707/EK80`
- [x] Run a selection of files from `noaa-wcsd-pds/data/raw/Bell_M._Shimada/SH2106/EK80`
  - [x] These runs triggered an error in the current PR code because the `channel` dimension is not always in ascending order. This is currently being corrected in issue Unordered `channel` coordinate #815.
- [x] Run a selection of files from `noaa-wcsd-pds/data/raw/Bell_M._Shimada/SH1701/EK60`
  - [x] These files can have different sized `channel` dimensions. Thus, we need to discuss the possibility of allowing missing channels amongst the datasets being combined.
  - [x] We have decided not to handle this scenario in this PR; instead, we will consider it later (see issue Consider allowing `combine_echodata` to accept different sized `channel` dimensions #819).
- [x] Run a selection of files from OOI associated with the AZFP echosounder
- [x] Add attributes to each `EchoData` group
- [x] Create a check that all of the channels in each dataset are the same, i.e. all channels must have the same names and lengths.
- [x] Check for and correct reversed time
  - [x] Store old times if they needed to be corrected
  - [x] Perform a minimal check that the first time value of each Dataset is less than the first time value of the subsequent Dataset
- [x] Change lazy naming for class
- [x] Make sure the combine function works with different sized `range_sample` dimensions amongst the sensors
- [x] Change `filenames` (in `Provenance`) numbering to `range(len(filenames))`
- [x] Add a default `compressor` if it does not exist in the encoding of a variable (a sketch of this is shown after the list)
- [x] Incorporate this new combine method into the echopype API
- [x] Discuss the possibility of having a deprecation cycle for the old combine method. In our in-person meeting, we decided that we will just use the new combine method created in this PR going forward and remove the old combine method entirely.
- [x] For the new API, we just need to add the parameter `path` (or some variant of it) to `ep.combine`. We will ensure that this path is appropriate using `validate_output_path`. Additionally, if no path is provided, we will issue a warning and mention that the output is being written to a temporary folder (specified by `validate_output_path`).
  - The above item will make sure it works for s3 bucket writing
  - We may need to allow for `fs_map` input (make this a new issue/PR maybe)
- [ ] Create integration tests for new combine function
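As a side note on the default-compressor item above, here is a minimal sketch (not the PR's actual code) of filling in a compressor for any variable whose encoding lacks one, assuming zarr 2.x and a placeholder Dataset `ds`:

```python
import zarr

# `ds` is a placeholder xarray Dataset about to be written with to_zarr.
for var in ds.variables.values():
    if "compressor" not in var.encoding:
        # zarr.storage.default_compressor is Blosc by default in zarr 2.x.
        var.encoding["compressor"] = zarr.storage.default_compressor
```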
A couple of additional items to note:
- The Provenance attribute variables are named differently in this PR.
- We introduce a routine that ensures all `ds_lists` have identical attributes (a minimal sketch follows this list). Runtime errors are raised if:
  - The keys are not the same
  - The values are not identical
  - The keys `date_created`, `conversion_time` do not have the same types
- If attribute values are numpy arrays, then they will not be included in the `Provenance` group. Instead, these values will only appear in the attributes of the combined `EchoData` object.
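A minimal sketch (not the actual PR code) of what such an attribute-consistency check might look like, assuming each element of `ds_lists` is an xarray Dataset:

```python
def check_identical_attrs(ds_list):
    """Raise RuntimeError if the datasets' attributes are inconsistent (sketch)."""
    ref = ds_list[0].attrs
    for ds in ds_list[1:]:
        if ds.attrs.keys() != ref.keys():
            raise RuntimeError("Attribute keys differ between datasets.")
        for key, val in ds.attrs.items():
            if key in ("date_created", "conversion_time"):
                # These are expected to differ in value, but must share a type.
                if type(val) is not type(ref[key]):
                    raise RuntimeError(f"Attribute {key!r} has inconsistent types.")
            elif val != ref[key]:
                raise RuntimeError(f"Attribute {key!r} has inconsistent values.")
```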
So far, I have tried the following from the echodataflow end but am still facing the issue intermittently (a sketch of the first two attempts follows this list):
- setting `zarr.storage.default_compressor` to `Zlib`
- `blosc.set_nthreads(1)`
- running on a `dask cluster`, since the Dask `ChunkingManager` is being used by default even when not executing on a dask cluster
I think this Blosc decompression issue is more of a chunking error than a problem with Blosc decompression itself. There were two ways I was able to get this to run: rechunking, and computing before `to_zarr`:
```python
slice.chunk("auto").to_zarr(
    store='./stores/slice.zarr',
    mode="w",
    consolidated=True,
    storage_options={},
    safe_chunks=False,
)
```

```python
slice.compute().to_zarr(
    store='./stores/slice.zarr',
    mode="w",
    consolidated=True,
    storage_options={},
    safe_chunks=False,
)
```
The former resets the chunks to full dimension width (i.e., it creates one big chunk) and the latter removes chunks altogether. That said, I'm not quite sure about the mechanism(s) behind why the original `slice` chunks were 'bad'.
I think the Blosc decompression error is just a sign that something is going wrong with the chunking, rather than being the problem itself.
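One way to make a suspected chunking problem visible is to compare each variable's Dask chunking against the chunk shape recorded in its Zarr encoding. A sketch, with the store path and dimension name as assumptions:

```python
import xarray as xr

ds = xr.open_zarr("./stores/source.zarr")    # placeholder store
sliced = ds.isel(ping_time=slice(100, 500))  # slicing shifts chunk boundaries

for name, var in sliced.data_vars.items():
    # In-memory Dask chunking vs. the chunk shape inherited from the source Zarr.
    print(name, var.chunks, var.encoding.get("chunks"))
```

A mismatch between the two is the kind of 'bad' chunking suspected above.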
Yes, I forgot to mention that I’m now using the following approach to store the slice and pass it to the MVBS function instead of keeping the slice in memory:
```python
nextsub.compute().to_zarr(
    store=out_zarr,
    mode="w",
    consolidated=True,
    storage_options=config.output.storage_options_dict,
    safe_chunks=False,
)
```
However, the Blosc issue also occurs when writing Sv results to the store. I'm attaching the entire console logs. The slice operation happens after the `write_output` stage.
I believe using `.compute()` works because it forces the computation of the Dask array into memory, resolving any pending operations and converting it before writing to disk. This helps avoid the chunking and parallel-write issues that likely lead to Blosc decompression errors. These errors might occur due to conflicts when multiple processes try to read from or write to the same chunk simultaneously, causing decompression failures.
Using the same approach for the `write_output` scenario might not be feasible, though, since loading the entire store into memory every time we append a slice would dramatically affect performance.
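For context, the append in `write_output` presumably follows a pattern like the one below (the append dimension is an assumption), which is why a blanket `.compute()` of the whole store is not an option:

```python
# Append a new Sv slice along the (assumed) ping_time dimension
# without reading the existing store back into memory.
new_slice.to_zarr(
    store=out_zarr,
    mode="a",
    append_dim="ping_time",
    consolidated=True,
    storage_options=config.output.storage_options_dict,
)
```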
Fixed by using `zarr.sync.ThreadSynchronizer`, following today's discussion with @Sohambutala @leewujung @valentina-s.
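A minimal sketch of the fix, assuming zarr 2.x and a write going through xarray's `to_zarr` (which accepts a `synchronizer` argument):

```python
import zarr

# Serialize concurrent access to each chunk during the write.
synchronizer = zarr.sync.ThreadSynchronizer()

ds.to_zarr(  # `ds` and `out_zarr` are placeholders
    store=out_zarr,
    mode="w",
    consolidated=True,
    synchronizer=synchronizer,
)
```

Note that `ThreadSynchronizer` only coordinates threads within a single process; zarr's `ProcessSynchronizer` would be needed if chunks were contended across processes.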
This issue could potentially be reopened if the Blosc error pops up again.
Description
Encountering an intermittent issue with Blosc decompression. The specific error message is: `RuntimeError: error during blosc decompression: -1`
This error occurs when slicing parts of two Zarr files and then storing the concatenated result to disk using `to_zarr`.
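A minimal sketch of that operation (paths, dimension name, and slice ranges are assumptions):

```python
import xarray as xr

ds1 = xr.open_zarr("./stores/file1.zarr")
ds2 = xr.open_zarr("./stores/file2.zarr")

# Slice part of each store, concatenate, and write the result back out.
combined = xr.concat(
    [ds1.isel(ping_time=slice(0, 100)), ds2.isel(ping_time=slice(0, 100))],
    dim="ping_time",
)
combined.to_zarr("./stores/slice.zarr", mode="w", consolidated=True)
```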
Steps to Reproduce: