Closed accowa closed 1 month ago
Successfully wrote a whole year of eORCA12 1-day T-grid mean fields to a bucket (s3://npd12-j001-t1d-1976). Each month was appended in turn, but the whole sequence took over 2 hours to complete. This isn't going to be feasible for all the output (the remaining U- and V-grid 1-day means, all 5-day means, all monthly means and annual means) unless the parallelisation via Dask allows a significant and reliable speed-up.
One observed oddity is that the time_counter variable has a chunk size of 31 (presumably fixed by the length of January?). This is despite the explicit request of:
-cs '{"time_counter": 1, "x": 720, "y": 360}'
in each call. Maybe this is related to the unwarranted warning? Chunk sizes have been respected everywhere else; just not for time_counter itself. E.g.:
ncdump -h -s https://noc-msm-o.s3-ext.jc.rl.ac.uk/npd12-j001-t1d-1976/T1d/#mode=nczarr,s3
netcdf \#mode\=nczarr\,s3 {
dimensions:
y = 3605 ;
x = 4320 ;
nvertex = 4 ;
time_counter = 366 ;
axis_nbounds = 2 ;
.
.
double time_centered(time_counter) ;
time_centered:bounds = "time_centered_bounds" ;
time_centered:calendar = "gregorian" ;
time_centered:long_name = "Time axis" ;
time_centered:standard_name = "time" ;
time_centered:time_origin = "1900-01-01 00:00:00" ;
time_centered:units = "seconds since 1900-01-01 00:00:00" ;
time_centered:_Storage = "chunked" ;
time_centered:_ChunkSizes = 1 ;
time_centered:_Filter = "32001,0,0,0,0,5,1,1" ;
time_centered:_Codecs = "[{\"blocksize\": 0, \"clevel\": 5, \"cname\": \"lz4\", \"id\": \"blosc\", \"shuffle\": 1}]" ;
time_centered:_Endianness = "little" ;
double time_counter(time_counter) ;
time_counter:axis = "T" ;
time_counter:bounds = "time_counter_bounds" ;
time_counter:calendar = "gregorian" ;
time_counter:long_name = "Time axis" ;
time_counter:standard_name = "time" ;
time_counter:time_origin = "1900-01-01 00:00:00" ;
time_counter:units = "seconds since 1900-01-01" ;
time_counter:_Storage = "chunked" ;
VVVVVVVVVVVVVVVVVVVVVVV
time_counter:_ChunkSizes = 31 ;
^^^^^^^^^^^^^^^^^^^^^^^^^
time_counter:_Filter = "32001,0,0,0,0,5,1,1" ;
time_counter:_Codecs = "[{\"blocksize\": 0, \"clevel\": 5, \"cname\": \"lz4\", \"id\": \"blosc\", \"shuffle\": 1}]" ;
time_counter:_Endianness = "little" ;
float tossq_con(time_counter, y, x) ;
tossq_con:cell_methods = "time: mean (interval: 300 s)" ;
tossq_con:coordinates = "time_centered nav_lat nav_lon" ;
tossq_con:interval_operation = "300 s" ;
tossq_con:interval_write = "1 d" ;
tossq_con:long_name = "square_of_sea_surface_conservative_temperature" ;
tossq_con:missing_value = 1.00000002004088e+20 ;
tossq_con:online_operation = "average" ;
tossq_con:standard_name = "square_of_sea_surface_temperature" ;
tossq_con:units = "degC2" ;
tossq_con:_Storage = "chunked" ;
tossq_con:_ChunkSizes = 1, 360, 720 ;
tossq_con:_Filter = "32001,0,0,0,0,5,1,1" ;
tossq_con:_Codecs = "[{\"blocksize\": 0, \"clevel\": 5, \"cname\": \"lz4\", \"id\": \"blosc\", \"shuffle\": 1}]" ;
Hi @accowa ,
I included that warning to inform users that once new data is added to the object store, rechunking is not possible. However, I’ll adjust the warning so that it only appears when users attempt to modify the chunking strategy.
Regarding the time_counter data, I'll investigate why it isn't being chunked the same way as the other variables. From a quick review of the code, there doesn't seem to be a specific reason for excluding time_counter from chunking.
In my opinion, given that it's a 1D array of time data, we may not need to chunk it. If you prefer, I can include it in the chunking strategy.
As for the execution time, Dask does speed up data uploads by parallelizing the transfer of each variable, though I'm not sure how much performance will improve. However, there are two main bottlenecks right now:
Integrity check: the checksum calculation is quite slow because it requires converting all the data into bytes before it can be computed. I'm currently using np.frombuffer for this, but I'll test converting the np.array to a dask.array to see if it improves performance, and I'll include this in the new pull request. Additionally, I'm considering adding a flag to allow users to disable the integrity check if needed. Do you think that would be a helpful option?
Data reprojection: This step is also time-consuming, but since you’re not running it, this shouldn't be the issue in your case.
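On the integrity-check bottleneck above: hashing the buffer in place, or in fixed-size blocks (essentially what a chunk-by-chunk dask-backed version would do), avoids materialising a full bytes copy of the array. A minimal numpy sketch, not the msm_os implementation; the function names are illustrative:

```python
import hashlib
import numpy as np

def checksum_bytes(arr: np.ndarray) -> str:
    """Hash the raw bytes of an array in one pass, without an
    intermediate bytes copy (memoryview instead of arr.tobytes())."""
    h = hashlib.sha256()
    # ascontiguousarray guards against strided views.
    h.update(memoryview(np.ascontiguousarray(arr)))
    return h.hexdigest()

def checksum_blocked(arr: np.ndarray, block: int = 1 << 20) -> str:
    """Hash in fixed-size blocks, mimicking a chunked traversal."""
    flat = np.ascontiguousarray(arr).reshape(-1).view(np.uint8)
    h = hashlib.sha256()
    for start in range(0, flat.size, block):
        h.update(memoryview(flat[start:start + block]))
    return h.hexdigest()
```

Both traversals produce the same digest, so a blocked (or per-chunk parallel) rewrite would not change the checksum values already stored.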
Please let me know if you're happy with these changes.
The time_counter variable itself doesn't really need to be chunked, but a chunk size of 1 does need to be applied to the time_counter dimension, because most accesses of 2D and 3D slices/volumes will be done one time-slice at a time. All variables except time_counter have been chunked correctly, so it is odd that time_counter has been treated differently. We should get to the bottom of that.
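The cost of getting the time chunking wrong is easy to quantify: an object store serves whole chunks, so a time chunk of 31 forces every single-time-slice read to fetch 31 times the data. A quick stdlib-only check using the eORCA12 T-grid shape from the ncdump above (4-byte floats assumed):

```python
import math

def bytes_read_for_one_time_slice(shape, chunks, itemsize=4):
    """Bytes fetched to read one time slice: every chunk that
    intersects the slice must be read in full."""
    (_, y, x), (ct, cy, cx) = shape, chunks
    n_chunks = math.ceil(y / cy) * math.ceil(x / cx)
    return n_chunks * ct * cy * cx * itemsize

shape = (366, 3605, 4320)  # time_counter, y, x
good = bytes_read_for_one_time_slice(shape, (1, 360, 720))
bad = bytes_read_for_one_time_slice(shape, (31, 360, 720))
print(bad // good)  # 31x more data fetched per time slice
```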
Yes, I think a flag to switch off the integrity check would be useful. The checks are useful when establishing new work flows but we may need to take risks to achieve a workable solution. At least such a flag will identify if the checks are indeed a major bottleneck.
If it helps, I see the priority order as:
1. Flag to suppress integrity checks
2. Dask robustness
3. time_counter chunking
4. Suppress unwarranted warning
Hi @accowa ,
I've implemented the following changes:
Flag to suppress integrity checks: added a new flag skip_integrity_check (-si or --skip-integrity-check). By default it is set to False (integrity checks are applied). I'll update the NEMO_CYLC documentation and code accordingly.
Dask integration for checksum calculation: I've integrated Dask for the checksum calculation, which in my tests has shown a 2-5x speed improvement. However, this performance boost will vary depending on the machine and whether Dask is already being used to load data into the object store.
time_counter chunking: I've added debug output (not present in the production code) to investigate why the time_counter variable isn't being chunked as expected:
if new_chunking:
    print(f"Rechunking {variable} to {new_chunking}")
    ds_filepath[variable] = ds_filepath[variable].chunk(new_chunking)
    print(f"New chunking: {ds_filepath[variable].chunks}")
The output I'm seeing is:
Rechunking tossq_con to {'time_counter': 1, 'x': 720, 'y': 360}
New chunking: ((1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), (360, 360, 360, 360, 360, 360, 360, 360, 360, 360, 5), (720, 720, 720, 720, 720, 720))
Rechunking nav_lat to {'x': 720, 'y': 360}
New chunking: ((360, 360, 360, 360, 360, 360, 360, 360, 360, 360, 5), (720, 720, 720, 720, 720, 720))
Rechunking nav_lon to {'x': 720, 'y': 360}
New chunking: ((360, 360, 360, 360, 360, 360, 360, 360, 360, 360, 5), (720, 720, 720, 720, 720, 720))
Rechunking time_centered to {'time_counter': 1}
New chunking: ((1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1),)
Rechunking time_counter to {'time_counter': 1}
New chunking: None
It seems that time_counter isn't being chunked even though the code runs without errors. To clarify, the time_counter variable is not the same as the time_counter dimension: the variable holds the timestamp values along that dimension. While it's odd that time_counter is being handled differently, it doesn't affect the chunking of the data variables themselves. I'll open a new issue to address this in more detail later.
Suppress unwarranted warnings: I’ve adjusted the warning so it will only appear if the chunk strategy being applied differs from the original one. No warnings will be shown if the chunk strategy remains unchanged.
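The adjusted warning logic amounts to comparing the requested strategy against what the store already has. A small sketch of the intended behaviour; the function name and dict shapes are illustrative, not the msm_os code:

```python
def rechunk_warning(existing_chunks: dict, requested_chunks: dict):
    """Return a warning only when the caller requests a chunking that
    differs from the one already fixed in the object store; stay quiet
    when the strategy is unchanged (or none is requested)."""
    if requested_chunks and requested_chunks != existing_chunks:
        return ("You already have data in the object store "
                "and you can't rechunk it")
    return None

stored = {"time_counter": 1, "x": 720, "y": 360}
assert rechunk_warning(stored, stored) is None              # no warning
assert rechunk_warning(stored, {"time_counter": 31}) is not None
```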
The time_counter non-chunking is likely an xarray bug. This looks relevant: https://github.com/pydata/xarray/issues/6204
Thanks for pointing me to this link. Given that, I think the only ways to solve this problem now are to rename the coordinate (which I don't recommend), or to use the zarr library (rather than xarray) for the first upload of the data. In that case, right after uploading the data I would open the time_counter variable and rechunk it to the chosen strategy; data appended later will automatically follow the chunk strategy defined in the first upload. I will add this information to the new issue #14
Version 0.1.1 seems to be working much better in initial tests. I'm just writing a series of monthly 1-day means to the same bucket with successive calls using identical arguments (except for the input filename). I'm getting this warning on the second and subsequent calls:
☁ msm_os ☁ | WARNING | 2024-09-13 15:49:49 | You already have data in the object store and you can't rechunk it
which seems unwarranted, since the chunk settings haven't changed?