cchwala opened 2 years ago
Info on what a local `conda develop cml_data_explorer/` produced:

```
/Users/chwala-c/mambaforge/lib/python3.8/site-packages/requests/__init__.py:102: RequestsDependencyWarning: urllib3 (1.26.8) or chardet (5.0.0)/charset_normalizer (2.0.12) doesn't match a supported version!
  warnings.warn("urllib3 ({}) or chardet ({})/charset_normalizer ({}) doesn't match a supported "
added /Users/chwala-c/code/OPENSENSE_sandbox/cml_data_explorer
completed operation for: /Users/chwala-c/code/OPENSENSE_sandbox/cml_data_explorer
```
Datashading for time series added in https://github.com/cchwala/cml_data_explorer/commit/a2b5777529a899e82c0adb9ddf9025bdfea12bc0

Note: have to add datashader to the env.
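For reference, a minimal sketch of what datashaded time-series plotting can look like with hvplot (not necessarily what the commit above does; the file name `openmrg_cmls.nc` and the dims `cml_id`/`channel_id` are assumptions based on the OpenMRG-style dataset used here):

```python
import xarray as xr
import hvplot.xarray  # registers the .hvplot accessor

ds = xr.open_dataset("openmrg_cmls.nc")  # hypothetical file name

# Render one CML's received signal level as a datashaded line plot,
# so the full-resolution time series is rasterized server-side instead
# of sending millions of points to the browser.
ds.rsl.isel(cml_id=0, channel_id=0).hvplot.line(
    x="time",
    datashade=True,
)
```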
I am still fighting the problems of running out of memory on binder when working with the OpenMRG dataset. Since its NetCDF is not chunked one cannot quickly iterate through the CMLs. Loading all data into RAM is not an option on binder.
I now explored the option of writing data from the original NetCDF piece by piece into a chunked zarr store (zarr is fast, and appending to an existing store is easy) with this code:
```python
N_slice_length = 50
for i in range(0, len(ds.cml_id), N_slice_length):
    print(f'indexing data for CML_ID index from {i} to {i + N_slice_length}')
    ids = ds.cml_id.isel(cml_id=slice(i, i + N_slice_length))
    print(f'covered CML_IDs range from {ids[0].values} to {ids[-1].values}')
    print('loading data... (this will take approx. 10 to 20 seconds)')
    # load only this slice of CMLs into RAM, then rechunk to one CML per chunk
    ds_subset = ds.isel(cml_id=slice(i, i + N_slice_length)).load()
    ds_subset = ds_subset.chunk({'cml_id': 1})
    # drop the 'missing_value' attrs, which otherwise cause trouble when writing to zarr
    ds_subset.rsl.attrs.pop('missing_value')
    ds_subset.tsl.attrs.pop('missing_value')
    ds_subset.polarization.attrs.pop('missing_value')
    # create the store on the first iteration, append along cml_id afterwards
    if i == 0:
        ds_subset.to_zarr('foo.zarr')
    else:
        ds_subset.to_zarr('foo.zarr', append_dim='cml_id')
    del ds_subset
    print('')
```
This works fine and takes only a few minutes. Afterwards one can open the zarr store and work with a Dataset that is chunked by cml_id on disk and accessed lazily via dask.
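Opening the store then looks like this (using the file name `foo.zarr` from the snippet above):

```python
import xarray as xr

# lazily open the chunked store; each cml_id is one dask chunk on disk
ds = xr.open_zarr('foo.zarr')
```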
However, when doing computations I still run into problems. Calculations are still slow, and doing things like

```python
ds['tl'] = ds.tsl - ds.rsl
```

which would speed up plotting, makes binder run out of memory.
But I think I can now come up with a somewhat usable solution.
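For illustration, a minimal sketch of one way this could be kept out of RAM (untested on binder): keep the subtraction lazy and append the result to the zarr store, so dask processes one cml_id chunk at a time:

```python
import xarray as xr

ds = xr.open_zarr('foo.zarr')

# the subtraction stays lazy (dask), so nothing is loaded yet
ds['tl'] = ds.tsl - ds.rsl

# writing only the new variable back to the store processes one chunk
# (i.e. one cml_id) at a time, which should keep RAM usage low
ds[['tl']].to_zarr('foo.zarr', mode='a')
```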
TODO: add zarr to the env definition in the env repo.
For the record: because I have been dealing with killed kernels due to memory consumption in this issue, and others have experienced the same thing (see #37), here is a link to a relevant article:
Detecting CPU and RAM limits on mybinder.org
The problem seems to be that one is restricted to 1 CPU and limited RAM in a binder pod, but when requesting info on the available resources one gets what is physically available on the machine that hosts the pod. Hence, calculations or plotting cause OOM kills (OOM = out of memory) because Python (or the packages used) thinks it can comfortably increase memory usage further, but then the kernel dies, e.g. at 2 GB RAM usage.
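One way to read the limits that the pod actually enforces is via the cgroup files (a sketch assuming cgroup v1, which is what the article above is about; paths differ under cgroup v2):

```python
# Read the memory and CPU limits enforced on the container,
# instead of what e.g. psutil reports for the host machine.
with open('/sys/fs/cgroup/memory/memory.limit_in_bytes') as f:
    mem_limit_bytes = int(f.read())

with open('/sys/fs/cgroup/cpu/cpu.cfs_quota_us') as f:
    quota = int(f.read())
with open('/sys/fs/cgroup/cpu/cpu.cfs_period_us') as f:
    period = int(f.read())

n_cpus = quota / period if quota > 0 else None  # quota of -1 means "no limit"
print(f'memory limit: {mem_limit_bytes / 1e9:.2f} GB, CPU limit: {n_cpus}')
```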
Update: I made a new issue for that, see #46
Notebook works with the current commit on binder.
Add interactive CML data explorer example, see #28

TODO
- related to the sandbox repo:
- related to changes required in the cml_data_explorer code: