OpenSenseAction / OPENSENSE_sandbox

Collection of runnable examples with software packages for processing opportunistic rainfall sensors
BSD 3-Clause "New" or "Revised" License

[WIP] Add cml data explorer #44

Open cchwala opened 2 years ago

cchwala commented 2 years ago

Add interactive CML data explorer example, see #28

TODO

related to sandbox repo:

related to changes required in cml_data_explorer code:

github-actions[bot] commented 2 years ago

Binder :point_left: Launch a binder notebook on branch cchwala/OPENSENSE_sandbox/add_cml_dataexplorer

cchwala commented 1 year ago

For the record, here is what a local conda develop cml_data_explorer/ produced:

```
/Users/chwala-c/mambaforge/lib/python3.8/site-packages/requests/__init__.py:102: RequestsDependencyWarning: urllib3 (1.26.8) or chardet (5.0.0)/charset_normalizer (2.0.12) doesn't match a supported version!
  warnings.warn("urllib3 ({}) or chardet ({})/charset_normalizer ({}) doesn't match a supported "
added /Users/chwala-c/code/OPENSENSE_sandbox/cml_data_explorer
completed operation for: /Users/chwala-c/code/OPENSENSE_sandbox/cml_data_explorer
```
cchwala commented 1 year ago

Datashading for time series added in https://github.com/cchwala/cml_data_explorer/commit/a2b5777529a899e82c0adb9ddf9025bdfea12bc0
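For illustration, a minimal sketch (not the code from the linked commit) of what datashaded time series plotting with hvplot can look like; the variable ds and the dimension names cml_id and time are assumptions based on the OpenMRG snippets further down.

```python
import hvplot.xarray  # noqa: F401  # registers the .hvplot accessor on xarray objects

# Hedged sketch: render one CML's received signal level as a datashaded time
# series, so long records can be shown without sending every sample to the browser.
da = ds.rsl.isel(cml_id=0)  # 'ds' is assumed to be the opened CML dataset
plot = da.hvplot(x='time', datashade=True)
plot
```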

cchwala commented 1 year ago

note: have to add datashader to env

cchwala commented 1 year ago

I am still fighting the problem of running out of memory on binder when working with the OpenMRG dataset. Since its NetCDF is not chunked, one cannot quickly iterate through the CMLs, and loading all data into RAM is not an option on binder.

I have now explored writing the data from the original NetCDF piece by piece into a chunked zarr store (because zarr is fast and appending to an existing store is easy), using this code:

```python
N_slice_length = 50

for i in range(0, len(ds.cml_id), N_slice_length):
    print(f'indexing data for CML_ID index from {i} to {i + N_slice_length}')
    ids = ds.cml_id.isel(cml_id=slice(i, i + N_slice_length))
    print(f'covered CML_IDs range from {ids[0].values} to {ids[-1].values}')
    print('loading data... (this will take approx. 10 to 20 seconds)')
    # Load one slice of CMLs into RAM and rechunk it to one chunk per CML
    ds_subset = ds.isel(cml_id=slice(i, i + N_slice_length)).load()
    ds_subset = ds_subset.chunk({'cml_id': 1})
    # Drop 'missing_value' attrs, which otherwise cause problems when writing to zarr
    ds_subset.rsl.attrs.pop('missing_value')
    ds_subset.tsl.attrs.pop('missing_value')
    ds_subset.polarization.attrs.pop('missing_value')
    # Create the store on the first iteration, append along cml_id afterwards
    if i == 0:
        ds_subset.to_zarr('foo.zarr')
    else:
        ds_subset.to_zarr('foo.zarr', append_dim='cml_id')
    del ds_subset
    print('')
```

This works fine and takes only a few minutes. Afterwards one can open the zarr store and work with a Dataset that is chunked by cml_id on disk and accessed lazily via dask.
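A minimal sketch of how the resulting store can then be opened lazily (the store name foo.zarr is just the placeholder used above):

```python
import xarray as xr

# Open the chunked zarr store lazily; dask only reads the chunks that are
# actually accessed, so a single CML can be loaded without touching the rest.
ds = xr.open_zarr('foo.zarr')
ds_one_cml = ds.isel(cml_id=42).load()  # reads only this CML's chunk from disk
```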

However, when doing computations, I still run into problems. Calculations are still slow:

[Screenshot 2022-11-22 at 14:50:02]

And doing things like

```python
ds['tl'] = ds.tsl - ds.rsl
```

which would speed up plotting, will make binder run out of memory.
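A possible workaround (a hedged sketch, not necessarily what will end up in the PR) would be to compute the total loss on demand only for the CML that is currently displayed, instead of adding a full tl variable to the dataset:

```python
# Hypothetical helper: derive the total loss (tsl - rsl) for a single CML,
# so only that CML's chunk is ever loaded into memory.
def total_loss_for_cml(ds, cml_index):
    ds_one = ds.isel(cml_id=cml_index)
    return (ds_one.tsl - ds_one.rsl).load()
```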

But I think I can now come up with a somewhat usable solution.

cchwala commented 1 year ago

todo: add zarr to env definition in env repo

cchwala commented 1 year ago

For the record, because I have been dealing with killed kernels due to memory consumption in this issue, and others have experienced the same thing (see #37), here is a link to an article:

Detecting CPU and RAM limits on mybinder.org

The problem seems to be that one is restricted to 1 CPU and limited RAM in a binder pod, but when requesting info on the available resources, one gets what is physically available on the machine that hosts the pod. Hence, calculations or plotting will cause OOM kills (OOM = out of memory) because Python (or the packages used) thinks it can comfortably increase memory usage further, but then the kernel dies, e.g. at 2 GB RAM usage.

Update: I made a new issue for that, see #46
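For completeness, a rough sketch of how the actual container memory limit could be detected from inside a binder pod; this assumes a cgroup v1 host (the path differs under cgroup v2) and is not taken from the article above:

```python
from pathlib import Path

# Hedged sketch: read the container's memory limit from the cgroup (v1) file
# instead of asking psutil, which reports the host machine's physical RAM.
limit_file = Path('/sys/fs/cgroup/memory/memory.limit_in_bytes')
if limit_file.exists():
    limit_gb = int(limit_file.read_text()) / 1e9
    print(f'container memory limit: {limit_gb:.1f} GB')
else:
    print('cgroup v1 memory limit file not found')
```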

cchwala commented 1 year ago

The notebook works with the current commit on binder.