fix OpenMRG data transformer and extend use case

eoydvin commented 1 year ago

There was a problem with the data transformation of the OpenMRG code (merged in #56) which led to wrong combinations of CML time series and metadata. This will be fixed in this PR.

Additionally, the use case example notebook will be extended.

Edit (by @cchwala ):

fixes #16
fixes #60

github-actions[bot] commented 1 year ago

:point_left: Launch a binder notebook on branch _eoydvin/OPENSENSEsandbox/main

cchwala commented 1 year ago

Do I understand correctly that the current state of this PR does do all transformation of the OpenMRG dataset in the notebook to create the small example file?

That would mean that large parts of the code in opensense_data_downloader_and_transformer.py will be completely renewed. Would be good if you move the code over there. But, from now on we should really aim to first do #22 so that we can correctly add version numbers. Only when we have versioned data_transformer functions we should use them to create an OpenMRG NetCDF with the OPENSENSE convenctions so that the colleagues can update it on zenodo. It has to be perfectly 100% clear how the transformation was done.

cchwala commented 1 year ago

As we found yesterday, the rainfall fields in the latest version had a bit lower rainfall rates compared to one of the intermediate version which looked good, too.

Based on this time series that you show in the example notebook, I suspect that you forgot to interpolate very short gaps in the current version. Hope that brings back some of the peaks.

maxmargraf commented 1 year ago

Based on this time series that you show in the example notebook, I suspect that you forgot to interpolate very short gaps in the current version.
Indeed, we did not interpolate gaps up to 5 min in the latestet appraoch to reshape the data with xarray. @eoydvin, can add following lines to the create_small_openMRG_example.ipynb:
ds["tsl"] = ds.tsl.interpolate_na(dim="time", method="linear", max_gap="5min")
ds["rsl"] = ds.rsl.interpolate_na(dim="time", method="linear", max_gap="5min")

eoydvin commented 1 year ago

I moved the new transformation code to the default option in opensense_data_downloader_and_transformer.py. I also fixed the old OpenMRG transformation code, but decided to use the new one as it is faster and easier to understand. The old transformation code still seem relevant as it requires less memory.
I also added some tests to create_small_openMRG_example.ipynb to see that the transformation functions does the correct thing.
Interpolating very short gaps indeed seem to return the large peaks, see updated use case file.
I also added an animated comparison between radar and interpolated CML, and I think the results looks more reasonable now. See the end of openMRG_use_case. It could be that this animation should be part of another PR.

maxmargraf commented 1 year ago

Thanks for this effort @eoydvin! Some comments:

create_small_openMRG_example.ipynb:

the import of the opensense_data_downloader_and_transformer.py in Explore_existing_datasets.ipynb does not work as the oddt file lives in a parent dir. I suggest either something like a sys.path import, or to move the oddt also to the data subdir (changing the import in the other notebooks). This second method would leave the notebook dir a bit cleaner and therefore IMO the nicer solution here.
we could argue over ds = ds.resample(time = '1T', label = 'right').mean() in the create_small_openMRG_example. ipynb to derive 1 min data from openMRG, i.e. I would prefer to use instantaneous 1 min values, but that is a minor detail.

openMRG_use_case.ipynb

The exendance of vmax in the animation with the radar comparison returns white spots, this could be changed by an extend='both' argument.
I really like the additional showcase with radar data in the use case notebook and it is good to see that the maps turn out so nicely even in the high temporal resolution! For the use case we should however try to reduce the length and complexity of the notebook e.g. by 1) removing the last cell with the cartopy map (cartopy is not included in the env yaml right now anyway). 2) Interpolating the data only once and then showing only one animation (IMO the latere one with the radar comparison) 3) Preparing the radar data already in create_small_openMRG_example. ipynb.

After thinking twice on the use case an alternative could be that we split the notebook in half, dubbing the first half 'simple use case' and the second half, 'extended use case with comparison to radar data' and let it all live there.

cchwala commented 1 year ago

Thanks @eoydvin for the additional work and thanks @maxmargraf for the detailed comments. I have to catch up a bit and will provide my feedback soon.

In general, splitting up the use case, sounds like a good idea, i.e. one very basics example and a more complex one. But maybe this could also be one notebook, just with a more detailed part at the bottom.

First, it is most important to have one correct use case ;-)

cchwala commented 1 year ago

@eoydvin Please note that the radar-CML animation in the openMRG_use_case does not work because it relies on the radar NetCDF which is downloaded in other notebooks:

FileNotFoundError: [Errno 2] No such file or directory: '/home/jovyan/OPENSENSE_sandbox/notebooks/data/andersson_2022_OpenMRG/radar/radar.nc'

eoydvin commented 1 year ago

Thanks for the feedback! Some notes:

I left the opensense_data_downloader_and_transformer in its current directory as other notebooks might rely on it. Do I understand it correctly that it will be moved to a new repository for version tracking later?
I agree that resampling should be done using only one sample from each 5 min. This is fixed.
Now that there is good agreement between high resolution, should we change the simple use case interpolation to be in minutes instead of hours? Or will that be too much for a simple use case?
I have split the use case into a simple and a extended case. Should each case be able to run without the other? As it is now the simple case must be run before the extended case.
I downloaded a small piece of radar data (see create_small_openMRG_example) to the repository and loaded it to the extended use case. That should fix the FileNotFoundError.

maxmargraf commented 1 year ago

Thanks for the commit! I tried several times to launch the binder, unsuccessfully. Therefore, I did not test this commit yet.

Do I understand it correctly that it will be moved to a new repository for version tracking later?

For this sandbox, the oddt code will stay there IMO.

should we change the simple use case interpolation to be in minutes instead of hours?

I would leave this in an hourly base covering a whole day

Should each case be able to run without the other?

Having it in one notebook, with some explanations is fine for me.

eoydvin commented 1 year ago

For me the binder loads quickly, but from the main branch without the new transformation code. I think I remember it was able to load from my branch before..? The binder URL seem to point to the main branch.

cchwala commented 1 year ago

The binder URL seem to point to the main branch.

Note that the URL points to the main branch of the separate env repo, but the state of the branch of this PR is then pulled in when running on binder, see #31

I also could not start binder from this PR yesterday evening... Now it worked within some seconds...

cchwala commented 1 year ago

I ran the notebook. The processing looks good now. And the final animation is really great. The animation code cells are very complex, though.

Three minor things that could be improved:

Add two or three sentences below each header, e.g. explaining that the following cell will do complex things in matplotlib which are only needed for visualisation and are not important to understand CML rainfall estimation.
Look for typos, e.g. in the second heading the first "data" has to be deleted
If possible, without spending a full day on it (😇), it would be great to not squash the CML rainfall map when adding the colorbar in the final animation. A solution could be based on info from the matplotlib docs or from this stackoverflow response. Since the final animation looks so good, I would add it as gif on top of the READEM as a teaser for the sandbox content. Hence, spending some more time on it would be worth it.

For the rest, I agree with what @maxmargraf wrote above

eoydvin commented 1 year ago

Ok, I will attempt at cleaning the code and write a user-friendly explanation. I did not quite get if we should move the transformation function up one level to the data directory or not?

cchwala commented 1 year ago

I did not quite get if we should move the transformation function up one level to the data directory or not?

I would leave it where it is now. As you wrote somewhere above, some other notebooks expect it to be there. And everything in the sandbox is bit messy anyway. Soon we should do #22 and then things will be cleaned up anyway 🤞

cchwala commented 1 year ago

Thanks for yet another update! Sounds like this could be the last commit in this PR 👍

Unfortunately, again binder does not start for me... I will check it out tomorrow morning and then finally merge!

cchwala commented 1 year ago

One tiny detail. For the first animation the title of the plot has to be set like this

title.set_text("{}".format(str(cmls_R_1h.isel(time=n).time.values)[:19]))

and not using cmls.isel....

cchwala commented 1 year ago

There is also a problem with the data exploration notebook, see #64, but we fix that later. It might depend a bit on what kind of binder pod you get...

cchwala commented 1 year ago

I took another look. Everything is great now!

Thanks a lot @eoydvin for fixing the error and for taking the chance to improve and extend the usecase notebook! Also thanks to @maxmargraf for help with tracking down the error and for feedback in this PR! 👏 👏 👏

I am merging this now 🎉

OpenSenseAction / OPENSENSE_sandbox

fix OpenMRG data transformer and extend use case #62