COSIMA / cosima-recipes

A cookbook of recipes (i.e., examples) for analysing ocean and sea ice model output
https://cosima-recipes.readthedocs.io
Apache License 2.0

Converting notebooks from COSIMA Cookbook to ACCESS-NRI intake catalog #313

Open · navidcy opened this issue 10 months ago

navidcy commented 10 months ago

Why do we need to have two ways for people to load output? At the moment, after #298, we have tutorials for both the cookbook and the ACCESS-NRI Intake catalog:

https://cosima-recipes.readthedocs.io/en/latest/tutorials.html

Let's just keep the better of the two. Is the ACCESS-NRI Intake catalog the future? If so, let's make it the default in all examples and drop the cookbook?

cc @angus-g, @aidanheerdegen, @AndyHoggANU, @aekiss, @adele-morrison, @anton-seaice, @PaulSpence, @rmholmes, @edoddridge, @micaeljtoliveira


navidcy commented 10 months ago

#310 seems relevant

aidanheerdegen commented 10 months ago

Largely this is up to the COSIMA community, but it is likely a choice you will be forced to make for a few reasons:

  1. The Cookbook Database does not scale well, particularly in the time taken to index data
  2. The infrastructure which runs the indexing will likely cease to exist by the end of the first quarter of 2024 (the accessdev Virtual Machine is scheduled to be shuttered, and this is what provides the outward-facing web address for the Jenkins service)
  3. There are no resources to do more than just maintain the Cookbook DB (@micaeljtoliveira does what he can, but he has many other high priority tasks)
  4. ACCESS-NRI has chosen to support the Intake Catalogue (in part for some of the reasons above)

So I'd suggest taking this to a COSIMA meeting and making the community aware that a decision needs to be made, and probably the sooner the better.

One downside of dropping the Cookbook is that there is currently no GUI for searching the Intake Database. However, there are some powerful methods for searching (filtering) the intake catalogue:

https://access-nri-intake-catalog.readthedocs.io/en/latest/usage/quickstart.html#data-discovery
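
For instance, filtering the catalogue by variable and frequency looks roughly like this (a minimal sketch following the linked quickstart; the variable and frequency values are just illustrative):

import intake

# load the ACCESS-NRI catalogue (assumes an environment with the catalogue installed)
catalog = intake.cat.access_nri

# filter the catalogue for experiments containing a given variable at monthly frequency
results = catalog.search(variable="temp", frequency="1mon")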

Also, ACCESS-NRI has plans for some tools to assist with data discovery, and we're working on prototypes we hope to share soon.

navidcy commented 10 months ago

Thanks @aidanheerdegen! I'm happy to hear voices from others here too and then take it to a COSIMA meeting -- sounds good!

adele-morrison commented 10 months ago

The advice we had from ACCESS-NRI on this back around the workshop last year was that, at that time, they didn't advise switching everything to the intake catalogue, because it was so new and untested. But @aidanheerdegen @dougiesquire, if you think the intake catalogue is ready for us to switch everything over completely, I think we should do that. We discussed this previously at a COSIMA meeting and agreed we should switch when the time was right (which is maybe now?).

anton-seaice commented 10 months ago

Not much to add from me:

dougiesquire commented 10 months ago

I think @max-anu has been modifying the existing recipes to use the catalog. It would be good to get an update from him. There are definitely datasets that are in the cookbook but not yet in the catalog. I think @max-anu will probably have the best idea of what these are.

aidanheerdegen commented 10 months ago

@anton-seaice makes a good point about retaining existing capability for those who rely on it, identifying deficiencies and setting up a new testing infrastructure.

@dougiesquire is also correct that datasets that are missing from intake should be identified.

@adele-morrison I'm not in a position to say if the intake catalogue is ready or not, but we're rapidly approaching a point where there isn't really an option, so we need to make sure it is.

So, synthesising from above, some steps to consider:

Might be good to break this out into a GitHub project, make some issues, etc. I would do this, but I'm not in a position to assist much, I'm afraid.

Thomas-Moore-Creative commented 10 months ago

Could the culture be shifted going forward so that, when someone or a group runs an experiment that is meant to be open to the community, creating an Intake-ESM datastore is part of the process of making it available and documented? My view is that the "authors" of specific experiments are best placed to create and update datastores for the catalog.

@dougiesquire appears to have done a great job documenting how that can be done for those who've never built a catalog before: I want to catalog my own data to make it easy to find and load
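
For anyone who hasn't built one before, the gist of that guide is roughly the following (a rough sketch only; the builder class and save arguments are taken from the access_nri_intake docs and should be checked against the current version, and the paths and names are placeholders):

from access_nri_intake.source.builders import AccessOm2Builder

# point a builder at the experiment output directory (placeholder path)
builder = AccessOm2Builder(path="/path/to/my/experiment/output")

# scan the output files and assemble the Intake-ESM datastore
builder.build()

# write out the datastore (a .json + .csv pair) somewhere findable
builder.save(
    name="my_experiment",          # placeholder datastore name
    description="1 deg ACCESS-OM2 run with ...",  # placeholder description
    directory="/path/to/save/datastore",
)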

This new catalog is a big step forward, IMO, but it will require a community effort to maintain.

rbeucher commented 10 months ago

Hi All,

So, from the MED team's perspective: I asked @max-anu to look at converting the recipes to use the intake catalog. The main reason for going ahead with this was precisely to identify the gaps mentioned above (missing functionalities, datasets, etc.) and to push for broader use of the intake catalog. @max-anu should be able to give us an update.

#310 is supposed to be a draft and everyone is welcome to participate and/or comment on it.

Happy to discuss the plan going ahead.

rbeucher commented 10 months ago

I have started a repo on the ACCESS-NRI organisation to automate testing of the recipes: https://github.com/ACCESS-NRI/COSIMA-recipes-workflow

This is based on what I have set up for the ESMValTool recipes:

https://github.com/ACCESS-NRI/ESMValTool-workflow?tab=readme-ov-file#recipes-current-status

Will keep you posted.

access-hive-bot commented 7 months ago

This issue has been mentioned on ACCESS Hive Community Forum. There might be relevant details there:

https://forum.access-hive.org.au/t/workshop-on-community-driven-practices-for-analysing-coding-and-sharing-cosima-hackathon-v4-0/2047/6

dougiesquire commented 6 months ago

> Will keep you posted.

@rbeucher do you have any updates? It would be great to get an update on where @max-anu has got to so we can try to progress this in the upcoming hackathon.

rbeucher commented 6 months ago

@max-anu may have some updates.

AndyHoggANU commented 5 months ago

I've had a quick look at this and it seems that many of the changes @max-anu made are in the max-updating-notebooks branch. So, to a large degree, this task depends on taking what has already been done. It would be good to do that in a way which retains Max's contributions. Does anybody know the best way to do that?

My only suggestion is for others to check the existing branch and then to merge?

rbeucher commented 5 months ago

It's possible to split @max-anu's work into multiple PRs if that helps for the workshop. We can do one PR per notebook.

AndyHoggANU commented 5 months ago

Yeah, I think that would be perfect!

navidcy commented 5 months ago

Agree!!

adele-morrison commented 5 months ago

Maybe this could be good practice for the group doing PR review training to hone their skills on during the hackathon?

edoddridge commented 5 months ago

> It's possible to split @max-anu's work into multiple PRs if that helps for the workshop. We can do one PR per notebook.

@rbeucher - you are much braver at messing about with git than I am. Are you envisioning cloning the branch multiple times and then resetting all but one of the files on each branch? 😬

navidcy commented 5 months ago

> Maybe this could be good practice for the group doing PR review training to hone their skills on during the hackathon?

Sure. But a bunch of smaller PRs would be even better, and more manageable in general. Even the “PR review master” might have trouble dealing with a PR for which they need to run all the notebooks.

adele-morrison commented 5 months ago

Oh yes, that's what I meant. A bunch of small PRs, and then each person in the training could work on reviewing one each.

rbeucher commented 5 months ago

I'll have a look today or tomorrow.

rbeucher commented 5 months ago

That's one way to do it, @edoddridge. I want to keep @max-anu's contributions. I'll have a look and report back to you.

navidcy commented 5 months ago

I believe there is a way to keep @max-anu's commits in each file! Let's do it that way please...

rbeucher commented 5 months ago

OK, so a bunch of PRs, as you can see... sorry for spamming you. Each of them contains only one notebook with the modifications from @max-anu. I have kept the authorship.

navidcy commented 5 months ago

Thanks @rbeucher!

I was away from my desk and came back to 100 emails and got terrified! But all good -- my heart rate is back to resting levels now.

anton-seaice commented 5 months ago

I reviewed 3 of the PRs yesterday and they all need(ed) some work before they're ready (e.g. bringing them up to date with current main, fixing inconsistencies).

I suggest we:

anton-seaice commented 4 months ago

For all the Intake catalogue PRs, we should

navidcy commented 4 months ago

There are a bunch of PRs now open, but from the aftermath of the hackathon I got the sense that they are not quite ready and need quite a lot of fiddling to work. I got the vibe that people felt it would be easier to make the conversion from scratch than to try to fix a PR that didn't work....

I am gonna close them... I'll give it a few days so that anybody who has other opinions on how to move forward can post here.

AndyHoggANU commented 4 months ago

I would advocate for not closing them. I think people started working on them, and we should continue to work and finish them all off! It won't happen overnight, but let's try to do what we can...

adele-morrison commented 4 months ago

I am with Navid. Because I know nothing about the intake catalog, I would find it much easier to start from scratch and do it myself (while learning something), rather than battling with PRs that have a bunch of errors and may be building on very outdated notebooks.

But very happy if NRI folks @anton-seaice @dougiesquire etc want to take these on instead and work from the existing PRs. Also note that at the hackathon we agreed these PRs should each have 2 reviewers, because so many changes are needed from what has been submitted.

AndyHoggANU commented 4 months ago

OK, let's leave them open for a few weeks and see if people self-assign? Agree with 2 reviewers!

navidcy commented 4 months ago

OK. Happy to leave them open for a bit and gauge the vibe!

Just to clarify: I'm all in for the conversion. I just heard people complaining that it was hard work for them to understand what wasn't working in these PRs and fix it, and that they would prefer to do it from scratch after having watched/gone through an Intake tutorial. (The sentiment that @adele-morrison shares above!) So my only argument for closing them is to spare people the trouble (and potentially the trauma) that would make them less inclined to contribute to the recipes later...

navidcy commented 4 months ago

Question: do we always need to load pandas with intake? In the PRs I often see import pandas as pd even though it wasn't imported before...

navidcy commented 4 months ago

The Intake tutorial does not load pandas anywhere, yet all the PRs include import pandas as pd.

I may be naïve but I don't see why. Does anybody have an idea? @rbeucher, @anton-seaice, @dougiesquire?

dougiesquire commented 4 months ago

> Question: do we always need to load pandas with intake? In the PRs I often see import pandas as pd even though it wasn't imported before...

@navidcy there's no requirement to import pandas

navidcy commented 4 months ago

Makes sense. Why do all the automatically-generated PRs add pandas then? I mean this could be a rhetorical question... Don't feel obliged to answer it.

(It actually confused me, and I was inferring that I don't know Intake well enough...)

dougiesquire commented 4 months ago

I'm not sure sorry - @max-anu generated those. But the imports can be removed if pandas isn't explicitly used anywhere.

rbeucher commented 4 months ago

Hi @navidcy,

The idea behind generating those PRs was to have a starting point and use the work from @max-anu.

I didn't get the opportunity to review them, and I chose not to rebase the code onto main.

Yes, they do need work, and cleaning.

I don't think closing the PRs before the conversion is done is a good idea.

Thomas-Moore-Creative commented 4 months ago

I don't know about all the PRs, or if this is helpful, but for #372 here is a list of the changes made: https://github.com/COSIMA/cosima-recipes/pull/372#issuecomment-2199383566

(mostly thanks to guidance above from @anton-seaice)

julia-neme commented 4 months ago

I was working on switching my recipes to using intake and I have two comments:

  1. I think that not being able to add start_time and end_time as arguments when opening a variable is inefficient. Is this a feature that could be added to intake?
  2. The catalog does not have any panant experiments indexed, so it won't work for any of the notebooks that do things model-agnostically yet. Can these experiments be added too?

anton-seaice commented 4 months ago

Thanks Julia! It's good people are getting stuck into it.

> 1. I think that not being able to add start_time and end_time as arguments when opening a variable is inefficient. Is this a feature that could be added to intake?

@dougiesquire can explain this better than I can. There are two things that come to mind (explained more in this tutorial).

Running the .to_dask() command doesn't load the data into memory yet; it's loaded "lazily", so it just has the dimensions, coordinates and metadata available, and won't load the actual data until a calculation is performed. This allows the times needed for the analysis to be selected, and once the calculation is performed only data from those times should be loaded, i.e. like this:

data_ic = catalog[experiment].search(
    variable=variable, 
    frequency="1mon"
).to_dask()

data_ic_times = data_ic.sel(time=slice(start_time, end_time))

# ... do the calculation

If that to_dask step is slow, it may be due to there being many files to concatenate. You can make your instance bigger, make sure you start a dask cluster (with threads_per_worker=1; see the sketch below) before running to_dask, and add these keywords to the .to_dask() call:

xarray_combine_by_coords_kwargs = dict(
    compat="override",
    data_vars="minimal",
    coords="minimal"
)

catalog[experiment].search(
    variable=variable, 
).to_dask(
    xarray_combine_by_coords_kwargs=xarray_combine_by_coords_kwargs,
)

These keywords reduce the number of consistency checks xarray does when concatenating the netCDF files. We don't really need the checks because we know the model data is output on a consistent grid, with the same variables in every file, etc.
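
For the dask cluster mentioned above, the setup is just standard dask.distributed; a minimal sketch (the worker sizing depends on your session):

from dask.distributed import Client

# start a local cluster before calling .to_dask();
# one thread per worker tends to behave better for netCDF reads
client = Client(threads_per_worker=1)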

> 2. The catalog does not have any panant experiments indexed, so it won't work for any of the notebooks that do things model-agnostically yet. Can these experiments be added too?

We did start digging into that here ... if there are specific experiments you think are worth adding, can you add them to that thread please?

navidcy commented 4 months ago

I'm hitting this issue that @julia-neme mentions above!

I need to load 500 years of data just to slice 2 months of output... seems a bit over the top?

Just adding another voice here in the struggles. Nothing more to add... :)

navidcy commented 4 months ago

@anton-seaice I'm not sure exactly what you are suggesting (not sure if it's a bit of an esoteric dialogue with @dougiesquire, in which case I can ignore it).

But before, we could avoid loading 500 years of output with 2 keywords. What's the equivalent way to do it now? We need to add an example in the Intake tutorial -- sorry if this is there and @julia-neme and I seem to have missed it...

julia-neme commented 4 months ago

I've tested how much time it takes with the cookbook, intake without kwargs, and intake with kwargs (https://gist.github.com/julia-neme/80a277a0f877c044b2a83a097bba744c).

The cookbook is significantly faster and the kwargs don't make much improvement :( But maybe there's a smarter way to set them up that I'm not aware of.

navidcy commented 4 months ago

Thanks @julia-neme

Hm...........

That's 20x slowdown.....

I'm feeling a bit uneasy about Intake now....

adele-morrison commented 4 months ago

Great job testing, @julia-neme! Any way we can fix this, @dougiesquire @anton-seaice?

dougiesquire commented 4 months ago

Thanks all for the feedback. Yeah, unfortunately intake-esm doesn't currently provide the ability to filter based on time ranges, so the full dataset is always opened (lazily). When the user actually only wants a small temporal subset of the data, this will add a small additional overhead.

A couple of things to note:

> That's 20x slowdown.....

Regarding @julia-neme's timing tests, one has to be careful when checking the time taken to open xarray objects, since the data is cached. I couldn't reproduce Julia's results, and my guess is that the cell calling cc.querying.getvar had already been run in Julia's notebook, so the data was cached. If I run Julia's notebook fresh on a full Cascade Lake node (48 cores) I get the following timings:

So slower, of course, but not a factor of 20x. The difference will obviously also reduce with the size of the time range needed.

I should also point out that there is an open issue and PR with intake-esm to add the ability to filter based on time ranges. I think we'll have more resources soon to point at the ACCESS-NRI Intake catalog and this could be something we take on.

navidcy commented 4 months ago

Thanks for this @dougiesquire.

2x slower is just fine of course!

I've seen that @julia-neme was bumping into more (related?) issues; see https://github.com/COSIMA/cosima-recipes/pull/363#issuecomment-2242157082

julia-neme commented 4 months ago

Probably my issues are worsened because I am working with small ARE sessions, or medium. I don't really want to open a full node that I'd only need for the loading, but not really for any of the computations after. Specially if I can open data via some other, less costly way (i.e. xr.open_mfdataset, cookbook). To me it seems a waste of resources? But maybe it's not? What do people think?