navidcy opened this issue 10 months ago
Largely this is up to the COSIMA community, but it is likely a choice you will be forced to make for a few reasons (e.g. the `accessdev` Virtual Machine is scheduled to be shuttered, and this is what provides the outward-facing web address for the jenkins service). So I'd suggest taking this to a COSIMA meeting and making the community aware that a decision needs to be made, and probably the sooner the better.
One downside of dropping the Cookbook is that there is currently no GUI interface for searching the Intake database. However, there are some powerful methods for searching (filtering) the intake catalogue:
https://access-nri-intake-catalog.readthedocs.io/en/latest/usage/quickstart.html#data-discovery
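Following that quickstart, filtering looks something like this (the experiment and variable names below are just placeholders, so adapt them to your case):

```python
# Search/filter the ACCESS-NRI Intake catalogue from Python.
# Experiment and variable names below are placeholders.
import intake

catalog = intake.cat.access_nri                       # the ACCESS-NRI catalogue
experiments = catalog.search(variable="temp")         # which experiments contain this variable?

esm_datastore = catalog["01deg_jra55v13_ryf9091"]     # pick one experiment (placeholder name)
subset = esm_datastore.search(variable="temp", frequency="1mon")
subset.df.head()                                      # table of matching files/assets
```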
Also there are plans for some tools to assist with data discovery from ACCESS-NRI, and we're working on prototypes we hope to share soon.
Thanks @aidanheerdegen! I'm happy to hear voices from others here too and then take it to a COSIMA meeting -- sounds good!
The advice we had from ACCESS-NRI on this back around the workshop last year was that, at that time, they didn't advise switching everything to the intake catalogue, because it was so new and untested. But @aidanheerdegen @dougiesquire, if you think the intake catalogue is ready for us to switch everything over completely, I think we should do that. We discussed this previously at a COSIMA meeting and agreed we should switch when the time was right (which is maybe now?).
Not much to add from me: we should retain existing capability for those who rely on it, identify any deficiencies, and (with `accessdev` going away) set up new testing infrastructure, which could make reviewing notebooks more robust too.

I think @max-anu has been modifying the existing recipes to use the catalog. It would be good to get an update from him. There are definitely datasets that are in the cookbook but not yet in the catalog. I think @max-anu will probably have the best idea of what these are.
@anton-seaice makes a good point about retaining existing capability for those who rely on it, identifying deficiencies and setting up a new testing infrastructure.
@dougiesquire is also correct that datasets that are missing from intake should be identified.
@adele-morrison I'm not in a position to say if the intake catalogue is ready or not, but we're rapidly approaching a point where there isn't really an option, so we need to make sure it is.
So, synthesising from above, some steps to consider: convert the recipes to use the catalogue, merge them into `main`, and update the docs. Might be good to break this out into a GitHub project, make some issues etc. I would do this, but I'm not in a position to assist much I'm afraid.
Could the culture be shifted going forward so that when someone (or a group) runs an experiment that is meant to be open to the community, creating an Intake-ESM datastore is part of the process of making it available and documented? My view is that it's the "authors" of specific experiments who are best placed to create and update datastores for the catalog.
@dougiesquire appears to have done a great job documenting how that can be done for those who've never built a catalog before: "I want to catalog my own data to make it easy to find and load".
This new catalog is a big step forward, IMO, but will require a community effort to maintain.
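For concreteness, here is a rough sketch of what cataloguing one experiment might look like, following the guide linked above. The `AccessOm2Builder` interface from `access-nri-intake` is assumed here, and the paths, names and exact arguments are illustrative only, so check the guide for the real call signature:

```python
# Hedged sketch of cataloguing one experiment, per the guide linked above.
# Paths and names are placeholders; exact Builder arguments may differ.
import intake
from access_nri_intake.source.builders import AccessOm2Builder

builder = AccessOm2Builder("/g/data/xx00/my_experiment/output*")  # experiment output location (placeholder)
builder.build()                                                   # scan the output files
builder.save(
    name="my_experiment",
    description="Example experiment (placeholder metadata)",
    directory="/g/data/xx00/my_experiment/",                      # where the datastore files are written
)

# Anyone can then open the resulting datastore with intake-esm:
datastore = intake.open_esm_datastore("/g/data/xx00/my_experiment/my_experiment.json")  # path to whatever save() wrote
```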
Hi All,
So from the MED team perspective, I asked @max-anu to look at converting the recipes to use the intake catalog. The main reason for going ahead with this was precisely to identify the gaps that are mentioned above (missing functionalities, datasets etc.) and to push for a broader use of the intake-catalog. @max-anu should be able to give us an update.
Happy to discuss the plan going ahead.
I have started a repo on the ACCESS-NRI organisation to automate testing of the recipes: https://github.com/ACCESS-NRI/COSIMA-recipes-workflow
This is based on what I have set up for the ESMValTool recipes:
https://github.com/ACCESS-NRI/ESMValTool-workflow?tab=readme-ov-file#recipes-current-status
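Roughly, the idea is to execute every recipe notebook end-to-end and flag the ones that fail. A minimal sketch of that kind of per-notebook run (not the actual workflow, just an illustration using papermill; paths and kernel name are placeholders):

```python
# Illustrative only -- not the actual ACCESS-NRI workflow.
# Execute each recipe notebook and report which ones fail.
from pathlib import Path
import papermill as pm

Path("executed").mkdir(exist_ok=True)

failures = []
for nb in sorted(Path("Recipes").glob("*.ipynb")):
    try:
        pm.execute_notebook(str(nb), f"executed/{nb.name}", kernel_name="python3")
    except Exception as err:
        failures.append((nb.name, err))

print(f"{len(failures)} notebooks failed:", [name for name, _ in failures])
```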
Will keep you posted.
This issue has been mentioned on ACCESS Hive Community Forum. There might be relevant details there:
@rbeucher do you have any updates? It would be great to get an update on where @max-anu has got to so we can try to progress this in the upcoming hackathon.
@max-anu may have some updates.
I've had a quick look at this and it seems that many of the changes @max-anu made are in the `max-updating-notebooks` branch. So, to a large degree this task depends upon taking what has already been done. Would be good to do that in a way which retains Max's contributions. Does anybody know the best way to do that?
My only suggestion is for others to check the existing branch and then to merge?
It's possible to split @max-anu's work into multiple PRs if that helps for the workshop. We can do one per notebook.
Yeah, I think that would be perfect!
Agree!!
Maybe this could be good practice for the group doing PR review training to hone their skills on during the hackathon?
> It's possible to split @max-anu's work into multiple PRs if that helps for the workshop. We can do one per notebook.
@rbeucher - you are much braver at messing about with git than I am. Are you envisioning cloning the branch multiple times and then resetting all but one of the files on each branch? 😬
> Maybe this could be good practice for the group doing PR review training to hone their skills on during the hackathon?
Sure. But a bunch of smaller PRs would be even better, and more manageable in general. Even the "PR review master" might have trouble dealing with a PR for which they need to run all the notebooks.
Oh yes, that's what I meant. A bunch of small PRs and then each person in the training could work on reviewing one each.
I'll have a look today or tomorrow.
That's one way to do it @edoddridge. I want to keep @max-anu's contributions. I'll have a look and report back to you.
I believe there is a way to keep @max-anu's commits in each file! Let's do it that way please...
OK, so a bunch of PRs as you can see... sorry for spamming you. Each of them contains only one notebook with the modifications from @max-anu. I have kept the authorship.
Thanks @rbeucher!
I was away from my desk and came back to 100 emails and got terrified! But all good -- my heart rate is back to resting levels now.
I reviewed 3 of the PRs yesterday and they all need(ed) some work to be ready (e.g. bringing them up to date with current `main`, fixing inconsistencies).
I suggest that, for all the Intake catalogue PRs, we:

- change `Client(tcp...)` to `Client()` (see the sketch after this list)
- be consistent about which imports and notebook magics are included (`pandas`, `matplotlib`, `IPython.display`, `%matplotlib inline`, `%config InlineBackend.figure_format='retina'`)
- be consistent with names such as `cat_subset` and `var_subset`
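For the first point, the change is just to let dask create a local cluster instead of pointing at a hard-coded scheduler address, e.g.:

```python
# Start a local dask cluster rather than connecting to a fixed scheduler address.
from dask.distributed import Client

client = Client(threads_per_worker=1)   # instead of Client("tcp://<scheduler-ip>:8786")
client
```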
There are a bunch of PRs now open, but from the aftermath of the hackathon I got the sense that they are not quite ready and need quite a lot of fiddling to work? I got the vibe that people felt it would be easier to make the conversion from scratch than trying to fix a PR that didn't work....
I am gonna close them... I'll give it a few days so that anybody who has other opinions on how to move forward can post here.
I would advocate for not closing them. I think people started working on them, and we should continue to work and finish them all off! It won't happen overnight, but let's try to do what we can...
I am with Navid. Because I know nothing about the intake catalog, I would find it much easier to start from scratch and do it myself (while learning something), rather than battling with PRs that have a bunch of errors and may be building on very outdated notebooks.
But very happy if NRI folks @anton-seaice @dougiesquire etc want to take these on instead and work from the existing PRs. Also note that at the hackathon we agreed these PRs should each have 2 reviewers, because so many changes are needed from what has been submitted.
OK, let's leave them open for a few weeks and see if people self-assign? Agree with 2 reviewers!
OK. Happy to leave them open for a bit and gauge the vibe!
Just to clarify: I'm all in for the conversion. I just heard people complaining that it was hard work for them to understand what wasn't working in these PRs and fix it, and that they would prefer to do it from scratch after having watched/gone through an Intake tutorial. (The sentiment that @adele-morrison shares above!) So my only argument for closing them is to spare people the trouble (and potentially the trauma) that would make them less inclined to contribute to the recipes later...
Question: do we always need to load pandas with intake? In the PRs I often see `import pandas as pd` even though it wasn't imported before...

The Intake tutorial does not load pandas anywhere, yet all the PRs include `import pandas as pd`.

I may be naïve but I don't see why. Does anybody have an idea? @rbeucher, @anton-seaice, @dougiesquire?
> Question: do we always need to load pandas with intake? In the PRs I often see `import pandas as pd` even though it wasn't imported before...
@navidcy there's no requirement to import pandas
Makes sense. Why do all the automatically-generated PRs add pandas then? I mean this could be a rhetorical question... Don't feel obliged to answer it.
(I was confused by it actually and was inferring that I don't know Intake well enough...)
I'm not sure sorry - @max-anu generated those. But the imports can be removed if pandas isn't explicitly used anywhere.
Hi @Navid Constantinou,
The idea behind generating those PRs was to have a starting point and use the work from @max-anu.
I didn't get the opportunity to review them, and I chose not to rebase the code onto main.
Yes, they do need work, and cleaning.
I don't think that closing PRs before the conversion is done is a good idea.
I don't know about all the PRs or if this is helpful, but for #372 here is a list of the changes made: https://github.com/COSIMA/cosima-recipes/pull/372#issuecomment-2199383566
(mostly thanks to guidance above from @anton-seaice)
I was working on switching my recipes to using `intake` and I have two comments:

1. I think that not being able to add `start_time` and `end_time` as arguments when opening a variable is inefficient. Is this a feature that could be added to `intake`?
2. The catalog does not have any `panant` experiments indexed, so it won't work for any of the notebooks doing things model-agnostically yet. Can these experiments be added too?

Thanks Julia! It's good people are getting stuck into it.
> I think that not being able to add `start_time` and `end_time` as arguments when opening a variable is inefficient. Is this a feature that could be added to `intake`?
@dougiesquire can explain this better than I can. There are two things that come to mind (explained more in this tutorial).
1. Running the `.to_dask()` command doesn't load the data into memory yet; it's loaded 'lazily', so it just has the dimensions, coordinates & metadata available but won't load the actual data until a calculation is performed. This allows the times needed for the analysis to be selected, and once the calculation is performed only data from these times should be loaded, i.e. like this:

```python
data_ic = catalog[experiment].search(
    variable=variable,
    frequency="1mon"
).to_dask()

data_ic_times = data_ic.sel(time=slice(start_time, end_time))

# ... do the calculation
```
2. If that `to_dask` step is slow, it may be due to there being many files to concatenate. You can make your instance bigger, make sure you start a dask cluster (with `threads_per_worker=1`) before running `to_dask`, and add these keywords to the `.to_dask()` call:
```python
xarray_combine_by_coords_kwargs = dict(
    compat="override",
    data_vars="minimal",
    coords="minimal",
)

catalog[experiment].search(
    variable=variable,
).to_dask(
    xarray_combine_by_coords_kwargs=xarray_combine_by_coords_kwargs,
)
```
These keywords reduce the number of consistency checks xarray does when concatenating the netcdf files containing the data. We don't really need the checks because we know the model data is output on a consistent grid and with the same variables in every file, etc.
> The catalog does not have any `panant` experiments indexed, so it won't work for any of the notebooks doing things model-agnostically yet. Can these experiments be added too?
We did start digging into that here ... if there are specific experiments you think are worth adding, can you add them to that thread please?
I'm hitting this issue that @julia-neme mentions above!
I need to load 500 years of data just to slice 2 months of output... seems a bit over the top?
Just adding another voice here in the struggles. Nothing more to add... :)
@anton-seaice I'm not sure exactly what you are suggesting (not sure if it's a bit of an esoteric dialogue with @dougiesquire, in which case I can ignore it).
But before, we could avoid loading 500 years of output with 2 keywords. What's the equivalent way to do it now? We need to add an example in the Intake tutorial -- sorry if this is there and @julia-neme and I seem to have missed it...
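Something like this comparison is what I mean (the cookbook call is from memory and the experiment/variable names are placeholders):

```python
# Cookbook: the time range is part of the query, so only the relevant files are read.
temp = cc.querying.getvar(
    expt, "temp", session,
    start_time="1990-01-01", end_time="1990-03-01",
)

# Intake: open the whole experiment lazily, then select the time slice;
# data is only read when a computation is actually triggered.
temp = (
    catalog[expt]
    .search(variable="temp", frequency="1mon")
    .to_dask()["temp"]
    .sel(time=slice("1990-01-01", "1990-03-01"))
)
```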
I've tested how much time it takes with the cookbook, intake without kwargs, and intake with kwargs (https://gist.github.com/julia-neme/80a277a0f877c044b2a83a097bba744c).
The cookbook is significantly faster and the kwargs don't make much of an improvement :( But maybe there's a smarter way to set them up that I'm not aware of.
Thanks @julia-neme
Hm...........
That's a 20x slowdown.....
I'm feeling a bit uneasy about Intake now....
Great job testing @julia-neme ! Any way we can fix this @dougiesquire @anton-seaice ?
Thanks all for the feedback. Yeah, unfortunately `intake-esm` doesn't currently provide the ability to filter based on time ranges, so the full dataset is always opened (lazily). When the user actually only wants a small temporal subset of the data, this adds a small additional overhead.
A couple of things to note:
> That's a 20x slowdown.....
Regarding @julia-neme's timing tests, one has to be careful checking the time taken to open xarray objects since the data is cached. I couldn't reproduce Julia's results and my guess is that the cell calling `cc.querying.getvar` had already been run in Julia's notebook, so the data was cached. If I run Julia's notebook fresh on a full Cascade Lake node (48 cores) I get the following timings:
So slower, of course, but not a factor of 20x. The difference will obviously also reduce with the size of the time range needed.
I should also point out that there is an open issue and PR with `intake-esm` to add the ability to filter based on time ranges. I think we'll have more resources soon to point at the ACCESS-NRI Intake catalog and this could be something we take on.
Thanks for this @dougiesquire.
2x slower is just fine of course!
I've seen that @julia-neme has been bumping into more (related?) issues; see https://github.com/COSIMA/cosima-recipes/pull/363#issuecomment-2242157082
Probably my issues are worsened because I am working with small or medium ARE sessions. I don't really want to open a full node that I'd only need for the loading, but not really for any of the computations after. Especially if I can open the data in some other, less costly way (i.e. `xr.open_mfdataset`, `cookbook`). To me it seems a waste of resources? But maybe it's not? What do people think?
Why do we need to have two ways for people to load output? At the moment, after #298, we have tutorials both for the cookbook and for the ACCESS-NRI Intake catalog:
https://cosima-recipes.readthedocs.io/en/latest/tutorials.html
Let's just keep the better of the two. Is the ACCESS-NRI Intake catalog the future? Let's make that the default in all examples and drop the cookbook then?
cc @angus-g, @aidanheerdegen, @AndyHoggANU, @aekiss, @adele-morrison, @anton-seaice, @PaulSpence, @rmholmes, @edoddridge, @micaeljtoliveira
- Recipes
- Tutorials
- ACCESS-OM2-GMD-Paper-Figs