ACCESS-NRI / access-nri-intake-catalog

Tools and configuration info used to manage ACCESS-NRI's intake catalogue
https://access-nri-intake-catalog.rtfd.io
Apache License 2.0

[Catalog utility functions] find_chunking_info #218

Open Thomas-Moore-Creative opened 1 month ago

Thomas-Moore-Creative commented 1 month ago

Is your feature request related to a problem? Please describe.

Using settings like xarray_open_kwargs effectively requires a better understanding of the underlying NetCDF data structure, and that in turn requires discovering the native file chunking.
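To illustrate why the native chunking matters here, the following is a minimal sketch (not code from this repository or from ACDtools) that grows a set of native chunk sizes into larger chunks that stay whole multiples of the on-disk chunks. The function name `aligned_chunks`, the `target_elements` default, and the choice to grow the last dimension first are all assumptions made for the example.

```python
import math


def aligned_chunks(native, dim_sizes, target_elements=4_000_000):
    """Return chunk sizes that are whole multiples of the native chunks.

    native:          dict of dimension name -> native (on-disk) chunk length
    dim_sizes:       dict of dimension name -> total length of that dimension
    target_elements: rough number of array elements wanted per chunk
                     (hypothetical default, tune for your data type and memory)
    """
    chunks = dict(native)
    # Grow one dimension at a time, last dimension first (an arbitrary choice),
    # until the chunk holds roughly target_elements elements.
    for dim in reversed(list(native)):
        current = math.prod(chunks.values())
        if current >= target_elements:
            break
        # Largest useful multiplier: enough native chunks to span the dimension.
        max_factor = math.ceil(dim_sizes[dim] / native[dim])
        factor = min(max_factor, max(1, target_elements // current))
        # Never exceed the dimension length itself.
        chunks[dim] = min(native[dim] * factor, dim_sizes[dim])
    return chunks


native = {"time": 1, "lat": 300, "lon": 400}
sizes = {"time": 120, "lat": 300, "lon": 3600}
print(aligned_chunks(native, sizes, target_elements=1_200_000))
# → {'time': 1, 'lat': 300, 'lon': 3600}
```

The resulting dict could then be supplied as the `chunks` entry of xarray_open_kwargs, so that each dask chunk covers whole native chunks rather than cutting across them.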

Describe the feature you'd like

  1. a place for the community to help build utility functions that would support the https://github.com/ACCESS-NRI/access-nri-intake-catalog
  2. a specific function to "find the native chunking information" for a dataset in the catalog that first-time users have the ability to use and understand
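As a rough idea of what such a function might look like, here is a minimal sketch. It is not the ACDtools implementation; only the name `find_chunking_info` comes from this issue. It assumes the dataset behaves like an `xarray.Dataset` opened from NetCDF, where (to my understanding) the netCDF4 and h5netcdf backends record the on-disk chunk shape under `encoding["chunksizes"]`. Treat that key, and the return format, as assumptions.

```python
def find_chunking_info(ds):
    """Map each data variable to its on-disk chunk shape, keyed by dimension.

    `ds` is expected to behave like an xarray.Dataset opened from a NetCDF
    file: `ds.data_vars` is a mapping of variables, and each variable has
    `.dims` and an `.encoding` dict. Variables with no "chunksizes" entry
    (or a None entry, e.g. contiguous storage) map to None.
    """
    info = {}
    for name, var in ds.data_vars.items():
        chunksizes = var.encoding.get("chunksizes")
        if chunksizes is None:
            info[name] = None
        else:
            # Pair each dimension name with its native chunk length.
            info[name] = dict(zip(var.dims, chunksizes))
    return info
```

With a real file, `ds` might come from `xarray.open_dataset(path)` or from a dataset loaded through the catalog. Because the function only reads `.dims` and `.encoding`, it does not need to import the catalog package itself.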

Describe alternatives you've considered

I'm writing my own pre-alpha functions, such as find_chunking_info, at https://github.com/Thomas-Moore-Creative/ACDtools/blob/main/ACDtools but I'd like a place to collaborate on utilities that is easier for the whole community to see and share.

Additional context

aidanheerdegen commented 1 month ago

Hola @Thomas-Moore-Creative,

This sounds cool. I'm thinking the best place for something like this might be a repo on the ACCESS Community Hub organisation. It makes it more straightforward to collaborate as we could make you an admin on the repo.

https://github.com/ACCESS-Community-Hub

Then we could point to it from this repo

Does that sound like a good way forward?

Thomas-Moore-Creative commented 1 month ago

Sounds fine to me @aidanheerdegen - thanks. Do you, @rbeucher, @dougiesquire, or any of your software engineering gurus have advice on how to structure this repo so it's portable, flexible, and available to all on NCI?

aidanheerdegen commented 1 month ago

Dougie is on leave, so he's out of the picture.

To make it available on gadi I'd say we should add conda packaging. We could also arrange to publish it to the accessnri anaconda channel, or create another access community channel.

We can deal with that later.

As for repo structure, first decision might be flat layout vs src layout, and then isolate functionality in sub-directories.

Is that the sort of thing you were thinking about @Thomas-Moore-Creative?

Do you have any opinions @marc-white?

marc-white commented 1 month ago

I think the main thing to determine is a question of scope. What exactly are you trying to do? Is it just doing some stuff to work out the native chunking of netCDF files, or are you looking to expand this to include more tools later down the track?

Then, once you've worked out the answer to that question, that will inform your answer to the next question: should this come in as a part of access-nri-intake-catalog, or should it be spun off into its own utility package?

Thomas-Moore-Creative commented 1 month ago

Thanks @marc-white.

What I'm trying to do is get my projects done, which requires using the access-nri-intake-catalog, and for me that means data discovery, building search filters, and understanding data structure to allow optimal analysis-ready-data workflows to be built for specific datasets.

In this issue I highlighted just one very simple utility that I'm building ("find the native chunking information"), but I am wondering out loud whether there is a better place to develop helper utilities than my personal repos. Maybe the questions are:

charles-turner-1 commented 1 month ago

> As for repo structure, first decision might be flat layout vs src layout, and then isolate functionality in sub-directories.

I'd recommend going with src layout for consistency - it seems to be Dougie's preferred layout, and would keep things consistent with this package itself and the related intake-dataframe-catalog.

As to whether this should be included within access-nri-intake-catalog or as a standalone package, I would suggest the latter. Lots of the functionality of the catalog, e.g. loading datasets, is actually performed by intake-esm, and I suspect that this might cause complications. My vote would be for a separate package - something like access-intake-utils - and then we try to keep the interdependencies as minimal as possible.

Thomas-Moore-Creative commented 1 month ago

> My vote would be for a separate package - something like access-intake-utils - and then we try to keep the interdependencies as minimal as possible.

From a user's point of view this makes sense to me. Thanks for the advice.

Can we start with an access-intake-utils repo in https://github.com/ACCESS-Community-Hub, as suggested by @aidanheerdegen above?

rbeucher commented 1 month ago

I agree with @charles-turner-1. A separate package is the way to go for now. Feel free to start in ACCESS-Community-Hub.