Should the cookbook default to stricter requirements when merging/concatenating data?

COSIMA / cosima-cookbook

Framework for indexing and querying ocean-sea ice model output.

https://cosima-recipes.readthedocs.io/en/latest/

Apache License 2.0

58 stars 25 forks source link

Should the cookbook default to stricter requirements when merging/concatenating data? #319

Open dougiesquire opened 1 year ago

dougiesquire commented 1 year ago

Motivating example here: https://github.com/COSIMA/cosima-recipes/issues/229. Files with the same naming in a single experiment are on different domains depending on the output* directory. Should the cookbook check whether indexes are the same for data being merged/concatenated?

E.g. passing join="exact" to the open_mfdataset() call within the cosima-cookbook will then return an error when indexes to be aligned are not equal. This could currently be passed through kwargs, but should it be the default?

rmholmes commented 1 year ago

Thanks for catching this @dougiesquire.

I'm not sure this is a cookbook issue. I think it's more an issue with the data itself. I don't think it's a good idea to have output defined on different regions using the same file name. I'd suggest that a good way to deal with this issue is to rename the ocean_daily_3d_u_%.nc files in output196-output279 to something like ocean_daily_3d_u_southern_ocean_%.nc. They can then be separated using the nc_file argument to cc.querying.getvar.

But I guess even then, it would still be useful to flag it so that the user knows they have to use nc_file.

dougiesquire commented 1 year ago

Thanks @rmholmes. I wasn't meaning to suggest that the issue is with the cookbook, but having join="exact" as default would've saved me a bunch of time yesterday. I.e. it could be useful for helping to find/flag issues with the data.

My guess is that most uses of the cookbook are to query/load datasets that should have consistent indexes. So having join="exact" as default could make sense - users could always override the kwarg if they want to merge inconsistent data. But, I'm probably just not across the full range of cookbook use cases.

dougiesquire commented 1 year ago

But yes, for fixing the specific issue with 01deg_jra55v13_ryf9091, changing the name of the nc files sounds sensible to me. Who would be in charge of doing that?

rmholmes commented 1 year ago

Sorry @dougiesquire, I didn't completely take in your comment here as I'd copied my response across from the cosima-recipes issue you'd put up. I'd support a move to the stricter requirements.

I think @AndyHoggANU ran that simulation.

angus-g commented 1 year ago

I agree that join="exact" seems like a sensible default, it's a bit crazy that xarray tries to concatenate datasets like that in the first place!

aidanheerdegen commented 1 year ago

Is there any time penalty with join="exact"? ISTR folks complaining about xarray doing time consuming checks on coordinates under some circumstances, but it may be me misremembering, or the issue may no longer be a problem.

angus-g commented 1 year ago

I think the checks by compat are more expensive than those for join (which just tries to align dimension sizes)? Probably one of those things where it's best to just benchmark it.

dougiesquire commented 1 year ago

Agreed. I wouldn't expect any difference in speed for data that can be joined.

As @angus-g mentioned, there are other kwargs that can be changed to improve performance, but they require making some assumptions about the data being loaded. I don't know whether these are justified for the COSIMA data?

EDIT: see the Note here: https://docs.xarray.dev/en/stable/user-guide/io.html#reading-multi-file-datasets