eurec4a / eurec4a-intake

Intake catalogue for EUREC4A field campaign datasets
17 stars 19 forks source link

Updated RonBrown, NTAS, CU-RAAVEN, SD, SVP and WG catalogs #74

Closed lbariteau closed 2 years ago

lbariteau commented 3 years ago

Added catalog for RonBrown measurements and NTAS mooring. Updated link to NCEI data for CU-RAAVEN, NOAA Saildrone, SVP drifters and Wave gliders.

RobertPincus commented 3 years ago

Hi @d70-t I hope things are well with you. I helped @lbariteau a little with this very lengthy addition to the intake catalog (all the many datasets from the Ron Brown and CU-RAAVEN). The catalog is now so large that the CI may not run to completion. Don't know how to address that but thought you might have ideas.

RobertPincus commented 3 years ago

@d70-t @lbariteau Note that Ludovic's branch passes the tests when they are run locally.

d70-t commented 3 years ago

I also ran the tests on my machine successfully.

d70-t commented 3 years ago

Will this PR make #13 obsolete?

RobertPincus commented 3 years ago

@d70-t Me and @lbariteau decided not to include the Level-1 lidar data (the subject of #13) in the general Ron Brown catalog - there's just too much of it and it seems to have little user value. If @leifdenby wants to make this part of the catalog he could do so, but should point to the data's new home at NOAA NCEI rather than the PSL server.

RobertPincus commented 3 years ago

@lbariteau, @d70-t raises two questions. I don't see you revising the _FillValue and missing_value attributes. Do you want to break some of the longer datasets into pieces? If not please comment here and we will merge your many changes into the main catalog.

lbariteau commented 3 years ago

@RobertPincus and @d70-t . Thank you for your inputs and questions. This is a great learning platform for me. WRT to the 2 questions from Tobias: 1) The lengthy datasets ADCp or M-AERI were published like this by the respective PIs so for now I would leave them as is. I will pass the message along and see if they can break it down to smaller files. 2) Similarly for the atmos-chem files, I didn't produce those files so revising the _FillValue and missing_value attributes will have to be done by PIs. I will pass also that information along and see if they can correct it.

Meanwhile I suggest we merge the current changes to the main catalog and update it again when/if files are modified.

d70-t commented 3 years ago

Just to clarify: my concern was about the length of the datasets keys (in the catalogue), not the datasets themselves.

In fact, I think that datasets may become much simpler to handle if they are very large (e.g. one Dataset for the entire duration of the instrument, if it is installed at a fixed location) once we have a good working system to subset datasets efficiently (opendap is intended to be such a system). That would make something like this possible:

open_dataset("my_sensor").sel("jan 17 to feb 9")

in stead of

my_date_range = "jan 17 to feb 9"
filenames = [filename for filename in glob(....) if some_condition(filename, my_date_range)]
dataset = concat([open_dataset(filename) for filename in filenames])
dataset = dataset.sel(my_date_range)
lbariteau commented 3 years ago

Ok thanks for the clarification. I assume by dataset keys in the catalog you mean array identifiers in each individual netcdf files. I think this is indeed intentional and selected by respective users for particular reasons. ~L

d70-t commented 3 years ago

Sorry for the confusion again. I literally meant the keys in the catalog, i.e. M-AERI_SSTskin20200106171410. I would have expected something like SSTskin or the like. Such that from a user's perspective it would look like:

cat.RonBrown.MAERI.SSTskin.to_dask()

in stead of:

cat.RonBrown.MAERI["M-AERI_SSTskin20200106171410"].to_dask()

This is of course only applicable if there's just one time-range to be expected at any further time. But I was assuming that this is the case for ATOMIC / EUREC4A, as you've provided a couple of files in other occasions.

If there are multiple time ranges, we've been adding another level, either by introducing user parameters in form of a date or by defining segments (i.e. flights or legs or cruises or the like).


In general, for the catalog we'd like to put an emphasis on write once (or a few times) and read many times, so it should be a lot more easy / intuitive to read the catalog than to write it. (Hopefully, more people are on the using than on the creating end).

RobertPincus commented 3 years ago

@lbariteau, @d70-t's suggestion would apply to datasets with a single file. This seems like the ADCP, CTD, MAERI, ROSR, and UCTD. Could the dataset names for the ROSR and MAERI highlight the difference between sea surface and sea skin temperature? Do the ADCP or U/CTDs even need a second level in the hierarchy, or might they be safely referred to as cat.RonBrown.UCTD ? Feel free to be in touch offline if you have questions.

lbariteau commented 3 years ago

@d70-t and @RobertPincus Thanks both for your comments. That helps understanding what needs to be done. I will look into this and update my branch accordingly.

leifdenby commented 2 years ago

@d70-t Me and @lbariteau decided not to include the Level-1 lidar data (the subject of #13) in the general Ron Brown catalog - there's just too much of it and it seems to have little user value. If @leifdenby wants to make this part of the catalog he could do so, but should point to the data's new home at NOAA NCEI rather than the PSL server.

@RobertPincus yes, that makes sense (I just had a look at the commit diff and that is a lot of files!) If the Level-1 netCDF files are available online (and just not in the catalog) maybe you could add a line to the catalog entry description to indicate where the Level-1 files are available? I'll close #13 because yes, that is (very) outdated now

d70-t commented 2 years ago

@RobertPincus and @lbariteau Just to avoid a deadlock: My impression was that after the comments above you two will decide and merge this pull request once you are happy with it as well. Is that correct or do you want me to progress this further?

Just one reminder from my side might be: we try to keep accepted keys in the catalog as stable as possible. Thus if you plan to do a change on the keys (e.g. SSTskin), than this should be done now. If you like the current ones better, then we should be prepared to keep them for a long time. As it's your datasets, I don't want to move that decision into any particular direction.

For the potential change within the netCDF files, I'm a bit less concerned and would agree that this should be ok-ish to do afterwards.

RobertPincus commented 2 years ago

@d70-t Thanks for checking in. It's my understanding that @lbariteau will update the keys to be more concise and/or descriptive when he has a moment and I will consult if he asks me. Is that right @lbariteau ?

lbariteau commented 2 years ago

@d70-t and @RobertPincus. That is correct. I was checking a few things and will get this rolling this week. Thanks for checking in.

lbariteau commented 2 years ago

@d70-t and @RobertPincus. I have updated the catalog keys for the RonBrown. Let me know what you think. As for the netcdf _FillValue/_MissingValue, I will let the data creators do this in their files and we can update the link if need be later on.

RobertPincus commented 2 years ago

@lbariteau I think you're quite close, a few small suggestions for clarity and consistency.

lbariteau commented 2 years ago

@RobertPincus I did a few edits to reflect your recommendations for clarity and consistency. Let me know what you think and then we can merge this pull request with the master one. Thanks much!

d70-t commented 2 years ago

Awesome :tada:

There's just one little thing I can spot... The file P3/._isotope-analyzer.yaml slipped into the repository. It would be great if you could remove that before the merge.

lbariteau commented 2 years ago

@RobertPincus and @d70-t Thanks for the last edits and for the merge. Yeah! Until next time ;)