ACCESS-NRI / access-nri-intake-catalog

Tools and configuration info used to manage ACCESS-NRI's intake catalogue
https://access-nri-intake-catalog.rtfd.io
Apache License 2.0

start_date and end_date are the same in experiment 01deg_jra55v140_iaf_cycle4 for certain variables #113

Closed utkarshgupta95 closed 9 months ago

utkarshgupta95 commented 9 months ago

Describe the bug

I am loading experiment 01deg_jra55v140_iaf_cycle4 from conda env:analysis3, but start_date is the same as end_date for the following variables:

npp2d stf09 det caco3 adic alk zoo wdet100 stf07 no3 fe stf03 pprod_gross_2d phy dic o2_yflux_adv src01 no3_xflux_adv npp3d det_xflux_adv caco3_yflux_adv adic_yflux_adv src10 src07 fe_xflux_adv adic_xflux_adv src06 dic_xflux_adv pprod_gross src05 fe_zflux_adv o2 no3_zflux_adv radbio3d caco3_zflux_adv o2_xflux_adv caco3_xflux_adv adic_zflux_adv src09 dic_yflux_adv o2_zflux_adv fe_yflux_adv src03 det_yflux_adv dic_zflux_adv det_zflux_adv no3_yflux_adv

Screenshot for one of the variables

[Image: Screenshot from 2023-09-04 16-28-04]

dougiesquire commented 9 months ago

Thanks for opening this @utkarshgupta95. The issue you are describing here is really an issue with the data files themselves. These files have been saved with a single time instance per file and no information about the time bounds. Therefore, using the file metadata alone, it's not possible to determine the frequency or start_/end_date of the data. I have a bit of hacky logic that checks if a frequency is included in the filename and uses that if the frequency cannot be determined from the file contents. In your example files, this registers the frequency as "1mon" because the word "monthly" is in the filename. But I'm not sure we want to include too many of these kinds of hacky fixes for what is actually incomplete data. I would argue that this is something that should be fixed in the data, rather than by the catalog. Thoughts?
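
For illustration, a filename-based frequency fallback of the kind described here might look roughly like the sketch below. This is not the catalog's actual code: the function name `guess_frequency_from_filename` and the keyword table are hypothetical, and only the "monthly" to "1mon" mapping comes from this thread. The other entries are assumptions.

```python
import re
from typing import Optional

# Hypothetical keyword -> frequency table. Only "monthly" -> "1mon" is
# mentioned in the thread; "daily" and "yearly" are assumed for illustration.
_FILENAME_FREQUENCIES = {
    "daily": "1day",
    "monthly": "1mon",
    "yearly": "1yr",
}


def guess_frequency_from_filename(filename: str) -> Optional[str]:
    """Return a frequency label if a known keyword appears in the filename.

    Intended only as a fallback for files whose contents (a single time
    instance, no time bounds) do not allow the frequency to be inferred.
    """
    for keyword, frequency in _FILENAME_FREQUENCIES.items():
        if re.search(keyword, filename, flags=re.IGNORECASE):
            return frequency
    return None
```

With this sketch, a made-up filename such as "ocean-monthly-mean.nc" would register as "1mon", while a filename with no recognised keyword would return None and leave the frequency undetermined.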

aidanheerdegen commented 9 months ago

Thanks for the explanation @dougiesquire

I would argue that this is something that should be fixed in the data, rather than by the catalog. Thoughts?

Yes, if the fix could be applied before the data was written. It opens a bit of a can of worms to start applying fixes IMO. First you have to determine what fix should be applied, and how. Then it needs to be tested, and then finally applied, and checked it has been done correctly.

It is not a trivial exercise to do this properly, and in the end we'd be retrofitting some logic into the data to fix a problem that could instead be worked around, non-destructively, at the point of use.

I don't know if the fix should be applied at the catalog level. We could do it for this use case where it becomes problematic, but the bit of information we don't have is the calendar the data is using, at least I don't think we do.

If the calendar isn't available in the catalog metadata can we include it?
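
To illustrate why the calendar matters when working around missing time bounds, the sketch below computes the length of a month under a few CF-style calendar names. The helper `days_in_month` is hypothetical and not part of the catalog. It assumes that "standard" and "gregorian" can be treated as proleptic Gregorian, which holds for modern dates.

```python
import calendar


def days_in_month(year: int, month: int, cal: str = "standard") -> int:
    """Number of days in a month for a handful of CF-style calendar names."""
    if cal == "360_day":
        # Every month has 30 days in this calendar.
        return 30
    if cal in ("noleap", "365_day"):
        # February never has a leap day.
        return 28 if month == 2 else calendar.monthrange(year, month)[1]
    if cal in ("all_leap", "366_day"):
        # February always has a leap day.
        return 29 if month == 2 else calendar.monthrange(year, month)[1]
    if cal in ("standard", "gregorian", "proleptic_gregorian"):
        # Python's calendar module is proleptic Gregorian.
        return calendar.monthrange(year, month)[1]
    raise ValueError(f"Unsupported calendar: {cal!r}")
```

For February 2000 this gives 29 days under "standard", 28 under "noleap", and 30 under "360_day", so any end date or time weighting derived from a single monthly timestamp depends on knowing which calendar the data uses.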

dougiesquire commented 9 months ago

First you have to determine what fix should be applied, and how. Then it needs to be tested, and then finally applied, and checked it has been done correctly.

This all also has to be done to retrofit a fix at the catalog level or for your use case, though I agree it is probably substantially less effort. The problem with applying bandaids is that they only fix the specific use case. Fixing the data source would fix all use cases. Data that doesn't include sufficient metadata to determine its frequency/bounds probably isn't ready to be shared.

All this said, we're somewhat stuck with these data and I for one don't want to spend effort adding missing metadata to the files. Checking/enforcing that basic metadata is available before sharing data is something we should consider for future datasets, but, for now, I'll try to add a "best guess" at time bounds to get_timeinfo when bounds information is not available. It would be great if you or @utkarshgupta95 would be able to review.
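
A "best guess" of this kind could be sketched as follows for the monthly case. This is not the actual get_timeinfo implementation: the function name `guess_monthly_bounds` is hypothetical, and the sketch assumes the frequency is already known to be "1mon", that the single timestamp falls inside the month it represents, and that a standard calendar representable by Python's datetime is in use.

```python
from datetime import datetime
from typing import Tuple


def guess_monthly_bounds(timestamp: datetime) -> Tuple[datetime, datetime]:
    """Guess (start, end) bounds for a file holding one monthly time instance.

    Assumes the timestamp lies within the month it describes. The end bound
    is the first instant of the following month.
    """
    start = datetime(timestamp.year, timestamp.month, 1)
    if timestamp.month == 12:
        # Roll over into January of the next year.
        end = datetime(timestamp.year + 1, 1, 1)
    else:
        end = datetime(timestamp.year, timestamp.month + 1, 1)
    return start, end
```

If the files instead stamp each monthly value at the end of its averaging period, for example at the first instant of the following month, this guess would pick the wrong month, which is one reason a guess like this needs review against the actual data.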

aidanheerdegen commented 9 months ago

All this said, we're somewhat stuck with these data and I for one don't want to spend effort adding missing metadata to the files.

+1

Checking/enforcing that basic metadata is available before sharing data is something we should consider for future datasets

+1000

for now, I'll try to add a "best guess" at time bounds to get_timeinfo when bounds information is not available.

I wasn't insisting this be done for the catalog. Maybe it is just our case that cares about this, so we should probably then make some logic to work around it. I think we would just need the calendar, which would be good to have in the metadata in any case I think?

It would be great if you or @utkarshgupta95 would be able to review.

Sure, if you decide to do this and think it worthwhile. The other option is to flag the problem in the "collection" metadata and leave as-is.