esgf2-us / intake-esgf

Programmatic access to the ESGF holdings
https://intake-esgf.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

to_dataset_dict fails when datetime_stop is nan #59

Closed: jsta closed this issue 2 months ago

jsta commented 4 months ago

I believe catalog.py fails when a field is nan instead of a string, at approximately the line shown in the traceback below.

import intake_esgf
from intake_esgf import ESGFCatalog

print(intake_esgf.__version__)

cat = ESGFCatalog()
cat.search(
    experiment_id="ssp585",
    source_id="IITM-ESM",    
    variable_id=["tas"],
    table_id="Amon",
)
cat.to_dataset_dict()

# 2024.5.2
#    Searching indices: 100%|██████████|2/2 [    2.54index/s]
# Traceback (most recent call last):
#   File "<snip>/test.py", line 37, in <module>
#     cat.to_dataset_dict()
#   File "<snip>/lib/python3.12/site-packages/intake_esgf/catalog.py", line 652, in to_dataset_dict
#     key = separator.join([row[k] for k in output_key_format])
#           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# TypeError: sequence item 1: expected str instance, float found

Here is the summary info for cat (note how one of the datetime_stop[s] is nan):

Summary information for 2 results:
institution_id    [CCCR-IITM]
activity_drs      [ScenarioMIP]
table_id          [Amon]
experiment_id     [ssp585]
source_id         [IITM-ESM]
mip_era           [CMIP6]
datetime_start    [2015-01-17T00:00:00Z, 2015-01-16T12:00:00Z]
variable_id       [tas]
grid_label        [gn]
member_id         [r1i1p1f1]
project           [CMIP6]
datetime_stop     [nan, 2099-12-16T12:00:00Z]
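
The failure in the traceback can be reproduced without any ESGF access, since it comes from Python's str.join rejecting a float. The sketch below is a minimal illustration only: the names row, output_key_format, and separator mirror the traceback line, but the "." separator and the three-facet key format are assumptions made for the example, not the library's actual defaults.

```python
# Minimal reproduction of the TypeError, independent of intake-esgf.
# `row`, `output_key_format`, and `separator` are illustrative stand-ins;
# the "." separator and the facet list are assumptions for this example.
row = {
    "source_id": "IITM-ESM",
    "datetime_stop": float("nan"),  # a missing facet shows up as a float nan
    "variable_id": "tas",
}
output_key_format = ["source_id", "datetime_stop", "variable_id"]
separator = "."

try:
    # Same expression as the failing line in the traceback.
    key = separator.join([row[k] for k in output_key_format])
    error_message = None
except TypeError as exc:
    # str.join only accepts str items, so the nan (a float) is rejected.
    error_message = str(exc)

print(error_message)
```

Any facet that is nan and also appears in the key format would trigger the same error, which is why datetime_stop matters here only when it is part of the key.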

jsta commented 4 months ago

I am dealing with this with the following:

def should_i_keep_it(sub_df):
    # Keep a dataset only if none of its rows contain missing (nan) values.
    complete_rows = sub_df.dropna()
    return complete_rows.shape[0] == sub_df.shape[0]

cat.remove_incomplete(should_i_keep_it)

nocollier commented 4 months ago

Thank you for the report, and apologies for the trouble. I have been away from this for a few weeks, but will get back to it in the next week or so. It seems that some facet information is not available in the metadata record, so pandas fills it with nans. Those nans later cause problems in parts of the code, and there are probably other places affected as well. In your case the failure appears to happen when the keys of the dictionary are formed. This is slated for a rework and I will take this problem into consideration.
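
For readers hitting the same error in their own key-building code, one defensive pattern is to substitute a placeholder for missing facets before joining. This is only a sketch of that general idea, not how intake-esgf handles it: build_key, its parameters, and the "unknown" placeholder are hypothetical names introduced here for illustration.

```python
import math


def build_key(row, key_format, separator=".", placeholder="unknown"):
    """Join facet values into a key, tolerating missing (None/nan) entries.

    Hypothetical helper for illustration; not part of intake-esgf.
    """
    parts = []
    for facet in key_format:
        value = row.get(facet)
        # Treat None and float nan as missing facet information.
        missing = value is None or (isinstance(value, float) and math.isnan(value))
        parts.append(placeholder if missing else str(value))
    return separator.join(parts)


row = {"source_id": "IITM-ESM", "datetime_stop": float("nan"), "variable_id": "tas"}
key = build_key(row, ["source_id", "datetime_stop", "variable_id"])
print(key)
```

A placeholder keeps every record addressable, at the cost of possible key collisions when two records differ only in a missing facet; dropping the incomplete records instead avoids that ambiguity.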

nocollier commented 2 months ago

This is fixed in v2024.7.15 in the sense that now by default you will not have datetime_{start|stop} in the dataframe. The problem is that not all records in the ESGF database have these fields. The next thing on my list is to rework to_dataset_dict() and then we won't be building dictionary keys from catalog columns. Will close this for now, as #62 should solve your problem.