search_data_catalogs exclusions with more than one key

sarahclaude commented 11 months ago

Setup Information

xscen version: 0.7.1
Python version: 3.10

Description

Not sure if this is bug or feature request.

When I add more than one catalog columns in xs.search_data_catalogs exclusions, all exclusions are ignored but when I only have one (exemple: id) they are excluded.

Steps To Reproduce

subcat = xs.search_data_catalogs(
  data_catalogs = sim_cat, 
  variables_and_freqs = {'tasmax': 'D'}, 
  exclusions = {
    'id': 'CMIP6_ScenarioMIP_EC-Earth-Consortium_EC-Earth3_ssp585_r3i1p1f1_global', 
    'source': 'FGOALS-g3'}
)

subcat['CMIP6_ScenarioMIP_EC-Earth-Consortium_EC-Earth3_ssp585_r3i1p1f1_global']
< catalog with 1 dataset(s) from 86 asset(s)>

Additional context

No response

Contribution

[ ] I would be willing/able to open a Pull Request to address this bug.

RondeauG commented 11 months ago

This should work as you describe, since the exclusion is basically only a .search(), which does not require a match of all columns.

What do you get for ex when you run the code below?

    # Cut entries that do not match search criteria
    if exclusions:
        ex = catalog.search(**exclusions)
        catalog.esmcat._df = pd.concat([catalog.df, ex.df]).drop_duplicates(keep=False)
        logger.info(
            f"Removing {len(ex.df)} assets based on exclusion dict : {exclusions}."
        )

sarahclaude commented 11 months ago

You're right! ex returns an empty catalog, guess I should modify my search for it to works

RondeauG commented 11 months ago

Reopening this! I'm looking into this, but this is definitely a bug (and probably a new one). From what I can tell so far, it's that we're not at fault. This line basically ensures a require_all_on, which causes the search to be empty. https://github.com/intake/intake-esm/blob/bfdfb5123d1df3d2bcf9b42493d81607e83e547b/intake_esm/_search.py#L56C8-L56C8

Our tests only had a single entry in exclusions, which is why it slipped by.

RondeauG commented 11 months ago

@aulemahal Thoughts? I see that it is quite an old code, according to Git Blame... I don't know how it slipped by.

esm_datastore.search states:

        require_all_on : list, str, optional
            A dataframe column or a list of dataframe columns across
            which all entries must satisfy the query criteria.
            If None, return entries that fulfill any of the criteria specified
            in the query, by default None.

But the actual code is:

def search(
    *, df: pd.DataFrame, query: dict[str, typing.Any], columns_with_iterables: set
) -> pd.DataFrame:
    """Search for entries in the catalog."""

    if not query:
        return pd.DataFrame(columns=df.columns)
    global_mask = np.ones(len(df), dtype=bool)
    for column, values in query.items():
        local_mask = np.zeros(len(df), dtype=bool)
        column_is_stringtype = isinstance(
            df[column].dtype, (object, pd.core.arrays.string_.StringDtype)
        )
        column_has_iterables = column in columns_with_iterables
        for value in values:
            if column_has_iterables:
                mask = df[column].str.contains(value, regex=False)
            elif column_is_stringtype and is_pattern(value):
                mask = df[column].str.contains(value, regex=True, case=True, flags=0)
            elif pd.isna(value):
                mask = df[column].isnull()
            else:
                mask = df[column] == value
            local_mask = local_mask | mask
        global_mask = global_mask & local_mask
    results = df.loc[global_mask]
    return results.reset_index(drop=True)

(https://github.com/intake/intake-esm/blob/bfdfb5123d1df3d2bcf9b42493d81607e83e547b/intake_esm/_search.py#L32)

This starts by putting everything as True with global_mask, but only keeps as True the lines where both this and local_mask are True. By iterating through each criteria, you basically only keep entries that were True for everything that was requested, which contradicts the above definition...

I'd start with a False global_mask, then use an OR argument to gradually turn lines True.

Ouranosinc / xscen