Closed sarahclaude closed 11 months ago
This should work as you describe, since the exclusion is basically only a .search(), which does not require a match of all columns.
What do you get for ex when you run the code below?
# Cut entries that do not match search criteria
if exclusions:
    ex = catalog.search(**exclusions)
    catalog.esmcat._df = pd.concat([catalog.df, ex.df]).drop_duplicates(keep=False)
    logger.info(
        f"Removing {len(ex.df)} assets based on exclusion dict : {exclusions}."
    )
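For anyone following along, here is a minimal sketch of what the concat + drop_duplicates(keep=False) step does, using plain pandas. The DataFrame and its column names are made up for illustration and only stand in for catalog.df and ex.df; the trick also assumes the catalog has no duplicate rows of its own.

```python
import pandas as pd

# Hypothetical stand-in for catalog.df; column names are illustrative only.
catalog_df = pd.DataFrame(
    {
        "id": ["a", "b", "c", "d"],
        "variable": ["tas", "pr", "tas", "pr"],
    }
)

# Stand-in for ex.df: the rows that the exclusion search returned.
ex_df = catalog_df[catalog_df["id"].isin(["b", "d"])]

# Rows present in both frames appear twice after concat, so keep=False drops
# every copy of them and only the non-excluded rows survive.
remaining = pd.concat([catalog_df, ex_df]).drop_duplicates(keep=False)

# An empty exclusion result removes nothing from the catalog.
empty_ex = catalog_df.iloc[0:0]
unchanged = pd.concat([catalog_df, empty_ex]).drop_duplicates(keep=False)

print(remaining["id"].tolist())
print(len(unchanged) == len(catalog_df))
```

So if ex comes back empty, the catalog is left untouched and the exclusions look like they were ignored.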
You're right! ex returns an empty catalog, so I guess I should modify my search for it to work.
Reopening this! I'm still looking into it, but this is definitely a bug (and probably a new one). From what I can tell so far, we're not at fault. This line basically enforces a require_all_on, which causes the search to be empty:
https://github.com/intake/intake-esm/blob/bfdfb5123d1df3d2bcf9b42493d81607e83e547b/intake_esm/_search.py#L56C8-L56C8
Our tests only had a single entry in exclusions, which is why it slipped by.
@aulemahal Thoughts? I see that it is quite old code, according to git blame... I don't know how it slipped by.
esm_datastore.search states:
require_all_on : list, str, optional
    A dataframe column or a list of dataframe columns across
    which all entries must satisfy the query criteria.
    If None, return entries that fulfill any of the criteria specified
    in the query, by default None.
But the actual code is:
def search(
    *, df: pd.DataFrame, query: dict[str, typing.Any], columns_with_iterables: set
) -> pd.DataFrame:
    """Search for entries in the catalog."""
    if not query:
        return pd.DataFrame(columns=df.columns)
    global_mask = np.ones(len(df), dtype=bool)
    for column, values in query.items():
        local_mask = np.zeros(len(df), dtype=bool)
        column_is_stringtype = isinstance(
            df[column].dtype, (object, pd.core.arrays.string_.StringDtype)
        )
        column_has_iterables = column in columns_with_iterables
        for value in values:
            if column_has_iterables:
                mask = df[column].str.contains(value, regex=False)
            elif column_is_stringtype and is_pattern(value):
                mask = df[column].str.contains(value, regex=True, case=True, flags=0)
            elif pd.isna(value):
                mask = df[column].isnull()
            else:
                mask = df[column] == value
            local_mask = local_mask | mask
        global_mask = global_mask & local_mask
    results = df.loc[global_mask]
    return results.reset_index(drop=True)
This starts by setting every row to True in global_mask, but a row only stays True if it is also True in the local_mask of each column. Values within one column are combined with OR, but the columns themselves are combined with AND. By iterating through each criterion, you basically only keep entries that were True for every column that was requested, which contradicts the above definition...
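To make the effect concrete, here is a stripped-down sketch of the same masking logic, keeping only the equality branch. The and_search name and the toy DataFrame are made up for this example and are not part of intake-esm.

```python
import numpy as np
import pandas as pd


def and_search(df: pd.DataFrame, query: dict) -> pd.DataFrame:
    """Simplified copy of the masking logic above (equality matches only)."""
    global_mask = np.ones(len(df), dtype=bool)
    for column, values in query.items():
        local_mask = np.zeros(len(df), dtype=bool)
        for value in values:
            # Values within one column are combined with OR.
            local_mask = local_mask | (df[column] == value).to_numpy()
        # Columns are combined with AND.
        global_mask = global_mask & local_mask
    return df.loc[global_mask].reset_index(drop=True)


# Hypothetical catalog with illustrative column names.
df = pd.DataFrame(
    {
        "id": ["a", "b", "c"],
        "variable": ["tas", "pr", "tas"],
    }
)

single = and_search(df, {"id": ["a"]})
both = and_search(df, {"id": ["a"], "variable": ["pr"]})
print(len(single), len(both))
```

With a single column the search finds the row, but as soon as a second column is added that no row satisfies at the same time, the result is empty.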
I'd start with an all-False global_mask, then use an OR operation to gradually turn rows True.
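A rough sketch of that idea, again reduced to equality matches only. The or_search name and the toy data are hypothetical; this is not the intake-esm implementation, and a real fix would still need the pattern, NaN and iterable-column branches.

```python
import numpy as np
import pandas as pd


def or_search(df: pd.DataFrame, query: dict) -> pd.DataFrame:
    """Hypothetical 'any criterion' variant of the search (equality only)."""
    if not query:
        return pd.DataFrame(columns=df.columns)
    # Start with nothing selected, then switch rows on as criteria match.
    global_mask = np.zeros(len(df), dtype=bool)
    for column, values in query.items():
        for value in values:
            global_mask = global_mask | (df[column] == value).to_numpy()
    return df.loc[global_mask].reset_index(drop=True)


# Hypothetical catalog with illustrative column names.
df = pd.DataFrame(
    {
        "id": ["a", "b", "c"],
        "variable": ["tas", "pr", "tas"],
    }
)

matched = or_search(df, {"id": ["a"], "variable": ["pr"]})
print(matched["id"].tolist())
```

Here a row is returned as soon as it matches any of the requested columns, which is closer to what the docstring describes for require_all_on=None.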
Description
Not sure if this is a bug or a feature request.
When I add more than one catalog column to the xs.search_data_catalogs exclusions, all exclusions are ignored, but when I only have one (example: id), they are excluded.