Closed esherman-credo closed 1 year ago
GitHub won't let me upload a notebook as evidence, and I don't want to push my testing notebook into the repo. The following should be sufficient to replicate the bug (you need the usual training_script.py
in the right folder).
The output of the last two cells will differ. The first shows the "correct" results when omitting training data. The second reveals the bug with training data included: it should return two sets of results, one for the training data and one for the assessment data (or possibly just the assessment data...).
%load_ext autoreload
%autoreload 2
from credoai.lens import Lens
from credoai.artifacts import TabularData, ClassificationModel
from credoai.evaluators import *
from credoai.governance import Governance
import numpy as np
%run ~/credoai_lens/docs/notebooks/training_script.py
## Set up artifacts
credo_model = ClassificationModel(
    'credit_default_classifier',
    model
)
train_data = TabularData(
    name="UCI-credit-default-train",
    X=X_train,
    y=y_train,
    sensitive_features=sensitive_features_train
)
test_data = TabularData(
    name='UCI-credit-default-test',
    X=X_test,
    y=y_test,
    sensitive_features=sensitive_features_test
)
# pipelines can be specified using a sklearn-like style
metrics = ['accuracy_score', 'roc_auc_score']
pipeline = [
    (Performance(metrics), 'Performance Assessment'),
    (ModelFairness(metrics), "Fairness Assessment"),
]
# and added to lens in one fell swoop
lens = Lens(
    model=credo_model,
    assessment_data=test_data,
    # training_data=train_data,
    pipeline=pipeline
)
lens.run()
## Getting evidence or results out
results = lens.get_results()
results['Fairness Assessment']
### RUN WITH TRAINING DATA SPECIFIED
# pipelines can be specified using a sklearn-like style
metrics = ['accuracy_score', 'roc_auc_score']
pipeline = [
    (Performance(metrics), 'Performance Assessment'),
    (ModelFairness(metrics), "Fairness Assessment"),
]
# and added to lens in one fell swoop
lens = Lens(
    model=credo_model,
    assessment_data=test_data,
    training_data=train_data,
    pipeline=pipeline
)
lens.run()
results = lens.get_results()
results['Fairness Assessment']
Ok, so, the issue seems to be IDs colliding. If you do not specify IDs, things work fine.
For instance, here's a pipeline I get when having assessment and training datasets:
{'Performance_2170dc': {'evaluator': <credoai.evaluators.performance.Performance at 0x29f4192e0>,
'meta': None},
'ModelEquity_d89670': {'evaluator': <credoai.evaluators.equity.ModelEquity at 0x29f4198e0>,
'meta': {'sensitive_feature': 'SEX'}},
'ModelFairness_0d8993': {'evaluator': <credoai.evaluators.fairness.ModelFairness at 0x29f414df0>,
'meta': {'sensitive_feature': 'SEX', 'dataset': 'assessment_data'}},
'ModelFairness_6e3137': {'evaluator': <credoai.evaluators.fairness.ModelFairness at 0x29f414d90>,
'meta': {'sensitive_feature': 'SEX', 'dataset': 'training_data'}},
'DataFairness_ce5eb9': {'evaluator': <credoai.evaluators.data_fairness.DataFairness at 0x29f24e5e0>,
'meta': {'sensitive_feature': 'SEX', 'dataset': 'assessment_data'}},
'DataFairness_2553ee': {'evaluator': <credoai.evaluators.data_fairness.DataFairness at 0x29f234d60>,
'meta': {'sensitive_feature': 'SEX', 'dataset': 'training_data'}}}
As you can see, when IDs are not specified, they are generated automatically, which ensures they are unique.
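A quick sketch of that auto-ID scheme (the helper name and six-character suffix are assumptions for illustration, not Lens's actual implementation):

```python
import uuid

def make_evaluator_id(evaluator_name: str) -> str:
    # Append a short random hex suffix so two instances of the same
    # evaluator (e.g. one per dataset) never share a pipeline key.
    return f"{evaluator_name}_{uuid.uuid4().hex[:6]}"

id_a = make_evaluator_id("ModelFairness")
id_b = make_evaluator_id("ModelFairness")
# Same evaluator, two pipeline entries, two distinct keys.
```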
Yeah, @fabrizio-credo and I discovered that earlier. The question is: do we even want to allow people to specify their own custom IDs? Removing that functionality might be the easiest fix.
https://github.com/credo-ai/credoai_lens/blob/develop/tests/test_lens.py#L200 relies on this functionality... changing it wouldn't be a big deal, but I'm curious about the broader usefulness of being able to specify IDs.
The issue is twofold:
1. When an ID overlaps, it is flagged and leads to an error. However, this check isn't triggered when you add the same evaluator for different datasets.
2. If an ID isn't specified at all, a new ID is created.
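The first point can be modeled with a plain dict standing in for the pipeline (a simplified sketch of the symptom, not Lens's actual code): when a user-specified ID is reused for a second dataset and no duplicate check fires, the later entry silently replaces the earlier one.

```python
pipeline = {}

def add_step(pipeline: dict, step_id: str, evaluator: str, dataset: str) -> None:
    # No duplicate-ID check here, mirroring the case where the check
    # is skipped for the same evaluator on different datasets.
    pipeline[step_id] = {"evaluator": evaluator, "dataset": dataset}

add_step(pipeline, "Fairness Assessment", "ModelFairness", "assessment_data")
add_step(pipeline, "Fairness Assessment", "ModelFairness", "training_data")

# Only one entry survives, and it points at training_data -- matching
# the observed behavior of only training results coming back.
```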
I agree @esherman-credo that customizing IDs is not very helpful and introduces these issues. It's nice not to have to work with IDs that end in random strings, but I'm not sure that's worth worrying about.
Separately, how does training/assessment data interact with evidence creation? I don't believe we "label" evidence based on the dataset; it's only associated through metadata. That could lead to assessments on training data being used to meet the needs of an assessment-data requirement. Something to think about in the future.
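One way to make that association explicit would be to stamp the dataset name into the evidence itself. A hypothetical sketch of the idea (the Evidence class and helper below are made up for illustration, not an existing Lens API):

```python
from dataclasses import dataclass, field

@dataclass
class Evidence:
    value: dict
    metadata: dict = field(default_factory=dict)

def label_evidence(ev: Evidence, dataset: str) -> Evidence:
    # Record which dataset produced this evidence so a training-data
    # result cannot silently satisfy an assessment-data requirement.
    ev.metadata["dataset"] = dataset
    return ev

ev = label_evidence(Evidence({"accuracy_score": 0.81}), "assessment_data")
```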
Expected Behavior
Lens should return results for each specified dataset, and for each evaluator (e.g. 2 data sets + 1 evaluator --> 2 sets of results).
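To make the arithmetic concrete, the expectation is one result set per (evaluator, dataset) pair; the keys below are purely illustrative, not Lens's actual output format:

```python
datasets = ["assessment_data", "training_data"]
evaluators = ["Performance"]

# One (here empty) result set per evaluator/dataset combination.
expected_results = {(ev, ds): {} for ev in evaluators for ds in datasets}

n_results = len(expected_results)  # 2 datasets x 1 evaluator = 2 result sets
```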
Actual Behavior
Lens is currently only returning one result set when both assessment_data and training_data are passed to the Lens() constructor. It appears that only the results from the training_data are processed (likely because that parameter is listed 2nd?).
Creating data objects as follows:
Creating pipeline and Lens object as follows:
Expected Results: a dictionary with 2 sets each of Performance and ModelFairness results. Running as above yields results that look like this (not enough entries, and the entries are clearly from training_data since performance is so strong):
Commenting out the line that passes train_data to the Lens() constructor yields performance that looks like this: