Striveworks / valor

Valor is a centralized evaluation store which makes it easy to measure, explore, and rank model performance.
https://striveworks.github.io/valor/

BUG: Valor Core Counting Datums with other label keys as true negative #739

Closed: jqu-striveworks closed this issue 1 month ago

jqu-striveworks commented 2 months ago


Reproducible Example

from valor_core import classification, schemas, enums

def evaluate_image_clf_groundtruths():
    return [
        schemas.GroundTruth(
            datum=schemas.Datum(
                uid="uid0",
                metadata={
                    "height": 900,
                    "width": 300,
                },
            ),
            annotations=[
                schemas.Annotation(
                    labels=[
                        schemas.Label(key="dataset1", value="ant"),
                    ],
                ),
            ],
        ),
        schemas.GroundTruth(
            datum=schemas.Datum(
                uid="uid1",
                metadata={
                    "height": 900,
                    "width": 300,
                },
            ),
            annotations=[
                schemas.Annotation(
                    labels=[
                        schemas.Label(key="dataset2", value="egg"),
                    ],
                ),
            ],
        ),
    ]

def evaluate_image_clf_predictions():
    return [
        schemas.Prediction(
            datum=schemas.Datum(
                uid="uid0",
                metadata={
                    "height": 900,
                    "width": 300,
                },
            ),
            annotations=[
                schemas.Annotation(
                    labels=[
                        schemas.Label(key="dataset1", value="ant", score=0.15),
                        schemas.Label(key="dataset1", value="bee", score=0.48),
                        schemas.Label(key="dataset1", value="cat", score=0.37),
                    ],
                )
            ],
        ),
        schemas.Prediction(
            datum=schemas.Datum(
                uid="uid1",
                metadata={
                    "height": 900,
                    "width": 300,
                },
            ),
            annotations=[
                schemas.Annotation(
                    labels=[
                        schemas.Label(key="dataset2", value="egg", score=0.15),
                        schemas.Label(key="dataset2", value="milk", score=0.48),
                        schemas.Label(key="dataset2", value="flour", score=0.37),
                    ],
                )
            ],
        ),
    ]

groundtruths = evaluate_image_clf_groundtruths()
predictions = evaluate_image_clf_predictions()

metrics_to_return = [
    enums.MetricType.DetailedPrecisionRecallCurve
]

metrics_out = classification.evaluate_classification(
    groundtruths=groundtruths,
    predictions=predictions,
    metrics_to_return=metrics_to_return,
)

def foo(k, _class, _metric):
    # Select the DetailedPrecisionRecallCurve for the given label key
    # (metrics[0] holds the dataset1 curve, metrics[1] holds dataset2).
    if k == 'dataset1':
        m = metrics_out.metrics[0]['value'][_class]
    elif k == 'dataset2':
        m = metrics_out.metrics[1]['value'][_class]

    # Print the requested count (e.g. 'tn') for this class at each score threshold.
    for threshold in [x / 100 for x in range(5, 100, 5)]:
        print(f"{threshold:.2f}: {(m[threshold][_metric])}")

foo("dataset1", "bee", 'tn')

Issue Description

I have two datasets, each with one image. The two datasets use different label keys and different classes. Run separately, each produces its own metrics. Because the datasets are completely disjoint, those metrics should be identical to the ones produced when both datasets and their predictions are evaluated together.

The actual behavior is wildly inconsistent. If I have one labeled datum and one unlabeled datum, Valor does not treat the unlabeled datum as missing a label key and does not count it as a true negative:

from valor_core import classification, schemas, enums

def evaluate_image_clf_groundtruths():
    return [
        schemas.GroundTruth(
            datum=schemas.Datum(
                uid="uid0",
                metadata={
                    "height": 900,
                    "width": 300,
                },
            ),
            annotations=[
                schemas.Annotation(
                    labels=[
                        schemas.Label(key="dataset1", value="ant"),
                    ],
                ),
            ],
        ),
        schemas.GroundTruth(
            datum=schemas.Datum(
                uid="uid1",
                metadata={
                    "height": 900,
                    "width": 300,
                },
            ),
            annotations=[
                schemas.Annotation(
                    labels=[
                    ],
                ),
            ],
        ),
    ]

def evaluate_image_clf_predictions():
    return [
        schemas.Prediction(
            datum=schemas.Datum(
                uid="uid0",
                metadata={
                    "height": 900,
                    "width": 300,
                },
            ),
            annotations=[
                schemas.Annotation(
                    labels=[
                        schemas.Label(key="dataset1", value="ant", score=0.15),
                        schemas.Label(key="dataset1", value="bee", score=0.48),
                        schemas.Label(key="dataset1", value="cat", score=0.37),
                    ],
                )
            ],
        ),
    ]

groundtruths = evaluate_image_clf_groundtruths()
predictions = evaluate_image_clf_predictions()

metrics_to_return = [
    enums.MetricType.DetailedPrecisionRecallCurve
]

metrics_out = classification.evaluate_classification(
    groundtruths=groundtruths,
    predictions=predictions,
    metrics_to_return=metrics_to_return,
)

def foo(k, _class, _metric):
    if k == 'dataset1':
        m = metrics_out.metrics[0]['value'][_class]
    elif k == 'dataset2':
        m = metrics_out.metrics[1]['value'][_class]

    for threshold in [x / 100 for x in range(5, 100, 5)]:
        print(f"{threshold:.2f}: {(m[threshold][_metric])}")

foo("dataset1", "ant", 'tn')

If I have one labeled datum, one unlabeled datum, and a prediction under a second label key, Valor does not treat either datum as missing the second label key and produces no metrics for that key:

from valor_core import classification, schemas, enums

def evaluate_image_clf_groundtruths():
    return [
        schemas.GroundTruth(
            datum=schemas.Datum(
                uid="uid0",
                metadata={
                    "height": 900,
                    "width": 300,
                },
            ),
            annotations=[
                schemas.Annotation(
                    labels=[
                        schemas.Label(key="dataset1", value="ant"),
                    ],
                ),
            ],
        ),
        schemas.GroundTruth(
            datum=schemas.Datum(
                uid="uid1",
                metadata={
                    "height": 900,
                    "width": 300,
                },
            ),
            annotations=[
                schemas.Annotation(
                    labels=[
                    ],
                ),
            ],
        ),
    ]

def evaluate_image_clf_predictions():
    return [
        schemas.Prediction(
            datum=schemas.Datum(
                uid="uid0",
                metadata={
                    "height": 900,
                    "width": 300,
                },
            ),
            annotations=[
                schemas.Annotation(
                    labels=[
                        schemas.Label(key="dataset1", value="ant", score=0.15),
                        schemas.Label(key="dataset1", value="bee", score=0.48),
                        schemas.Label(key="dataset1", value="cat", score=0.37),
                    ],
                )
            ],
        ),
        schemas.Prediction(
            datum=schemas.Datum(
                uid="uid1",
                metadata={
                    "height": 900,
                    "width": 300,
                },
            ),
            annotations=[
                schemas.Annotation(
                    labels=[
                        schemas.Label(key="dataset2", value="egg", score=0.15),
                        schemas.Label(key="dataset2", value="milk", score=0.48),
                        schemas.Label(key="dataset2", value="flour", score=0.37),
                    ],
                )
            ],
        )
    ]

groundtruths = evaluate_image_clf_groundtruths()
predictions = evaluate_image_clf_predictions()

metrics_to_return = [
    enums.MetricType.DetailedPrecisionRecallCurve
]

metrics_out = classification.evaluate_classification(
    groundtruths=groundtruths,
    predictions=predictions,
    metrics_to_return=metrics_to_return,
)

def foo(k, _class, _metric):
    if k == 'dataset1':
        m = metrics_out.metrics[0]['value'][_class]
    elif k == 'dataset2':
        m = metrics_out.metrics[1]['value'][_class]

    for threshold in [x / 100 for x in range(5, 100, 5)]:
        print(f"{threshold:.2f}: {(m[threshold][_metric])}")

foo("dataset1", "ant", 'tn')

Yet once I label the previously unlabeled datum with the second label key, as in the first reproducible example at the top of this issue, Valor behaves as if datum 0 also had a dataset2 label key and datum 1 also had a dataset1 label key.

If this really were a true negative, it should be returned as an example (but it is not). I asked for one example in my detailed PR curve, and it did not return one despite counting a true negative.

Expected Behavior

Be Consistent. Be Correct.

The TP/FP/TN/FN counts for an evaluation on label key dataset1 should be the same regardless of whether dataset2 is present.
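As a sketch of the check I would expect to pass (reusing the ground truth and prediction helpers from the first reproducible example at the top of this issue, and assuming, as in the foo helper above, that metrics[0] holds the dataset1 curve):

# Reuse the helpers from the first reproducible example (both datums labeled,
# disjoint label keys).
gts = evaluate_image_clf_groundtruths()
preds = evaluate_image_clf_predictions()
curve = [enums.MetricType.DetailedPrecisionRecallCurve]

# Evaluate both label keys together...
joint = classification.evaluate_classification(
    groundtruths=gts,
    predictions=preds,
    metrics_to_return=curve,
)

# ...and dataset1 on its own (uid0 is the only dataset1 datum).
solo = classification.evaluate_classification(
    groundtruths=gts[:1],
    predictions=preds[:1],
    metrics_to_return=curve,
)

# Expected: the dataset1 curve is identical in both runs. Today it is not,
# because the joint run also counts uid1 as a true negative for dataset1 classes.
assert joint.metrics[0]["value"] == solo.metrics[0]["value"]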

ntlind commented 2 months ago

Hey Justin - thanks for posting this issue!

If I have one labeled datum and one unlabeled datum, Valor does not treat the unlabeled datum as missing a label key and does not count it as a true negative.

In this first example, the issue lies with the idea of an "unlabeled datum": since the second GroundTruth doesn't have any labels, it's actually not considered a GroundTruth at all and never makes it into groundtruth_df (this also means that uid1 is never considered to be a separate datum in this evaluation). This is expected behavior at the moment, but it might be wise for us to throw an error if the user tries to pass a GroundTruth without any labels. I'll leave this issue open to discuss this change with the rest of the team in the future.
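For anyone who wants to guard against this in the meantime, a minimal pre-check along these lines would catch label-less GroundTruths before they are silently dropped. This is not part of valor today, just a sketch that assumes the GroundTruth/Annotation schemas expose their fields as attributes, matching the constructors used above:

def assert_all_groundtruths_labeled(groundtruths):
    # A GroundTruth whose annotations carry no labels never makes it into
    # groundtruth_df, so flag it up front rather than dropping it silently.
    for gt in groundtruths:
        if not any(annotation.labels for annotation in gt.annotations):
            raise ValueError(
                f"GroundTruth for datum '{gt.datum.uid}' has no labels "
                "and would be excluded from the evaluation."
            )

assert_all_groundtruths_labeled(evaluate_image_clf_groundtruths())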

If this really were a true negative, it should be returned as an example (but it is not). I asked for one example in my detailed PR curve, and it did not return one despite counting a true negative.

Good call-out. This was a bug where we didn't include true negative examples of this kind in the DetailedPRCurve output. This will be fixed in #744.

ntlind commented 1 month ago

Closing this out, as we've decided to remove label keys entirely, which should resolve the confusion surrounding this issue.