Striveworks / valor

Valor is a centralized evaluation store which makes it easy to measure, explore, and rank model performance.
https://striveworks.github.io/valor/

BUG: Valor Core Detection Incorrectly Assigns True Positives #735

Closed: jqu-striveworks closed this issue 2 months ago

jqu-striveworks commented 2 months ago

Reproducible Example

from valor_core import detection, schemas

def _images(n) -> list[schemas.Datum]:
    return [
        schemas.Datum(
            uid=f"{i}",
            metadata={
                "height": 1000,
                "width": 2000,
            },
        )
        for i in range(n)
    ]

def evaluate_detection_functional_test_groundtruths(
    images,
) -> list[schemas.GroundTruth]:
    """Creates a dataset called "test_dataset" with some ground truth
    detections. These detections are taken from a torchmetrics unit test (see test_metrics.py)
    """

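    # Two overlapping ground-truth boxes on a single image, both with label "1"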
    gts_per_img = [
        {
            "boxes": [[10, 10, 20, 20], [10, 15, 20, 25]],
            "labels": ["1", "1"],
        },
    ]

    return [
        schemas.GroundTruth(
            datum=image,
            annotations=[
                schemas.Annotation(
                    labels=[schemas.Label(key="class", value=class_label)],
                    bounding_box=schemas.Box.from_extrema(
                        xmin=box[0],
                        ymin=box[1],
                        xmax=box[2],
                        ymax=box[3],
                    ),
                    is_instance=True,
                )
                for box, class_label in zip(gts["boxes"], gts["labels"])
            ],
        )
        for gts, image in zip(gts_per_img, images)
    ]

def evaluate_detection_functional_test_predictions(
    images,
) -> list[schemas.Prediction]:
    """Creates a model called "test_model" with some predicted
    detections on the dataset "test_dataset". These predictions are taken
    from a torchmetrics unit test (see test_metrics.py)
    """

    '''
    ## Valor Core Finds [FP, TP]
    preds_per_img = [
        {
            "boxes": [
                [10, 10, 20, 20],
                [10, 12, 20, 22],
            ],
            "scores": [0.78, 0.96],
            "labels": ["1", "1"],
        }
    ]
    '''

    ## Valor Core Finds [TP, TP, FP]
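    ## The third box [101, 101, 102, 102] has no overlap (zero IoU) with either ground truth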
    preds_per_img = [
        {
            "boxes": [
                [10, 10, 20, 20],
                [10, 12, 20, 22],
                [101, 101, 102, 102],
            ],
            "scores": [0.78, 0.96, 0.87],
            "labels": ["1", "1", "1"],
        }
    ]

    db_preds_per_img = [
        schemas.Prediction(
            datum=image,
            annotations=[
                schemas.Annotation(
                    labels=[
                        schemas.Label(
                            key="class", value=class_label, score=score
                        )
                    ],
                    bounding_box=schemas.Box.from_extrema(
                        xmin=box[0],
                        ymin=box[1],
                        xmax=box[2],
                        ymax=box[3],
                    ),
                    is_instance=True,
                )
                for box, class_label, score in zip(
                    preds["boxes"], preds["labels"], preds["scores"]
                )
            ],
        )
        for preds, image in zip(preds_per_img, images)
    ]

    return db_preds_per_img

imgs = _images(1)
groundtruths = evaluate_detection_functional_test_groundtruths(imgs)
predictions = evaluate_detection_functional_test_predictions(imgs)

metrics_out = detection.evaluate_detection(
    groundtruths=groundtruths,
    predictions=predictions,
)

for i in metrics_out.metrics:
    print(i)

Issue Description

With two predictions overlapping the same ground truth, Valor assigns 1 TP and 1 FP.

Add a completely random prediction of the same class with zero IoU against the ground truths, and suddenly both of the original predictions are counted as true positives.

Expected Behavior

Correctly identify the TPs: adding a prediction that overlaps no ground truth should not change the TP/FP assignments of the existing predictions.
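For reference, here is a minimal sketch of how COCO-style greedy matching would assign these predictions. This is not Valor's internal code; the iou and greedy_match helpers and the 0.5 IoU threshold are illustrative assumptions. Predictions are visited in descending score order and each is matched to the unmatched ground truth with the highest IoU above the threshold; anything left unmatched is a FP, and adding an unmatched prediction cannot change earlier assignments.

def iou(a, b):
    """IoU of two [xmin, ymin, xmax, ymax] boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (
        (a[2] - a[0]) * (a[3] - a[1])
        + (b[2] - b[0]) * (b[3] - b[1])
        - inter
    )
    return inter / union if union else 0.0

def greedy_match(gt_boxes, pred_boxes, scores, iou_threshold=0.5):
    """Return "TP"/"FP" per prediction, in the original prediction order."""
    order = sorted(range(len(pred_boxes)), key=lambda i: scores[i], reverse=True)
    matched = set()
    results = [None] * len(pred_boxes)
    for i in order:
        best_iou, best_gt = 0.0, None
        for g, gt in enumerate(gt_boxes):
            if g in matched:
                continue  # each ground truth can be matched at most once
            overlap = iou(pred_boxes[i], gt)
            if overlap >= iou_threshold and overlap > best_iou:
                best_iou, best_gt = overlap, g
        if best_gt is None:
            results[i] = "FP"
        else:
            matched.add(best_gt)
            results[i] = "TP"
    return results

gts = [[10, 10, 20, 20], [10, 15, 20, 25]]
preds = [[10, 10, 20, 20], [10, 12, 20, 22], [101, 101, 102, 102]]
scores = [0.78, 0.96, 0.87]
print(greedy_match(gts, preds, scores))  # ['FP', 'TP', 'FP'] at IoU 0.5

Under this scheme the stray box only adds one more FP; the first two predictions keep the same [FP, TP] assignment they receive when evaluated on their own.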

czaloom commented 2 months ago

It looks like there are two related issues here:

  1. Aggregate metrics should not include ignored labels.
  2. AR should not include ignored labels.

jqu-striveworks commented 2 months ago

I think you posted on the wrong issue.