DIUx-xView / xview3-reference

Reference data processing code and model for the xView3 prize challenge.

Is loc_fscore_shore computed correctly? #1

Closed BloodAxe closed 3 years ago

BloodAxe commented 3 years ago

Greetings! I found a strange issue when computing the score using ground-truth data as the predictions. I expected a perfect 1.0 score for all components; however, the computed loc_fscore_shore is less than 1.0 for some reason.

Here's a snippet to reproduce it:

import os
import pandas as pd
from metric import score  # metric implementation from this repo; adjust the import path if needed

# data_dir points at the directory holding the label CSVs and shoreline files
valid_df = pd.read_csv(os.path.join(data_dir, "validation.csv"))
print(score(valid_df, valid_df, shore_root=os.path.join(data_dir, "validation")))

train_df = pd.read_csv(os.path.join(data_dir, "train.csv"))
print(score(train_df, train_df, shore_root=os.path.join(data_dir, "train")))

This outputs:

# For validation.csv
{'loc_fscore': 1.0, 'loc_fscore_shore': 0.9853892215568862, 'vessel_fscore': 1.0, 'fishing_fscore': 1.0, 'length_acc': 1.0, 'aggregate': 0.9970778443113772}

# For train.csv
{'loc_fscore': 1.0, 'loc_fscore_shore': 0.7665317139001351, 'vessel_fscore': 1.0, 'fishing_fscore': 1.0, 'length_acc': 1.0, 'aggregate': 0.953306342780027}

Am I missing something, or is this really a bug in the implementation?

PS: It's also worth noting that get_shore_preds fails when the scene in question has no shoreline. A quick fix is to add a check for whether shoreline_contours is empty:

  shoreline_contours = np.load(f"{shoreline_root}/{scene_id}_shoreline.npy", allow_pickle=True)
+ if len(shoreline_contours) == 0:
+     return pd.DataFrame()

Without this patch I was not able to run this self-check of the metric implementation.

RitwikGupta commented 3 years ago

(1) We set a distance tolerance within which we consider a detection to be "correct". The default value is 200 m (as documented in the function), and it is what we'll use for scoring. To handle the general case where there may be more than one detection within that tolerance (particularly when vessels are close together), we use a Hungarian matching algorithm to assign predictions to ground truth in such a way that the global cost is minimized (see the documentation for more details).
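For intuition, here is a minimal sketch of how such distance-gated Hungarian matching can look. This is illustrative only, not the exact code in this repo; it assumes prediction and ground-truth coordinates have already been converted to meters and uses scipy's linear_sum_assignment:

import numpy as np
from scipy.optimize import linear_sum_assignment

def match_detections(pred_m, gt_m, tolerance_m=200.0):
    # pred_m: (N, 2) and gt_m: (M, 2) arrays of detection coordinates in meters.
    # Cost matrix of pairwise Euclidean distances between predictions and ground truth.
    cost = np.linalg.norm(pred_m[:, None, :] - gt_m[None, :, :], axis=-1)
    # Hungarian algorithm: one-to-one assignment minimizing the total distance.
    rows, cols = linear_sum_assignment(cost)
    # Only assigned pairs within the tolerance count as true positives.
    matches = [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= tolerance_m]
    tp = len(matches)
    fp = len(pred_m) - tp  # predictions left unmatched or outside the tolerance
    fn = len(gt_m) - tp    # ground-truth points left unmatched or outside the tolerance
    return matches, tp, fp, fn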

(2) However, when scoring any given set of labels (e.g. validation or training) against itself, the "predictions" include vessels within 2.2 km of shore while the "ground truth" includes vessels within 2 km of shore. Thus you'll occasionally have an extra vessel within the 200 m tolerance, which is what takes the F1 down a bit from 1.0 for the "close-to-shore" metric under the default distance tolerance.
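To make the arithmetic concrete, here's a toy calculation with made-up counts (not the actual xView3 numbers): if every close-to-shore ground-truth vessel is matched, but a few extra predictions from the 2.0-2.2 km band have no counterpart within 200 m, those extras count as false positives and pull the F1 slightly below 1.0:

tp, fp, fn = 100, 3, 0                  # hypothetical counts, for illustration only
precision = tp / (tp + fp)              # 100 / 103 ~= 0.971
recall = tp / (tp + fn)                 # 1.0
f1 = 2 * precision * recall / (precision + recall)
print(f1)                               # ~0.985 -- close-to-shore F1 just under 1.0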

(3) If you rerun the code with the tolerance kwarg set as tolerance=0.5 -- so, less than a meter -- you get the result you'd expect for both train and validation:

{'loc_fscore': 1.0, 'loc_fscore_shore': 1.0, 'vessel_fscore': 1.0, 'fishing_fscore': 1.0, 'length_acc': 1.0, 'aggregate': 1.0}
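For completeness, the re-run producing the output above would look roughly like this (a sketch assuming the keyword is spelled tolerance, as mentioned; the calls are otherwise the same as in the original snippet):

print(score(valid_df, valid_df, shore_root=os.path.join(data_dir, "validation"), tolerance=0.5))
print(score(train_df, train_df, shore_root=os.path.join(data_dir, "train"), tolerance=0.5))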

We'll add a bit to the documentation to make this clearer.

(4) Re: the patch -- we didn't expect folks to run this on the train set; in hindsight, we should have tested this. You're correct re: the no-shoreline issue when running on the train set specifically (this is not an issue for the val set). We'll add this patch in, thanks for catching it!

(5) Only the validation set has a significant number of close-to-shore detections -- so think about how best to make use of these scenes.