Labels are wrongly associated with scores

marcverhagen commented 11 months ago

Because

Somewhere along the line we started assigning the wrong label, probably because the way we dealt with configurations has changed several times.

The config file modeling/config/trainer.yml has:

bins:
  pre:
    slate:
      - "S"
    chyron:
      - "I"
      - "N"
      - "Y"
    credit:
      - "C"

And the classifier config in /modeling/models/20231214-193543.convnext_tiny.kfold_config.yml has

labels:
- slate
- chyron
- credit

But the method train.get_final_label_names() returns

['chyron', 'credit', 'slate', 'NEG']

and since that method is eventually used to set labels on an instance of Prediction we get a mismatch.

marcverhagen commented 11 months ago

This was fixed in https://github.com/clamsproject/app-swt-detection/commit/71b57a224438969aa1116a25ef6d8b0a103d6a1f by using the labels field from the model config file. An alternative would be to change

bins:
  pre:
    slate:
      - "S"
    chyron:
      - "I"
      - "N"
      - "Y"
    credit:
      - "C"

into

bins:
  pre:
    - slate:
      - "S"
    - chyron:
      - "I"
      - "N"
      - "Y"
    - credit:
      - "C"

and update train.get_final_label_names().

keighrim commented 11 months ago

This is probably because of this; https://github.com/yaml/pyyaml/issues/110.

Trainer config: https://github.com/clamsproject/app-swt-detection/blob/f033551e1e53926fac6b8a03095980402ca2d06a/modeling/config/trainer.yml#L43-L52

After yaml.dump-ed (exported config), keys are sorted: https://github.com/clamsproject/app-swt-detection/blob/f033551e1e53926fac6b8a03095980402ca2d06a/modeling/models/20231214-193543.convnext_tiny.kfold_config.yml#L1-L10

Considering added label mapping complexity when we add postbin handling in a very near future, I don't think classifier relying on the labels values from the model_config is a good idea, simply because that value is the "final labels" after pre+post binning. To that end, I think we should remove the labels key all together from export process to eliminate a big source of confusion. https://github.com/clamsproject/app-swt-detection/blob/f033551e1e53926fac6b8a03095980402ca2d06a/modeling/train.py#L243-L244

To fix the original wrong association issue, just adding sort_keys=False argument to yaml.dump call should be enough.

marcverhagen commented 11 months ago

Agreed on both counts. I hadn't looked closely at the new export code you had added. And indeed having both bins and labels in the model configuration is redundant. I am doing some cleanup of the code now and will include those fixes.

marcverhagen commented 11 months ago

Fixed in https://github.com/clamsproject/app-swt-detection/commit/0ab78e0c3eda4a601aee021cbf9b05eb77249a5b.

clamsproject / app-swt-detection

Labels are wrongly associated with scores #43

Because