Open keighrim opened 2 hours ago
Copying a message from @owencking over Slack today, with proposals for new binning schemes:
I have continued thinking about how to bin labels to get meaningful cross-entropy scores during training and hyperparameter tuning. I came up with a few different binnings that might be meaningful for us. Please have a look at this file, and see what you think. However, I know it is important to have a single one to optimize against. I think "Overall-strict" and "Overall-simple" would be the best choices. If I had to choose one, I think I would choose "Overall-simple" because it will effectively ignore a lot of noise that I believe exists for the "M" and "O" labels. (This recommendation supersedes the proposed binning I suggested during our last Monday meeting.)
```json
{
  "Overall-strict": {
    "Bars": ["B"],
    "Slate": ["S", "S:H", "S:C", "S:D", "S:B", "S:G"],
    "Chyron-person": ["I", "N"],
    "Credits": ["C", "R"],
    "Main": ["M"],
    "Opening": ["O", "W"],
    "Chyron-other": ["Y", "U", "K"],
    "Other-text": ["L", "G", "F", "E", "T"],
    "Neg": ["P", ""]
  },
  "Overall-simple": {
    "Bars": ["B"],
    "Slate": ["S", "S:H", "S:C", "S:D", "S:B", "S:G"],
    "Chyron-person": ["I", "N"],
    "Credits": ["C", "R"],
    "Other-text": ["M", "O", "W", "Y", "U", "K", "L", "G", "F", "E", "T"],
    "Neg": ["P", ""]
  },
  "Overall-relaxed": {
    "Bars": ["B"],
    "Slate": ["S", "S:H", "S:C", "S:D", "S:B", "S:G"],
    "Chyron": ["I", "N", "Y", "U", "K"],
    "Credits": ["C", "R"],
    "Other-text": ["M", "O", "W", "L", "G", "F", "E", "T"],
    "Neg": ["P", ""]
  },
  "Bars": {
    "Bars": ["B"],
    "Other": ["S", "S:H", "S:C", "S:D", "S:B", "S:G", "I", "N", "Y", "U", "K", "C", "R", "M", "O", "W", "L", "G", "F", "E", "T", "P", ""]
  },
  "Slate": {
    "Slate": ["S", "S:H", "S:C", "S:D", "S:B", "S:G"],
    "Other": ["B", "I", "N", "Y", "U", "K", "C", "R", "M", "O", "W", "L", "G", "F", "E", "T", "P", ""]
  },
  "Chyron-strict": {
    "Chyron-person": ["I", "N"],
    "Other": ["B", "S", "S:H", "S:C", "S:D", "S:B", "S:G", "Y", "U", "K", "C", "R", "M", "O", "W", "L", "G", "F", "E", "T", "P", ""]
  },
  "Chyron-relaxed": {
    "Chyron": ["I", "N", "Y", "U", "K"],
    "Other": ["B", "S", "S:H", "S:C", "S:D", "S:B", "S:G", "C", "R", "M", "O", "W", "L", "G", "F", "E", "T", "P", ""]
  },
  "Credits": {
    "Credits": ["C", "R"],
    "Other": ["B", "S", "S:H", "S:C", "S:D", "S:B", "S:G", "I", "N", "Y", "U", "K", "M", "O", "W", "L", "G", "F", "E", "T", "P", ""]
  }
}
```
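For concreteness, here is a minimal sketch (not the actual training code; the `BINS` dict is an excerpt of the "Overall-simple" scheme above, and `invert` is a hypothetical helper) of how a scheme like these could be turned into a flat label→bin lookup for remapping raw labels before computing cross-entropy:

```python
# Excerpt of the "Overall-simple" scheme from the proposal above.
BINS = {
    "Overall-simple": {
        "Bars": ["B"],
        "Slate": ["S", "S:H", "S:C", "S:D", "S:B", "S:G"],
        "Chyron-person": ["I", "N"],
        "Credits": ["C", "R"],
        "Other-text": ["M", "O", "W", "Y", "U", "K", "L", "G", "F", "E", "T"],
        "Neg": ["P", ""],
    }
}

def invert(scheme: dict) -> dict:
    """Turn a {bin_name: [raw_labels]} scheme into a flat {raw_label: bin_name} lookup."""
    return {label: bin_name
            for bin_name, labels in scheme.items()
            for label in labels}

label_to_bin = invert(BINS["Overall-simple"])
# e.g. raw labels "M" and "O" both collapse into "Other-text" under this scheme
print(label_to_bin["M"], label_to_bin["O"])
```

One practical check this makes easy: asserting that every raw label appears in exactly one bin of a scheme, so no frame is silently dropped or double-counted during remapping.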
Because
As mentioned in https://github.com/clamsproject/app-swt-detection/issues/116#issuecomment-2400092529, we want to re-evaluate the effectiveness of pre-binning.
Prebinning was originally implemented in #19, and various binary and multi-class binning configurations (proposed by @haydenmccormick) were experimented with during the round 2 experiments, leading up to the first release of the app+model with "3-way" prebinning.
https://github.com/clamsproject/app-swt-detection/blob/v1.0/modeling/config/default.yml#L33-L42
(Detailed results from the R2 experiments are recorded in the R2-multiclass and R2-binary tabs of a privately shared spreadsheet.) The prebinning was later replaced with an almost identical "4-way" post-binning scheme, based on evidence from the round 4 experiments (#63).
Post-binning was later removed from the model configuration entirely when the stitcher code was isolated as an independent module, and postbinning became part of the stitcher (#106).
In a recent conversation, we discussed re-assessing the prebinning schemes. This issue is for discussing the implementation and execution, and for tracking results from the new round of experiments.
Done when
PBD validation set as the ground truth set

Additional context
No response