keighrim opened this issue 2 months ago
Recent conversation added some changes to the plan. Specifically, in addition to the existing k-fold validation during training, we'd like to build a new fixed validation set for a conventional training-validation workflow. That new set will be based on the `2-X_pbd` subset of the "challenging images" dataset, which leaves only `1-1_bm` to be added to the training data.
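For concreteness, here's a minimal sketch of how that split could be assembled; the directory layout, file extension, and `collect_split` helper are assumptions, not the actual annotation layout:

```python
from pathlib import Path

# Hypothetical layout: each annotated image sits under a directory named after
# its subset (e.g. "2-X_pbd/", "1-1_bm/", ...); adjust to the real annotation layout.
DATA_ROOT = Path("annotations")      # assumed location
FIXED_VAL_SUBSET = "2-X_pbd"         # held out as the fixed validation set;
                                     # everything else, incl. 1-1_bm, goes to training

def collect_split(root: Path):
    """Return (train_images, fixed_val_images) based on subset directory names."""
    train, fixed_val = [], []
    for img in root.glob("*/*.jpg"):
        (fixed_val if img.parent.name == FIXED_VAL_SUBSET else train).append(img)
    return train, fixed_val

train_imgs, val_imgs = collect_split(DATA_ROOT)
print(f"{len(train_imgs)} training / {len(val_imgs)} fixed-validation images")
```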
Regarding the "impact" of the new data, we'd also like to use the fixed validation set to evaluate models trained on different subsets of the training data, which should show the impact of incremental annotation over the past year or so.
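A sketch of how that comparison might be run, assuming one checkpoint per cumulative slice of annotations and a shared fixed-validation data loader (`fixed_val_loader`); the checkpoint names/paths and the `evaluate_on` helper are stand-ins, not existing project code:

```python
import torch

# One checkpoint per cumulative slice of annotations; names and paths are illustrative.
CHECKPOINTS = {
    "round-1 only": "models/round1.pt",
    "rounds 1-2": "models/round12.pt",
    "rounds 1-3": "models/round123.pt",
}

@torch.no_grad()
def evaluate_on(model: torch.nn.Module, loader) -> float:
    """Plain accuracy on the fixed validation loader; swap in macro P/R/F as needed."""
    model.eval()
    correct = total = 0
    for images, labels in loader:
        preds = model(images).argmax(dim=1)
        correct += (preds == labels).sum().item()
        total += labels.numel()
    return correct / max(total, 1)

# for name, path in CHECKPOINTS.items():
#     model = torch.load(path)              # or rebuild the backbone + load_state_dict
#     print(name, evaluate_on(model, fixed_val_loader))
```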
Also, we decided to run some additional experiments re-introducing "pre-binning" to see the effect of competing, but not-so-interesting, labels (vs. total disregard of irrelevant labels as `-`). Potentially all the labels can be "of interest" for future tasks, so unless keeping all the competing labels significantly degrades the accuracy of the model, we'd like to keep them.
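To make the comparison concrete, here is a rough sketch of the two label treatments; the label names are assumed examples and this is not the project's actual binning scheme, only the keep-vs-collapse contrast described above:

```python
# Illustrative only: the set of "of interest" labels and the treatment of the rest.
LABELS_OF_INTEREST = {"slate", "chyron", "credits"}   # assumed examples, not the real set

def map_label(raw: str, prebinning: bool) -> str:
    """With pre-binning, competing-but-uninteresting labels keep their own classes;
    without it, anything not of interest collapses into the negative label '-'."""
    if raw in LABELS_OF_INTEREST:
        return raw
    return raw if prebinning else "-"

assert map_label("bars", prebinning=True) == "bars"   # stays a (competing) class
assert map_label("bars", prebinning=False) == "-"     # disregarded as negative
```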
Copying a message from @owencking over Slack today:
> I finished creating that "PBD" evaluation set. The linked file includes images, labels, and some documentation. There is no overlap between the assets in this set and the assets used in any training set. Moreover, there is no (or minimal) overlap between the programs/series in this set and those in the training sets. Hence, this set can serve as a benchmark test set across rounds of training. Furthermore, it can serve as a check against overfitting.
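If we want to sanity-check that disjointness on our end, a minimal sketch (the directory names and the GUID-from-filename convention are assumptions):

```python
from pathlib import Path

# Quick check of the stated property: no asset (GUID) overlap between the PBD
# evaluation set and any training set. Directory names and the "<guid>_<frame>.jpg"
# naming convention are assumptions.
def guids(image_dir: str) -> set:
    return {p.stem.split("_")[0] for p in Path(image_dir).glob("*.jpg")}

overlap = guids("train_images") & guids("pbd_eval_images")
assert not overlap, f"shared assets between training and PBD eval: {sorted(overlap)}"
```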
I ended up running a set of experiments using the same hyperparameters as the latest version, but adding other ConvNeXt backbones of different sizes, as Owen requested. For this experiment, my primary goal was to see the impact of training data size.
Here are the plotted results: exp-trainsize.zip
In the plots, `Overall` shows the macro average of P/R/F scores across all 19 (18 + neg) raw labels (no subtypes), and `all` shows individual P/R/F scores for select raw labels.
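For reference, the `Overall` numbers correspond to a macro average of per-label P/R/F over the 19 classes; a sketch with dummy labels, using scikit-learn here just for illustration:

```python
from sklearn.metrics import precision_recall_fscore_support

# Dummy gold/predicted labels over the 19 classes (18 raw labels + neg), for illustration.
labels = list(range(19))
y_true = [0, 1, 2, 18, 5, 5, 7]
y_pred = [0, 1, 3, 18, 5, 6, 7]

# the "Overall" numbers: macro average over all 19 classes
p, r, f, _ = precision_recall_fscore_support(
    y_true, y_pred, labels=labels, average="macro", zero_division=0)
print(f"macro P={p:.3f} R={r:.3f} F={f:.3f}")

# the per-label numbers: one P/R/F triple per class
per_p, per_r, per_f, _ = precision_recall_fscore_support(
    y_true, y_pred, labels=labels, average=None, zero_division=0)
```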
Because
With the additional data (https://github.com/clamsproject/aapb-annotations/pull/98) and new `data_loader.py` code (#115), I'd like to conduct experiments with new models and see if (and how much) the additional data helps improve the classification performance.

Done when
Additional context
No response