keighrim opened this issue 2 months ago
Recent conversation added some changes to the plan. Specifically, in addition to the existing k-fold validation during training, we'd like to build a new fixed validation set for a conventional training-validation workflow. That new set will be based on the `2-X_pbd` subset of the "challenging images" dataset, which leaves only `1-1_bm` to be added to the training data.
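For concreteness, here's a minimal sketch of how that split could be assembled; the directory layout, file extension, and `collect_split` helper are assumptions, not the actual annotation layout:

```python
from pathlib import Path

# Hypothetical layout: each annotated image sits under a directory named after
# its subset (e.g. "2-X_pbd/", "1-1_bm/", ...); adjust to the real annotation layout.
DATA_ROOT = Path("annotations")      # assumed location
FIXED_VAL_SUBSET = "2-X_pbd"         # held out as the fixed validation set;
                                     # everything else, incl. 1-1_bm, goes to training

def collect_split(root: Path):
    """Return (train_images, fixed_val_images) based on subset directory names."""
    train, fixed_val = [], []
    for img in root.glob("*/*.jpg"):
        (fixed_val if img.parent.name == FIXED_VAL_SUBSET else train).append(img)
    return train, fixed_val

train_imgs, val_imgs = collect_split(DATA_ROOT)
print(f"{len(train_imgs)} training / {len(val_imgs)} fixed-validation images")
```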
Regarding the "impact" of the new data, we'd also like to use the fixed validation set to evaluate models trained on different subsets of the training data, which should show the impact of incremental annotation over the past year or so.
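A sketch of how that comparison might be run, assuming one checkpoint per cumulative slice of annotations and a shared fixed-validation data loader (`fixed_val_loader`); the checkpoint names/paths and the `evaluate_on` helper are stand-ins, not existing project code:

```python
import torch

# One checkpoint per cumulative slice of annotations; names and paths are illustrative.
CHECKPOINTS = {
    "round-1 only": "models/round1.pt",
    "rounds 1-2": "models/round12.pt",
    "rounds 1-3": "models/round123.pt",
}

@torch.no_grad()
def evaluate_on(model: torch.nn.Module, loader) -> float:
    """Plain accuracy on the fixed validation loader; swap in macro P/R/F as needed."""
    model.eval()
    correct = total = 0
    for images, labels in loader:
        preds = model(images).argmax(dim=1)
        correct += (preds == labels).sum().item()
        total += labels.numel()
    return correct / max(total, 1)

# for name, path in CHECKPOINTS.items():
#     model = torch.load(path)              # or rebuild the backbone + load_state_dict
#     print(name, evaluate_on(model, fixed_val_loader))
```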
Also, we decided to run some additional experiments re-introducing "pre-binning" to see the effect of competing, but not-so-interesting, labels (vs. total disregard of irrelevant labels as `-`). Potentially all the labels can be "of interest" for future tasks, so unless keeping all the competing labels significantly degrades the accuracy of the model, we'd like to keep them.
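To make the comparison concrete, here is a rough sketch of the two label treatments; the label names are assumed examples and this is not the project's actual binning scheme, only the keep-vs-collapse contrast described above:

```python
# Illustrative only: the set of "of interest" labels and the treatment of the rest.
LABELS_OF_INTEREST = {"slate", "chyron", "credits"}   # assumed examples, not the real set

def map_label(raw: str, prebinning: bool) -> str:
    """With pre-binning, competing-but-uninteresting labels keep their own classes;
    without it, anything not of interest collapses into the negative label '-'."""
    if raw in LABELS_OF_INTEREST:
        return raw
    return raw if prebinning else "-"

assert map_label("bars", prebinning=True) == "bars"   # stays a (competing) class
assert map_label("bars", prebinning=False) == "-"     # disregarded as negative
```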
Copying a message from @owencking over Slack today:
> I finished creating that "PBD" evaluation set. The linked file includes images, labels, and some documentation. There is no overlap between the assets in this set and the assets used in any training set. Moreover, there is no (or minimal) overlap between the programs/series in this set and those in the training sets. Hence, this set can serve as a benchmark test set across rounds of training. Furthermore, it can serve as a check against overfitting.
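If we want to sanity-check that disjointness on our end, a minimal sketch (the directory names and the GUID-from-filename convention are assumptions):

```python
from pathlib import Path

# Quick check of the stated property: no asset (GUID) overlap between the PBD
# evaluation set and any training set. Directory names and the "<guid>_<frame>.jpg"
# naming convention are assumptions.
def guids(image_dir: str) -> set:
    return {p.stem.split("_")[0] for p in Path(image_dir).glob("*.jpg")}

overlap = guids("train_images") & guids("pbd_eval_images")
assert not overlap, f"shared assets between training and PBD eval: {sorted(overlap)}"
```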
I ended up running a set of experiments using the same hyperparameters as the latest version, but adding other ConvNeXt backbones of different sizes, as Owen requested. For this experiment, my primary goal was to see the impact of training data size.
Here are the plotted results: exp-trainsize.zip
In the plots, `Overall` shows the macro average of P/R/F scores across all 19 (18 + neg) raw labels (no subtypes), and `all` shows individual P/R/F scores for select raw labels.
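For reference, the `Overall` numbers correspond to a macro average of per-label P/R/F over the 19 classes; a sketch with dummy labels, using scikit-learn here just for illustration:

```python
from sklearn.metrics import precision_recall_fscore_support

# Dummy gold/predicted labels over the 19 classes (18 raw labels + neg), for illustration.
labels = list(range(19))
y_true = [0, 1, 2, 18, 5, 5, 7]
y_pred = [0, 1, 3, 18, 5, 6, 7]

# the "Overall" numbers: macro average over all 19 classes
p, r, f, _ = precision_recall_fscore_support(
    y_true, y_pred, labels=labels, average="macro", zero_division=0)
print(f"macro P={p:.3f} R={r:.3f} F={f:.3f}")

# the per-label numbers: one P/R/F triple per class
per_p, per_r, per_f, _ = precision_recall_fscore_support(
    y_true, y_pred, labels=labels, average=None, zero_division=0)
```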
Because
With the additional data (https://github.com/clamsproject/aapb-annotations/pull/98) and new `data_loader.py` code (#115), I'd like to conduct experiments with new models and see if (and how much) the additional data helps improve the classification performance.

Done when
Additional context
No response