Closed TomasBeuzen closed 3 years ago
ok, it's taken me too long to circle back to this, sorry. I definitely think this is a worthy approach; it seems excellent, esp. for this binary classification task, which looks very learnable.
I think also by iterating over all the webcat images we are going to see some interesting things. I guess there are 2 questions/comments on my mind:
A few thoughts:
I could probably get this up and running soon to create a dataset of say 1-2k images to see how they look.
let me check w/ @somyamohanty re: model architecture. He was experimenting again last week.
I think one aspect to think about is making sure we don't reinforce bias in the training dataset. Hence my question about the threshold: perhaps reducing the threshold introduces samples that are different from the training data. Or we could think about the labels as 'noisy'.
@somyamohanty also brought up the possibility of trying adversarial testing with any model that does pseudo-labeling.
as for architectures: https://github.com/UNCG-DAISY/TinyDuneCollision/tree/master/src
the one we are going w/ currently for the tiny work is here: https://github.com/UNCG-DAISY/TinyDuneCollision/blob/master/src/TinyML_End_to_End.ipynb
though our current hardware implementation might not be restricted to 96px by 96px
Yeah it's a good point about confirmation bias - one form of "regularization" I've seen in the past is to force a certain number of labels per batch, so say, in a batch of 100 images, we must label 10 of them (so the threshold would be dynamic in that sense).
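To make the "force a certain number of labels per batch" idea concrete, here's a minimal sketch (the function name and the confidence measure are my assumptions, not anything from the linked repo): instead of a fixed probability cutoff, we accept the k most confident predictions in each batch, so the effective threshold floats with the batch.

```python
import numpy as np

def pseudo_label_top_k(probs, k=10):
    """Hypothetical sketch: pseudo-label exactly k images per batch, choosing
    the k the binary classifier is most confident about (confidence measured
    as distance of the predicted probability from 0.5). The acceptance
    threshold is therefore dynamic per batch rather than fixed."""
    probs = np.asarray(probs)
    confidence = np.abs(probs - 0.5)        # 0 = totally unsure, 0.5 = certain
    chosen = np.argsort(confidence)[-k:]    # indices of the k most confident
    labels = (probs[chosen] >= 0.5).astype(int)
    return chosen, labels

# toy batch of 100 predicted collision probabilities
rng = np.random.default_rng(0)
batch_probs = rng.uniform(0, 1, size=100)
idx, labels = pseudo_label_top_k(batch_probs, k=10)
print(len(idx))  # always 10, regardless of any fixed cutoff
```

The point is just that every batch contributes the same number of pseudo-labels, so a batch full of "meh" predictions still gets labelled rather than being skipped entirely.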
In this case, I think the pseudo-labelling will work great for getting the obvious collisions and obvious no-collisions. My thinking would be to dump all the unlabelled ("unsure") data in a folder and effectively do "active learning", where we manually label those "very unsure" images at some regular interval, say, whenever there are 100 images in the folder. In this way we get the efficiency of the automated pseudo-labeller, with the hopefully improved accuracy of the manually labelled "unsure" images.
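The triage step above could be sketched like this (thresholds and names are illustrative assumptions, not anything from the notebook): confident predictions get pseudo-labels, and everything in between goes to an "unsure" bucket for manual review.

```python
import numpy as np

def triage(probs, hi=0.95, lo=0.05):
    """Illustrative split of binary-classifier probabilities into three
    buckets: confident collision, confident no-collision, and 'unsure'
    (destined for a manual-review folder for active labelling).
    The hi/lo thresholds here are made up for the example."""
    probs = np.asarray(probs)
    collision = np.flatnonzero(probs >= hi)          # pseudo-label 1
    no_collision = np.flatnonzero(probs <= lo)       # pseudo-label 0
    unsure = np.flatnonzero((probs > lo) & (probs < hi))
    return collision, no_collision, unsure

probs = np.array([0.99, 0.01, 0.50, 0.97, 0.30])
c, n, u = triage(probs)
print(list(c), list(n), list(u))  # [0, 3] [1] [2, 4]
```

Images landing in the `unsure` bucket would be copied to a review folder and hand-labelled once enough accumulate, e.g. 100, then folded back into the training set.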
Pseudo-labelling confused me at first, but I can see where its advocates are coming from. It does seem like you're just adding in examples the model already knows about, but the unlabelled test data (features) you're adding are not identical to what's in the training set, so you are including more feature information in your training data, even if it is similar to what's there already...
I think the pseudo labeling + active learning is a good way to go. I am totally fine w/ more manual labeling.
gonna close this for now
@ebgoldstein I uploaded a notebook demonstrating a simple pseudo-labelling workflow (note that it draws from a "data" directory, but I haven't uploaded the labelled images used to train the classifier because I wasn't sure if they had a DOI yet or not?).
We could turn this into a function/class/script to formalise it if we want to go down this path further.