Closed TomasBeuzen closed 3 years ago
ok, it's taken me too long to circle back to this, sorry. I definitely think this is a worthy approach; it seems excellent, esp. for this binary classification task, which looks very learnable.
I think also by iterating over all the webcat images we are going to see some interesting things. I guess there are 2 questions/comments on my mind:
A few thoughts:
I could probably get this up and running soon to create a dataset of say 1-2k images to see how they look.
let me check w/ @somyamohanty re: model architecture. He was experimenting again last week.
I think one aspect to think about is making sure we don't reinforce bias in the training dataset. Hence my question about the threshold: perhaps reducing the threshold introduces samples that are different from the training data. Or we could think about the labels as 'noisy'.
@somyamohanty also brought up the possibility of trying adversarial testing with any model that does pseudo-labeling.
as for architectures: https://github.com/UNCG-DAISY/TinyDuneCollision/tree/master/src
the one we are going w/ currently for the tiny work is here: https://github.com/UNCG-DAISY/TinyDuneCollision/blob/master/src/TinyML_End_to_End.ipynb
though our current hardware implementation might not be restricted to 96px by 96px
Yeah it's a good point about confirmation bias - one form of "regularization" I've seen in the past is to force a certain number of labels per batch, so say, in a batch of 100 images, we must label 10 of them (so the threshold would be dynamic in that sense).
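To make the "force a certain number of labels per batch" idea concrete, here's a minimal sketch (the function name and the confidence measure are my assumptions, not anything from the linked repo): instead of a fixed probability cutoff, we accept the k most confident predictions in each batch, so the effective threshold floats with the batch.

```python
import numpy as np

def pseudo_label_top_k(probs, k=10):
    """Hypothetical sketch: pseudo-label exactly k images per batch, choosing
    the k the binary classifier is most confident about (confidence measured
    as distance of the predicted probability from 0.5). The acceptance
    threshold is therefore dynamic per batch rather than fixed."""
    probs = np.asarray(probs)
    confidence = np.abs(probs - 0.5)        # 0 = totally unsure, 0.5 = certain
    chosen = np.argsort(confidence)[-k:]    # indices of the k most confident
    labels = (probs[chosen] >= 0.5).astype(int)
    return chosen, labels

# toy batch of 100 predicted collision probabilities
rng = np.random.default_rng(0)
batch_probs = rng.uniform(0, 1, size=100)
idx, labels = pseudo_label_top_k(batch_probs, k=10)
print(len(idx))  # always 10, regardless of any fixed cutoff
```

The point is just that every batch contributes the same number of pseudo-labels, so a batch full of "meh" predictions still gets labelled rather than being skipped entirely.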
In this case, I think the pseudo-labelling will work great for getting the obvious collisions and obvious no-collisions. My thinking would be to dump all the unlabelled ("unsure") data in a folder and effectively do "active learning", where we manually label those "very unsure" images at some regular interval, say, whenever there are 100 images in the folder. In this way we get the efficiency of the automated pseudo-labeller, with the hopefully improved accuracy of the manually labelled "unsure" images.
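The triage step above could be sketched like this (thresholds and names are illustrative assumptions, not anything from the notebook): confident predictions get pseudo-labels, and everything in between goes to an "unsure" bucket for manual review.

```python
import numpy as np

def triage(probs, hi=0.95, lo=0.05):
    """Illustrative split of binary-classifier probabilities into three
    buckets: confident collision, confident no-collision, and 'unsure'
    (destined for a manual-review folder for active labelling).
    The hi/lo thresholds here are made up for the example."""
    probs = np.asarray(probs)
    collision = np.flatnonzero(probs >= hi)          # pseudo-label 1
    no_collision = np.flatnonzero(probs <= lo)       # pseudo-label 0
    unsure = np.flatnonzero((probs > lo) & (probs < hi))
    return collision, no_collision, unsure

probs = np.array([0.99, 0.01, 0.50, 0.97, 0.30])
c, n, u = triage(probs)
print(list(c), list(n), list(u))  # [0, 3] [1] [2, 4]
```

Images landing in the `unsure` bucket would be copied to a review folder and hand-labelled once enough accumulate, e.g. 100, then folded back into the training set.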
Pseudo-labelling confused me at first, but I can see where its advocates are coming from. It does seem like you're just adding in examples the model already knows about, but the unlabelled test data (features) you're adding are not identical to what's in the training set, so you are including more feature information in your training data, even if it is similar to what's there already...
I think the pseudo labeling + active learning is a good way to go. I am totally fine w/ more manual labeling.
gonna close this for now
@ebgoldstein I uploaded a notebook demonstrating a simple pseudo-labelling workflow (note that it draws from a "data" directory, but I haven't uploaded the labelled images used to train the classifier because I wasn't sure if they had a DOI yet or not?).
We could turn this into a function/class/script to formalise it if we want to go down this path further.