[Feature][Core] Calculate thresholding confidence using data

Udayraj123 commented 2 years ago

The core logic of OMRChecker revolves around finding the correct separation between Marked and Unmarked bubbles. We want to let the user know if it has been determined confidently.

In the above image, the jumps in the histogram suggest two possible thresholds. In such cases, a confidence metric would help flag poor-quality images.

More references in Rich Visuals section.

Note: this issue is marked with the hacktoberfest label. Follow #hacktoberfest-discussions on Discord for further details.

grgkaran03 commented 2 years ago

Hi, I would like to take up this issue. Can you please tell me the approach to how I can start working on this?

Udayraj123 commented 2 years ago

Hi @grgkaran03, thanks for showing interest. Let's discuss it on Discord, and then you can share a brief summary of the things to do in a comment here. Ping me in the channel mentioned in the description.

Udayraj123 commented 1 year ago

Hi @grgkaran03, any updates/need help with anything?

Udayraj123 commented 1 year ago

This task will be covered by a PR with ongoing work on improving the debugging experience.

Udayraj123 commented 8 months ago

Sharing a sample histogram where the MIN_JUMP configuration seemed to be ineffective

and somehow the global threshold is also too high because the overall image is bright.

Udayraj123 commented 8 months ago

Analysis: The global threshold logic did not work for this q-vals plot because the minimum value was too high. (q-vals is the list of mean pixel values of all bubbles in the OMR template.)

Setting it to 100 does not separate the red and green lines either (ideally, the red line should auto-correct itself to the first large gap).

This happens when there's no sharp jump between two consecutive values in the above histogram.

A confidence metric is needed when there is no clear "first large jump", as a few bubbles near the threshold are then likely to be detected wrongly (unless, of course, a local threshold saves that case).
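To make the "first large jump" idea concrete, here is a minimal sketch. It is not OMRChecker's actual implementation: the function names, the default MIN_JUMP of 25, the fallback threshold of 200 and the confidence formula are all assumptions made for illustration. It also assumes that a lower mean pixel value means a darker (marked) bubble.

```python
from typing import List, Tuple


def first_large_jump_threshold(
    q_vals: List[float], min_jump: float = 25.0, fallback: float = 200.0
) -> Tuple[float, float]:
    """Return (threshold, jump) for the first gap >= min_jump in sorted q_vals.

    Hypothetical helper: if no gap qualifies, the fallback threshold is
    returned together with a jump of 0.0.
    """
    ordered = sorted(q_vals)
    for low, high in zip(ordered, ordered[1:]):
        jump = high - low
        if jump >= min_jump:
            # Place the threshold in the middle of the gap.
            return (low + high) / 2.0, jump
    return fallback, 0.0


def jump_confidence(jump: float, min_jump: float = 25.0) -> float:
    """Hypothetical score in [0, 1]: 0 when no jump was found,
    1 when the jump is at least twice min_jump."""
    return max(0.0, min(1.0, jump / (2.0 * min_jump)))


# Clear separation: four marked bubbles, three unmarked ones.
clear = [60, 62, 65, 70, 180, 185, 190]
print(first_large_jump_threshold(clear))  # (125.0, 110)

# No sharp jump anywhere: the fallback is used and confidence is 0.
ambiguous = [150, 155, 160, 170, 180, 190, 200]
threshold, jump = first_large_jump_threshold(ambiguous)
print(threshold, jump_confidence(jump))  # 200.0 0.0
```

A score like this only captures how wide the separating gap is. It says nothing about individual bubbles that sit close to the threshold.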

For a particular set of images, we can configure the MIN_JUMP parameter to solve this via config.json:

{
  "threshold_params": {
    "MIN_JUMP": 15
  }
}

But reducing MIN_JUMP increases wrong detections for images with shadows or low-contrast shades.
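The trade-off can be illustrated with a small sketch. The helper name `candidate_thresholds` and the sample values are made up for illustration and are not taken from OMRChecker. The sample mimics a sheet with no marked bubbles where a shadow darkens half of them.

```python
from typing import List, Tuple


def candidate_thresholds(
    q_vals: List[float], min_jump: float
) -> List[Tuple[float, float]]:
    """List (threshold, jump) for every gap >= min_jump in sorted q_vals.

    Hypothetical helper: each threshold sits in the middle of its gap.
    """
    ordered = sorted(q_vals)
    return [
        ((low + high) / 2.0, high - low)
        for low, high in zip(ordered, ordered[1:])
        if high - low >= min_jump
    ]


# Illustrative values: three shadowed unmarked bubbles, three bright ones.
shadowed_empty_sheet = [150, 153, 157, 175, 178, 181]

# With a larger MIN_JUMP the shadow edge is ignored.
print(candidate_thresholds(shadowed_empty_sheet, 25))  # []

# With a smaller MIN_JUMP the shadow edge looks like a real separation,
# so the three shadowed bubbles would be read as marked.
print(candidate_thresholds(shadowed_empty_sheet, 15))  # [(166.0, 18)]
```

More than one candidate, or a candidate whose jump barely clears MIN_JUMP, is a hint that the chosen threshold deserves a lower confidence.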

For example, in the above plot, positions 40-50 may contain marked bubbles with low contrast. The local thresholding technique should resolve the issue most of the time, but OMRChecker is less confident about such cases.

The confidence metric should help us identify such cases and potentially find a solution. We can try labelling the questions in the plot itself to gather some insights.

Udayraj123 commented 8 months ago

Added code to support field labels in the intensity plot to understand the ambiguity better.

If, for any single field, the threshold turns out to be completely below the bubble values in that field (despite the field having a marked bubble), then we've probably set a wrong threshold (reduce the confidence metric).
If the field labels are in close vicinity of the threshold (zoomed image), we need to ensure that local thresholding handles those cases (field-wise graphs):
For roll_5 - the global threshold is really close, but just enough to distinguish.
For q52 - empty field - the global threshold and the local threshold align.
For q72 - empty field - the global threshold and the local threshold don't align; the local threshold distinguishes due to the low MIN_JUMP (false positive).
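The field-wise comparison above can be sketched roughly as follows. This is an assumption-laden illustration rather than the code that was added: the function names, the per-field values, the global threshold of 100 and the MIN_JUMP of 15 are hypothetical, chosen only to reproduce the three cases. A bubble is treated as marked when its mean value is below the threshold.

```python
from typing import Dict, List


def local_threshold(
    values: List[float], global_threshold: float, min_jump: float
) -> float:
    """Hypothetical local threshold: the middle of the largest gap in the
    field, or the global threshold if that gap is smaller than min_jump."""
    ordered = sorted(values)
    best_jump, best_mid = 0.0, global_threshold
    for low, high in zip(ordered, ordered[1:]):
        if high - low > best_jump:
            best_jump, best_mid = high - low, (low + high) / 2.0
    return best_mid if best_jump >= min_jump else global_threshold


def disparity_report(
    fields: Dict[str, List[float]], global_threshold: float, min_jump: float
) -> Dict[str, bool]:
    """Map each field label to True when the global and local thresholds
    disagree on which bubbles are marked."""
    report = {}
    for label, values in fields.items():
        local = local_threshold(values, global_threshold, min_jump)
        marked_global = [v < global_threshold for v in values]
        marked_local = [v < local for v in values]
        report[label] = marked_global != marked_local
    return report


# Illustrative mean pixel values per field (not taken from a real scan).
fields = {
    "roll_5": [95, 160, 165, 170],  # one mark, just below the global threshold
    "q52": [150, 155, 158, 160],    # empty field, no large local jump
    "q72": [140, 158, 160, 162],    # empty field, small jump above MIN_JUMP
}
print(disparity_report(fields, global_threshold=100, min_jump=15))
# {'roll_5': False, 'q52': False, 'q72': True}
```

Fields flagged True are the ones where the two thresholds disagree, so they are natural candidates for a reduced confidence score or a manual check.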

Udayraj123 commented 7 months ago

Turns out the confidence metric to show local vs global threshold disparity is already showing results!

In this scan from community samples we see an ambiguous bubble mark (see Q.131):

It was found when looking at the confidence metrics output:

Such bubbles may require human intervention or better tuning to avoid inconsistent output across images.

[Image comparison: Master branch vs. New output]

We've decided to honor the user's intention to mark, so the bubble will now be detected as marked even if it is not fully filled.

Note: If your images contain bad-quality prints, where the printed characters ('B' in the above case) are non-uniformly thick or bold, they may still get detected as marked bubbles.

Udayraj123 / OMRChecker

[Feature][Core] Calculate thresholding confidence using data #39