Paper-Chart-Extraction-Project / ChartExtractor

ChartExtractor uses computer vision to convert images of paper charts to digital data.
https://paper-chart-extraction-project.github.io/ChartExtractor/
GNU General Public License v3.0

Clustering Printed Numbers to Form the Legend #29

Open RyanDoesMath opened 1 month ago

RyanDoesMath commented 1 month ago

Problem

The legend of the blood pressure section will provide an easy way to find the timestamp and value of blood pressure recordings. The issue is that the best way to encode the data for training is to label each digit (minimizing the number of additional classes during training). But single digits aren't the final product: we need to cluster the digits together to form multi-digit numbers, then use their position to give them more meaningful names (ex: the second '20' to appear in the legend on the top truly is '1:20', while the '20' on the y-axis is '20 mmHg').

[image: blood_pressure_and_heart_rate]

Proposed Solution

This problem will need to be solved with clustering.
There are lots of clustering algorithms out there, but all of them accept data as input and return a mapping between the supplied data and n categories. Here, the supplied data will be the single-digit detections, and the clusters we want out are the multi-digit numbers.
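
To make the goal concrete, here is a minimal sketch (an illustration, not the required approach) that groups x-sorted digit detections into multi-digit numbers by horizontal gap size. Representing each detection as an (x_center, digit) pair is an assumption made just for this sketch:

def group_digits(digits: list[tuple[float, str]], max_gap: float) -> list[str]:
    """Groups single digits, sorted by x position, into multi-digit numbers."""
    if not digits:
        return []
    digits = sorted(digits)
    numbers, current = [], [digits[0]]
    for prev, cur in zip(digits, digits[1:]):
        if cur[0] - prev[0] <= max_gap:
            current.append(cur)  # close enough: same multi-digit number
        else:
            numbers.append("".join(d for _, d in current))
            current = [cur]
    numbers.append("".join(d for _, d in current))
    return numbers

# group_digits([(0.10, '1'), (0.12, '2'), (0.30, '5')], max_gap=0.05) -> ['12', '5']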

Pull Requests

You'll need to submit 3 pull requests.

PR #1: Clustering Experiment (Week 1 and 2)

This pull request will need to be submitted to the ChartExtractorSupplements repository. This repository is for experiments that show proofs of concept for improvements to ChartExtractor, as well as utilities that make working with the data surrounding this project easier.

Create a new directory under experiments/clustering/. Then, write a Jupyter notebook where you use clustering algorithms to solve this problem. You should try at least two algorithms and compare them (more is better).

The models we use to find these printed digits are pretty good, but they're not bulletproof. There will be missed detections (where a number appears on the page but has no bounding box), erroneous detections (where a non-number has been detected as a number), and misclassifications (where one number is mistaken for another). You should run additional experiments examining how the algorithms behave under these kinds of errors.

The algorithm you pick should be resilient to a modest amount of error. You may assume that there will be, at most, 5% of each error type (missing, erroneous, and misclassified) on a worst-case chart.
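
One way to simulate these error types in the notebook is sketched below; the (x, y, label) tuples are a stand-in for the project's Detection objects, an assumption made for the sketch:

import random

DIGITS = [str(i) for i in range(10)]

def perturb(
    detections: list[tuple[float, float, str]], rate: float = 0.05, seed: int = 0
) -> list[tuple[float, float, str]]:
    """Simulates missed, misclassified, and erroneous detections at ~rate each."""
    rng = random.Random(seed)
    out = []
    for x, y, label in detections:
        if rng.random() < rate:
            continue  # missed detection: the digit loses its bounding box
        if rng.random() < rate:
            label = rng.choice([d for d in DIGITS if d != label])  # misclassified
        out.append((x, y, label))
    # Erroneous detections: random positions with random digit labels.
    n_fake = int(rate * len(detections))
    out.extend((rng.random(), rng.random(), rng.choice(DIGITS)) for _ in range(n_fake))
    return out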

PR #2: Meaning Imputation Experiment (Week 1 and 2)

This pull request will also be under ChartExtractorSupplements.

Getting the digits clustered is not the end, though. You'll need to write an algorithm that applies meaning to the clusters. The clusters in the blood pressure legend (the ones that tell how high or low blood pressure is in mmHg) should be called x_mmhg, where x is the pressure reading (ex: 50_mmhg, 130_mmhg, etc.), and the clusters in the timestamp legend (the ones that run horizontally and mark 5-minute epochs of time) should be called x_mins, where x is the number of minutes.

You should consider what might happen when a cluster is totally missing, or when a few erroneous detections form a 'fake' cluster.
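
A hypothetical sketch of one way to impute meaning while guarding against fake clusters: keep only clusters whose value appears in the expected legend sequence, and treat absent expected values as missing clusters. The 30-200 mmHg range below is an assumption for illustration, not the chart's actual legend:

EXPECTED_MMHG = list(range(30, 201, 10))  # assumed pressure legend values

def name_pressure_clusters(clusters: list[tuple[float, int]]) -> dict[str, float]:
    """Maps (y_position, cluster_value) pairs to {'<value>_mmhg': y_position}."""
    named = {}
    for y_position, value in clusters:
        if value in EXPECTED_MMHG:
            named[f"{value}_mmhg"] = y_position
        # else: likely a 'fake' cluster formed by erroneous detections.
    missing = [v for v in EXPECTED_MMHG if f"{v}_mmhg" not in named]
    # Missing clusters could be imputed from neighbors, since legend spacing is regular.
    return named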

PR #3: Implementation (Week 3)

This pull request will be under the ChartExtractor repository.

Finally, you'll implement your solution in a new directory src/extraction/blood_pressure_clustering/ (or, if you only plan on having one file, src/extraction/blood_pressure_clustering.py). You should expose a single function find_legend_locations_from_digit_landmarks (or something similar if you think of a better name). This function will take a list of Detection objects and return a dictionary where the keys are x_mins and y_mmhg for each x and y in the legend's timestamp and pressure sections respectively, and the values are single Point objects with $(x, y) \in [0, 1]^2$ being the fraction of the page from the left and top at which the landmarks sit (ex: (0, 0) is the top left, (0.5, 0.5) is the center, (1, 0) is the top right, etc.).

Your solution should have docstrings for every module and function, be formatted properly, and linted.
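
For reference, the interface could look like the stub below (signature only; Detection and Point are the project's types, and the body is deliberately left unimplemented):

from typing import Dict, List

def find_legend_locations_from_digit_landmarks(
    detections: List["Detection"],
) -> Dict[str, "Point"]:
    """Maps legend names like '30_mins' or '120_mmhg' to normalized Points.

    Each Point's (x, y) lies in [0, 1]^2: the fraction of the page width and
    height from the left and top, respectively.
    """
    raise NotImplementedError  # cluster the digits, then impute legend meaning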

Helpful Information

Here are some hints and helpful givens:

1) Landmarks tend to be very stable from image to image due to the homography correction. They aren't foolproof, but if something is even a small distance away, it can easily be detected as incorrect.
2) I would recommend the first step be separating the timestamp digits from the pressure digits. From there, the clustering is actually one dimensional.
3) You might not even need the classifications of the boxes themselves, due to the structure of the legend.

charbelmarche33 commented 1 month ago

@RyanDoesMath Is there an already complete way to isolate this area of the chart? I realize I might be reinventing the wheel here, but I'm wondering if we already do this elsewhere.

[image: the area of the chart in question]

What I think we need to do is isolate that region and extract the bounding boxes within it to do the clustering on. Currently, I'm using cv2.selectROI to do this, yielding the below:

[image: region selected with cv2.selectROI]

RyanDoesMath commented 1 month ago

This was basically the solution in the previous program, except it was done pre-detection.
I think it could be a good solution, but I would need to know the group's thinking before I can say anything specific (there might be a simpler way). We can talk about this and some other small points at Tuesday's meeting.

charbelmarche33 commented 1 month ago

Sounds good. I continued with the above method to perform clustering using k-means and obtained 99.75% accuracy on timestamps and 100% accuracy on mmHg/heart rate given the data provided by email. I did have some questions about this, however, that I wanted to discuss (figured I'd put them here for reference, but we can talk about them Tuesday):

Labels for clusters

The clusters in blood pressure legend (the ones that tell how high or low blood pressure is in mmHg) should be called x_mmhg, where x is the pressure reading (ex: 50_mmhg, 130_mmhg, etc), and the timestamp legend (the ones that run horizontally that mark 5 minute epochs of time), should be called x_mins, where x is the number of minutes.

With regard to this, what exactly are we renaming to be called 50_mmhg (for example)? I'm currently creating the two sets of clusters and doing two things with them:

This would be easy to change to the formatting you probably want; I'm just not sure where I'm adding those labels!

Assessing models

The one you pick should be resilient to a modest amount of error. You may assume that there will be, at most, 5% of each (missing, erroneous, and misclassified) on a worst case chart. You should consider what might happen in the case where a cluster is totally missing, or a few erroneous detections form a 'fake' cluster.

This is likely a skill issue on my part, but I don't quite understand how to test this. Would it be sufficient to just drop some bounding boxes and see how the model responds? Or add some random bounding boxes and see if the model appropriately ignores them? I guess my question boils down to: what is the desired behavior here, and how do we test for it?

As I said, the current method I'm using is k-means, and as part of that you predetermine the number of clusters you expect for both time and mmHg/HR. This technique can obviously fall short if we want something more flexible that can determine, "Hey, we expect 42 timestamps, but we only see 41 good ones (maybe because one wasn't detected), so let's not split a cluster in half for no reason." An example of this not happening is in the screenshot below. Once I know what the expected behavior should be, I can adjust the method I'm using to account for it.

See how the 0 was not detected because the time was written over it, and thus the 35 cluster was split in half. [image: RC_0005_intraoperative]

How to select the region of interest

Discussed above briefly.

RyanDoesMath commented 1 month ago

Looks good so far!

Looking for Clusters

To get the bp section algorithmically, the easiest way I can think of would be to filter out any non-digit detections, then select only digits between 0.2 and 0.8 on the y-axis, then crop from there.

Something like:

from operator import attrgetter
from typing import List

from PIL import Image

# Detection and BoundingBox come from ChartExtractor's own modules.

def crop_bp_section(
    image: Image.Image,
    obj_detections: List[Detection],
    buffer_pixels: int = 5,
) -> Image.Image:
    """Crops the blood pressure section out of an image of a chart.

    Args:
        image (Image.Image):
            The chart photograph.
        obj_detections (List[Detection]):
            The object detections made by the computer vision models to identify landmarks and
            handwritten symbols.
        buffer_pixels (int):
            An optional integer that specifies the number of pixels around the digit detections to
            'zoom out' by. Defaults to 5 pixels.

    Returns:
        A cropped out blood pressure section.
    """
    # Get annotations from detections and filter out anything that isn't a bounding box.
    annotations = [det.annotation for det in obj_detections]
    bboxes: List[BoundingBox] = list(filter(lambda ann: isinstance(ann, BoundingBox), annotations))

    digit_categories: List[str] = [str(i) for i in range(10)]
    im_height = image.size[1]

    # Filter bounding boxes to those which are within the approximate region and are digits.
    bp_legend_digits: List[BoundingBox] = list(filter(
        lambda bb: (
            bb.top / im_height > 0.2
            and bb.top / im_height < 0.8
            and bb.category in digit_categories
        ),
        bboxes,
    ))

    # The tight box around the legend digits, padded by buffer_pixels on each side.
    bp_legend_coordinates = (
        min(bp_legend_digits, key=attrgetter("left")).left - buffer_pixels,
        min(bp_legend_digits, key=attrgetter("top")).top - buffer_pixels,
        max(bp_legend_digits, key=attrgetter("right")).right + buffer_pixels,
        max(bp_legend_digits, key=attrgetter("bottom")).bottom + buffer_pixels,
    )
    return image.crop(bp_legend_coordinates)

You may not even need this solution, though. If you look at a histogram of only the y values, you should see a large stack of digit detections that represents the timestamps; likewise, a histogram of only the x values will show a large stack that represents the bp/hr legend. The mode of a KDE of these distributions should give the approximate point you are looking for.
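
A minimal sketch of that KDE idea, assuming the digit centers have already been extracted as plain floats (scipy's gaussian_kde does the smoothing):

import numpy as np
from scipy.stats import gaussian_kde

def kde_mode(values: list[float], grid_size: int = 512) -> float:
    """Approximates the mode of a 1D sample as the argmax of a Gaussian KDE."""
    kde = gaussian_kde(values)
    grid = np.linspace(min(values), max(values), grid_size)
    return float(grid[np.argmax(kde(grid))])

# kde_mode of the digits' y centers ~ the timestamp row's vertical position;
# kde_mode of the digits' x centers ~ the bp/hr legend's horizontal position.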

Assessment

To assess the robustness of the solution, I would recommend randomly adding and removing detections, up to 5% of the total detections. Your point that this will sometimes cause whole clusters to disappear is well noted; your solution needs to be able to deal with this because, given enough chart images, it will happen.

What I recommend is using the silhouette score while looping over a range of cluster counts (you can determine the low end of the range by deriving a worst-case scenario based on 5% of the detections being missing, and the high end is just the number of clusters that should be there if all digits are correct).
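
A sketch of that loop, using scikit-learn's KMeans and silhouette_score on the one-dimensional coordinates; the k_min/k_max bounds come from the worst-case reasoning above:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def best_cluster_count(coords: np.ndarray, k_min: int, k_max: int) -> int:
    """Returns the k in [k_min, k_max] with the highest silhouette score."""
    X = coords.reshape(-1, 1)  # 1D positions as a column vector
    scores = {}
    for k in range(k_min, k_max + 1):  # note: silhouette needs k >= 2
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    return max(scores, key=scores.get)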

For the assessment, make sure you are not just ensuring the legend entries exist, but that they are in the correct location as well. I will post another dataset later today or tomorrow that will make this easier.

Post-Script on Objects

As an aside, I would suggest using more of the provided methods in BoundingBox rather than manipulating lists of [x1, y1, x2, y2] values. BoundingBox already has methods for finding the center and for reading from and writing to YOLO format, and it validates a box when you construct it.

I looked over the Jupyter notebook, and I think a lot of the work done repeats functionality that is already in BoundingBox, or could be made much easier with BoundingBox as a data structure. Moving forward, see if you can think of things in terms of filtering, mapping, and reducing lists of BoundingBox objects.
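
For example, the pipeline can stay at the level of BoundingBox objects instead of raw coordinate lists. The category and top attributes below match the snippet above; center is mentioned in this thread, but its exact name on BoundingBox is an assumption here:

# Filter: keep only digit boxes in the approximate legend band.
digit_boxes = [bb for bb in bboxes if bb.category in digit_categories]
legend_boxes = [bb for bb in digit_boxes if 0.2 < bb.top / im_height < 0.8]
# Map: reduce each box to the single coordinate the 1D clustering needs.
y_centers = [bb.center[1] for bb in legend_boxes]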

charbelmarche33 commented 1 month ago

Next Steps

Put together a suggested roadmap and checklist @mattbeck1 @hvalenty. Feel free to adjust as you see fit.

charbelmarche33 commented 2 weeks ago

@RyanDoesMath FYI, the bp_and_hr_cluster_locations.json file has an extra "190_mins" in it. Also, there are 20 sheets in the bp_and_hr_cluster_locations file, while the data we have is only 19 sheets.