SuffolkLITLab / FormFyxer

A tool for learning about and pre-processing forms
MIT License

Directly measure whitespace #85

Open nonprofittechy opened 1 year ago

nonprofittechy commented 1 year ago

We use field density as a proxy, but whitespace should be measured and compared to some ideal number.

It's not clear what the ideal amount of whitespace is.

nonprofittechy commented 1 year ago

We think that just measuring the number of white pixels on the page won't be very useful. What matters is something more like the space between groupings of black text on the page. A raw white-pixel count would give higher readability scores to very lightweight fonts, and that's probably not accurate.
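To illustrate the font-weight concern, here is a minimal sketch, not anything FormFyxer does today. The names `white_pixel_ratio` and `synthetic_page` and the threshold of 200 are assumptions made up for this example. It builds two synthetic pages with the same line layout but different stroke thickness and compares their white-pixel fractions.

```python
import numpy as np


def white_pixel_ratio(gray, threshold=200):
    """Fraction of pixels at or above `threshold` (hypothetical helper)."""
    gray = np.asarray(gray)
    return float((gray >= threshold).mean())


def synthetic_page(stroke_height):
    """White 100x100 page with 5 dark 'text lines' starting every 20 rows.

    Purely illustrative: each line is a solid bar `stroke_height` rows tall.
    """
    page = np.full((100, 100), 255, dtype=np.uint8)
    for top in range(10, 100, 20):
        page[top:top + stroke_height, 10:90] = 0
    return page


# Same layout, different stroke weight.
light = white_pixel_ratio(synthetic_page(stroke_height=2))
bold = white_pixel_ratio(synthetic_page(stroke_height=6))
print(light, bold)
```

The thin-stroke page gets the higher ratio even though the lines sit in exactly the same places, which is the bias described above.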

One idea might be to measure the space between lines.
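One possible way to measure the space between lines is a horizontal projection: count dark pixels per row, then record the height of each blank band that sits between two rows containing ink. The sketch below assumes a grayscale page as a 2D array. The function name `line_gaps`, the threshold of 200, and the `min_ink_pixels` parameter are all hypothetical choices for illustration.

```python
import numpy as np


def line_gaps(gray, threshold=200, min_ink_pixels=1):
    """Heights (in rows) of blank bands between rows that contain ink.

    Hypothetical helper: a row "has ink" if at least `min_ink_pixels`
    of its pixels are darker than `threshold`. Top and bottom margins
    are ignored; only gaps between inked rows are returned.
    """
    gray = np.asarray(gray)
    ink_per_row = (gray < threshold).sum(axis=1)
    has_ink = ink_per_row >= min_ink_pixels

    gaps = []
    run = 0             # length of the current blank band
    seen_text = False   # becomes True once the first inked row is found
    for row_has_ink in has_ink:
        if row_has_ink:
            if seen_text and run > 0:
                gaps.append(run)
            seen_text = True
            run = 0
        elif seen_text:
            run += 1
    return gaps


# Synthetic page: three dark bands standing in for lines of text.
page = np.full((40, 50), 255, dtype=np.uint8)
page[5:10, 5:45] = 0
page[15:20, 5:45] = 0
page[30:35, 5:45] = 0

gaps = line_gaps(page)
print(gaps)
```

The resulting gap heights could then be summarized (for example by their median) and compared against whatever target spacing turns out to be appropriate. A real scan would likely need deskewing and a higher `min_ink_pixels` to tolerate noise.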

nonprofittechy commented 1 year ago

I asked ChatGPT for a naive approach with OpenCV, and this seems to make sense:

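import cv2


def analyze_whitespace_with_opencv(image, threshold=200, min_line_length=100):
    # Inverted binarization: pixels at or below `threshold` become 255 (foreground).
    gray_image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    binary_image = cv2.threshold(gray_image, threshold, 255, cv2.THRESH_BINARY_INV)[1]

    # Outer contours of the foreground, i.e. of the dark (inked) regions.
    contours, _ = cv2.findContours(binary_image, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

    total_whitespace_area = 0
    total_page_area = image.shape[0] * image.shape[1]

    for contour in contours:
        x, y, w, h = cv2.boundingRect(contour)
        # Only bounding boxes taller than `min_line_length` pixels are counted.
        if h > min_line_length:
            whitespace_area = w * h
            total_whitespace_area += whitespace_area
            # Draw the counted box on the input image (modifies it in place).
            cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)

    # Fraction of the page covered by the counted boxes (boxes may overlap).
    whitespace_ratio = total_whitespace_area / total_page_area

    return image, total_whitespace_area, whitespace_ratio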