Open nonprofittechy opened 1 year ago
We think that just measuring the number of white pixels on the page won't be very useful. What matters is more like the space between groupings of black text on the page. A raw white-pixel count would give higher readability scores to very light-weight fonts, which is probably not accurate.
One idea might be to measure the space between lines.
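One naive way to sketch "space between lines" is a horizontal projection profile: collapse each row to an ink count and measure the blank runs between consecutive ink rows. This is only an illustrative sketch (the function name and the toy binary array are made up, and it assumes a deskewed, already-binarized page where nonzero pixels are ink):

```python
import numpy as np

def interline_gaps(binary):
    """Return the heights (in rows) of blank gaps between consecutive
    rows that contain ink. `binary` is a 2D array, nonzero = ink."""
    rows_with_ink = np.flatnonzero((binary > 0).any(axis=1))
    if rows_with_ink.size == 0:
        return []
    # Adjacent ink rows have diff 1; a diff of d > 1 means d - 1 blank rows.
    diffs = np.diff(rows_with_ink)
    return [int(d - 1) for d in diffs if d > 1]

# Toy page: ink on rows 0-1, 5, and 8 -> gaps of 3 and 2 blank rows
page = np.zeros((10, 5))
page[0:2, :] = 1
page[5, 2] = 1
page[8, 0] = 1
print(interline_gaps(page))  # [3, 2]
```

The mean or minimum of these gap heights, relative to the text-line height, could then feed a leading/line-spacing score; leading and trailing page margins are deliberately excluded because only gaps *between* ink rows are counted.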
I asked ChatGPT for a naive approach with OpenCV, and this seems to make sense (cleaned up below: the original snippet summed the bounding boxes of *dark* contours but labeled that sum "whitespace", and filtered on height where the line-length filter should use width):
```python
import cv2

def analyze_whitespace_with_opencv(image, threshold=200, min_line_length=100):
    # Binarize with ink as foreground: THRESH_BINARY_INV turns dark pixels white
    gray_image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    binary_image = cv2.threshold(gray_image, threshold, 255, cv2.THRESH_BINARY_INV)[1]

    # Each external contour is a connected blob of dark (text) pixels
    contours, _ = cv2.findContours(binary_image, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

    total_text_area = 0
    total_page_area = image.shape[0] * image.shape[1]
    for contour in contours:
        x, y, w, h = cv2.boundingRect(contour)
        # Keep only blobs wide enough to plausibly be a line of text
        if w > min_line_length:
            total_text_area += w * h
            cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)

    # The contours cover ink, so whitespace is the complement of the text area
    whitespace_ratio = 1 - total_text_area / total_page_area
    return image, total_text_area, whitespace_ratio
```
We currently use field density as a proxy, but whitespace should be measured directly and compared against some ideal value. It's not clear yet what the ideal amount of whitespace is.
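Once a whitespace ratio is measured, one way to turn it into a score is to reward values inside a target band and penalize distance outside it. The band below is a placeholder guess, not an established value, and the function name is hypothetical:

```python
def whitespace_score(ratio, ideal_low=0.45, ideal_high=0.60):
    """Map a measured whitespace ratio (0..1) to a 0..1 readability score.
    The ideal band defaults are placeholders, not researched values."""
    if ideal_low <= ratio <= ideal_high:
        return 1.0
    # Linear falloff: lose all credit half a ratio-unit outside the band
    dist = (ideal_low - ratio) if ratio < ideal_low else (ratio - ideal_high)
    return max(0.0, 1.0 - dist / 0.5)

print(whitespace_score(0.50))  # 1.0
print(whitespace_score(0.95))  # 0.3 (0.35 past the band, linear falloff)
```

The band endpoints could later be fit against pages that users rate as readable, rather than hard-coded.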