Improve text extraction for complex backgrounds

Description

This PR improves text extraction on complex backgrounds, such as images or gradients, where the current method struggles to extract text accurately. The change preprocesses the image before passing it to pytesseract for OCR.

Summary

Added a new function preprocess_image that takes an image as input and applies the following preprocessing steps:
- Convert the image to grayscale
- Apply a Gaussian blur to reduce noise
- Apply adaptive thresholding to convert the image to black and white
- Dilate the image to join separated parts of the text
Modified the extract_text function to call the preprocess_image function before feeding the image to pytesseract for text extraction.

These changes aim to improve extraction accuracy when text is overlaid on a busy background. Preprocessing makes the text stand out from the background, which should give pytesseract cleaner input and make the extracted text more reliable.

Fixes #1.

To check out this PR branch, run the following command in your terminal:

git checkout sweep/improve-text-extraction

