This PR improves text extraction for complex backgrounds, such as images or gradients, where the current method struggles to extract the text accurately. The change preprocesses the image before passing it to pytesseract for OCR.
Summary
Added a new function preprocess_image that takes an image as input and applies the following preprocessing steps:
Convert the image to grayscale
Apply a Gaussian blur to reduce noise
Apply adaptive thresholding to convert the image to black and white
Dilate the image to join separated parts of the text
Modified the extract_text function to call the preprocess_image function before feeding the image to pytesseract for text extraction.
These changes are intended to improve extraction accuracy when the text is overlaid on a busy background. Preprocessing makes the text stand out more against the background, which should give pytesseract cleaner input and produce more reliable extracted text.
Fixes #1.
To check out this PR branch, run the following command in your terminal:
git checkout sweep/improve-text-extraction
🎉 Latest improvements to Sweep:
Use Sweep Map to break large issues into smaller sub-issues, perfect for large tasks like "Sweep (map): migrate from React class components to function components"
Get Sweep to format before committing! Check out Sweep Sandbox Configs to set it up.
We released a demo of our chunker, where you can find the corresponding blog and code.
💡 To get Sweep to edit this pull request, you can:
Leave a comment below to get Sweep to edit the entire PR
Leave a comment in the code to get Sweep to modify only that file
Edit the original issue to get Sweep to recreate the PR from scratch