CDCgov / IDWA

Intelligent Data Workflow Automation
Apache License 2.0

Spike: Explore techniques we can use to improve the quality of our text extraction #135

Closed arinkulshi-skylight closed 3 weeks ago

arinkulshi-skylight commented 2 months ago

Identify techniques to improve the quality of our text extraction and paragraph extraction:

Some ideas include:

  1. Sharpening the resolution
  2. Noise Removal
  3. Binarization
  4. Dilation and Erosion
  5. Edge detection
  6. Increasing contrast
  7. Removing the slashes/boxes/lines from the segments
  8. Splitting our segments into smaller chunks (line by line or word by word)
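As a rough illustration of idea 8, here is a minimal sketch of line-by-line splitting using a horizontal projection profile. It assumes the segment has already been binarized so that text pixels are nonzero and the background is 0. The function name `split_into_lines` and the `min_ink` parameter are hypothetical and not part of the existing codebase.

```python
import numpy as np


def split_into_lines(binary, min_ink=1):
    """Find text-line row ranges in a binarized segment.

    Assumption: `binary` is a 2-D array where text pixels are nonzero
    and background pixels are 0. Returns (top, bottom) row ranges with
    `bottom` exclusive.
    """
    # Count the "ink" pixels in every row (the horizontal projection profile).
    ink_per_row = np.count_nonzero(binary, axis=1)
    has_ink = ink_per_row >= min_ink

    lines = []
    start = None
    for row, inked in enumerate(has_ink):
        if inked and start is None:
            start = row                 # a text line begins
        elif not inked and start is not None:
            lines.append((start, row))  # a text line ends at a blank row
            start = None
    if start is not None:
        lines.append((start, len(has_ink)))  # line runs to the bottom edge
    return lines


# Synthetic segment: two blocks of "text" separated by blank rows.
segment = np.zeros((12, 20), dtype=np.uint8)
segment[1:4, 2:18] = 255
segment[7:11, 2:15] = 255
print(split_into_lines(segment))  # [(1, 4), (7, 11)]
```

Each returned range can be used to crop the segment (`segment[top:bottom]`) before passing it to OCR. Real scans would likely need a `min_ink` above 1 so that stray noise pixels do not merge or create lines.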

Write up a one-page doc or slides outlining each technique and its potential effectiveness at improving text extraction.

Some tools, such as OpenCV or PyTorch, might provide these as utility functions; we can look into that as part of the spike.

bora-skylight commented 1 month ago

@jonchang I'm imagining the outcome of this ticket to be a recommendation on at least one technique that we should implement in our product. Does that sound correct to you? Given that there are so many options outlined above, I want to clarify that the expectation here is not to research every single one of them, but to move towards implementation for the first one that makes any measurable difference. Otherwise, I think this becomes a huge ticket.

What are your thoughts?
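One possible way to define "measurable difference" is character error rate (CER) against a hand-labeled expected string, compared before and after a preprocessing step. This is only a sketch: the metric choice, the function names, and the sample strings below are assumptions for illustration, not real results from our pipeline.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(expected, extracted):
    """Edit distance normalized by the length of the expected text."""
    if not expected:
        return 0.0 if not extracted else 1.0
    return edit_distance(expected, extracted) / len(expected)


# Made-up example strings, not real OCR output.
expected = "Patient Name"
baseline = "Pat1ent Nane"       # hypothetical output without preprocessing
preprocessed = "Patient Nane"   # hypothetical output with one technique applied

print(round(character_error_rate(expected, baseline), 3))      # 0.167
print(round(character_error_rate(expected, preprocessed), 3))  # 0.083
```

A technique would count as helping if it lowers the average CER across a small labeled set of segments.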

bora-skylight commented 1 month ago

@jonchang can you please provide an update here? Should this be transitioned to the next sprint?

jonchang commented 3 weeks ago

Spike research doc - https://docs.google.com/document/d/1rDbFUSYBAFLQ77HgOuxajwZAJ_2ZhEXAfOOYFwiz08Y/edit?usp=sharing