elastic / data-extraction-service

Add image content extraction #34

Open navarone-feekery opened 7 months ago

navarone-feekery commented 7 months ago

Problem Description

Apache Tika is currently our only extraction tool, and as we use it today it does not extract text from images.

Proposed Solution

If we integrate Tesseract OCR (or a similar tool), we can add image content extraction to the extraction service.

Alternatives

Other OCR tools are acceptable too, provided we evaluate them first.

Additional Context

Image content should be extractable from the same endpoint that serves Tika extraction, /extract_text. That would require adding a param for extraction type so we can differentiate between Tika and the image extractor. The response format should be identical for both.
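
As a rough illustration of the proposed param, here is a minimal sketch of how a client might build the request URL. The param name `extraction_type`, its values `tika` and `image`, and the base URL in the usage note are all assumptions made for this example; none of them is an existing part of the service.

```python
from urllib.parse import urlencode

# Hypothetical values for the proposed extraction-type param.
ALLOWED_TYPES = {"tika", "image"}


def build_extract_url(base_url: str, extraction_type: str = "tika") -> str:
    """Build an /extract_text URL carrying the (hypothetical) extraction type.

    Defaulting to "tika" keeps existing callers working unchanged.
    """
    if extraction_type not in ALLOWED_TYPES:
        raise ValueError(f"unsupported extraction type: {extraction_type!r}")
    query = urlencode({"extraction_type": extraction_type})
    return f"{base_url.rstrip('/')}/extract_text?{query}"
```

Calling `build_extract_url("http://localhost:8090", "image")` would target the image extractor, while omitting the second argument falls back to Tika. Since the response format is meant to be identical, callers would not need separate parsing logic per extractor.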

seanstory commented 6 months ago

Fun fact, Tika plays nice with Tesseract, so we wouldn't need another service or anything for this. See: https://cwiki.apache.org/confluence/display/TIKA/TikaOCR
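
To make that concrete, here is a minimal sketch of preparing an OCR request against a standalone Tika Server. It assumes Tesseract is installed where Tika runs and that the server listens on its default port 9998; the `X-Tika-OCRLanguage` header is the one described on the linked TikaOCR page. The helper name `build_tika_ocr_request` is made up for this example, and the sketch talks to Tika directly rather than through this service.

```python
from urllib.request import Request


def build_tika_ocr_request(
    image_bytes: bytes,
    content_type: str = "image/png",
    language: str = "eng",
    base_url: str = "http://localhost:9998",  # assumed Tika Server address
) -> Request:
    """Prepare (but do not send) a PUT request asking Tika to OCR an image."""
    headers = {
        "Content-Type": content_type,
        "Accept": "text/plain",
        # Tesseract language hint, passed through by Tika Server.
        "X-Tika-OCRLanguage": language,
    }
    return Request(
        f"{base_url.rstrip('/')}/tika",
        data=image_bytes,
        headers=headers,
        method="PUT",
    )
```

Passing the returned object to `urllib.request.urlopen` would send it; the response body should then contain whatever text Tesseract recognised, if OCR is available on the server.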