Problem Description
Apache Tika is currently our only extraction tool, and it does not support image content extraction.
Proposed Solution
Integrating Tesseract OCR (or a similar OCR engine) would let us add image content extraction to the extraction service.
Alternatives
Other OCR tools are also acceptable, provided we evaluate them first.
Additional Context
Image extraction should be available from the same endpoint that serves Tika extraction, `/extract_text`. This would require adding a parameter for the extraction type so that we can differentiate between Tika and the image extractor. The response format should be identical for both.
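
A minimal sketch of how the extraction-type parameter could select a backend while keeping a single response shape. Everything named here is an assumption for illustration: the parameter name `extraction_type`, its values `"tika"` and `"ocr"`, the `extracted_text` response key, and both extractor functions, which are stand-ins rather than real Tika or Tesseract calls.

```python
from typing import Callable, Dict

# Hypothetical extractor signature: raw file bytes in, extracted text out.
Extractor = Callable[[bytes], str]


def tika_extract(data: bytes) -> str:
    # Stand-in: the real service would forward the bytes to Apache Tika.
    return data.decode("utf-8", errors="replace")


def ocr_extract(data: bytes) -> str:
    # Stand-in: a real implementation would run an OCR engine such as
    # Tesseract on the image bytes and return the recognized text.
    return "<text recognized by OCR>"


# Hypothetical mapping from the new request parameter to a backend.
EXTRACTORS: Dict[str, Extractor] = {
    "tika": tika_extract,
    "ocr": ocr_extract,
}


def extract_text(data: bytes, extraction_type: str = "tika") -> dict:
    """Dispatch to the requested backend and return one response shape."""
    try:
        extractor = EXTRACTORS[extraction_type]
    except KeyError:
        raise ValueError(f"unknown extraction_type: {extraction_type!r}")
    # Both backends produce the same response structure (assumed key name).
    return {"extracted_text": extractor(data)}
```

Defaulting `extraction_type` to `"tika"` would keep existing callers of `/extract_text` working unchanged, and evaluating a different OCR tool would only mean swapping the function registered under `"ocr"`.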