dadoonet / fscrawler

Elasticsearch File System Crawler (FS Crawler)
https://fscrawler.readthedocs.io/
Apache License 2.0

Where's the HOCR output #1496

Open vdende opened 2 years ago

vdende commented 2 years ago

I have set the output_type to hocr.

But where can I find it? I expected the output to be stored somewhere, and the Tesseract documentation says this is possible.

vdende commented 1 year ago

Hi @dadoonet, can someone follow up on this? We need to send the OCR text to Elasticsearch and also store the 'hocr' output. According to the Tesseract documentation, this is possible by adding 'hocr' at the end of the command: https://tesseract-ocr.github.io/tessdoc/Command-Line-Usage.html
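
To make the request concrete, here is a minimal sketch of the command-line form described above, driven from Python. It calls Tesseract directly and says nothing about how FSCrawler or Tika invoke it. The helper names and the file names are hypothetical, and it assumes a recent Tesseract on the PATH that writes the result next to the output base with a .hocr extension.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def build_hocr_command(image: str, output_base: str) -> List[str]:
    """Build a Tesseract command that ends with the 'hocr' config name."""
    # CLI shape: tesseract <image> <outputbase> [configfile...]
    # The trailing 'hocr' asks for hOCR output instead of plain text.
    return ["tesseract", image, output_base, "hocr"]


def run_hocr(image: str, output_base: str) -> Path:
    """Run Tesseract and return the path where the hOCR file is expected."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract was not found on PATH")
    subprocess.run(build_hocr_command(image, output_base), check=True)
    # Assumption: Tesseract appends the .hocr extension to the output base.
    return Path(output_base + ".hocr")
```

Calling run_hocr("scan.png", "scan") with a hypothetical scan.png should leave a scan.hocr file on disk. That file on disk is what I would like FSCrawler to keep or index alongside the extracted text.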

dadoonet commented 1 year ago

I answered in https://github.com/dadoonet/fscrawler/discussions/1594

Let me know :)

vdende commented 1 year ago

I answered there as well 😀

vdende commented 1 year ago

Hi @dadoonet. Any news on this topic? Your last remark was:

"I think we need to see how Tika supports this option and if something is needed in FSCrawler to enable this."