Text search

kirillt commented 2 years ago

It should be possible to filter documents by the presence of a query word in their text content. To support this, we need to implement:

Text-layer extraction from PDF and other text-based documents. This must happen during folder indexing.
A new button or menu option on the Resources Grid screen that displays a text input box. The input string should be searched for in the text layers of all resources, and the matching resources must be displayed.
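
To make the two pieces above concrete, here is a minimal sketch in Python, not the actual ARK-Navigator code. It assumes the text layer has already been extracted for each resource during indexing. The names `TextIndex`, `add` and `search`, as well as the sample file names, are hypothetical and chosen only for illustration. Matching is a plain case-insensitive substring test.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class TextIndex:
    """Hypothetical index: maps a resource id to its extracted text layer."""

    texts: dict[str, str] = field(default_factory=dict)

    def add(self, resource_id: str, text: str) -> None:
        # Called once per resource during folder indexing.
        # Text is stored lowercased so that search is case-insensitive.
        self.texts[resource_id] = text.lower()

    def search(self, query: str) -> list[str]:
        # A blank query matches nothing rather than everything.
        needle = query.strip().lower()
        if not needle:
            return []
        # Return ids of resources whose text layer contains the query.
        return sorted(rid for rid, text in self.texts.items() if needle in text)


index = TextIndex()
index.add("report.pdf", "Quarterly revenue grew by ten percent.")
index.add("notes.txt", "Remember to review the revenue forecast.")
index.add("photo.jpg", "")  # no text layer was extracted for this resource

matches = index.search("Revenue")
print(matches)
```

A substring scan over every resource is the simplest option. A larger collection would probably need an inverted index or a full-text search library, which is outside the scope of this sketch.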

sisco0 commented 2 years ago

I suggest we go further and use Tesseract for OCR text recognition on images and PDF documents in English (and possibly other languages). That way, text metadata could be attached to every PDF and image file, not only to plain-text files.

The following points should be taken into account:

We should study what can be done for Microsoft Office and LibreOffice (rich-formatted) documents.
We should study what can be done with binary files.
If required, we could use Tesseract's TryGetBoundingBox function to highlight results in PDF and image files in a detailed search results view.
For rich-formatted documents, we would need a different solution from the one described in the point above.
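
As a rough illustration of the highlighting idea, the sketch below assumes that OCR has already produced a list of recognized words together with their bounding boxes. The `OcrWord` layout and the `boxes_for_query` helper are hypothetical. They are not Tesseract's actual output format or API, only a stand-in for whatever the OCR step returns.

```python
from __future__ import annotations

from typing import NamedTuple


class OcrWord(NamedTuple):
    """Hypothetical OCR result: one recognized word and its box in pixels."""

    text: str
    left: int
    top: int
    width: int
    height: int


def boxes_for_query(words: list[OcrWord], query: str) -> list[tuple[int, int, int, int]]:
    # Collect the boxes of all words containing the query (case-insensitive),
    # so that a results view can draw highlights over them.
    needle = query.strip().lower()
    if not needle:
        return []
    return [
        (w.left, w.top, w.width, w.height)
        for w in words
        if needle in w.text.lower()
    ]


# Made-up sample page, for illustration only.
page = [
    OcrWord("Invoice", 10, 12, 80, 18),
    OcrWord("total:", 10, 40, 52, 18),
    OcrWord("Total", 10, 90, 50, 18),
]

highlights = boxes_for_query(page, "total")
print(highlights)
```

This only matches single words. Highlighting a multi-word query would need the boxes of adjacent words to be merged.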

kirillt commented 2 years ago

Good thoughts. I've just created a separate issue for the text layer, since it can also be used for tag suggestions: #183

ARK-Builders / ARK-Navigator

Text search #178