MohrJonas / obsidian-ocr

Obsidian OCR allows you to search for text in your images and pdfs
GNU General Public License v3.0
279 stars, 5 forks

[FR] Indexing Microsoft PowerPoint/Word files #43

Open khesed opened 1 year ago

khesed commented 1 year ago

Would it be possible to extend the indexing to other Microsoft Office file types, for example PPTX and DOCX?

I see at least two possible approaches. The first is to convert PPTX and DOCX files into one image per slide or page and then run OCR on those images. The conversion could be done with a tool such as unoconv.
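As a rough sketch of that first approach, the snippet below only builds the command line for a conversion step. It assumes unoconv is installed and converts to PDF rather than directly to images, on the assumption that the resulting PDF could then go through the plugin's existing PDF handling. The helper names (`isOfficeFile`, `buildUnoconvArgs`) and the `ocr-cache` directory are hypothetical and not part of obsidian-ocr.

```typescript
// Office extensions named in this request; the set is an illustrative assumption.
const OFFICE_EXTENSIONS = [".pptx", ".docx"];

// Returns true when the path ends in one of the extensions above (case-insensitive).
function isOfficeFile(path: string): boolean {
  const lower = path.toLowerCase();
  return OFFICE_EXTENSIONS.some((ext) => lower.endsWith(ext));
}

// Builds the argument list for `unoconv -f pdf -o <outputDir> <inputPath>`.
// -f selects the output format and -o the output location.
function buildUnoconvArgs(inputPath: string, outputDir: string): string[] {
  return ["-f", "pdf", "-o", outputDir, inputPath];
}

// Hypothetical file and cache directory, used only to show the resulting command.
const file = "slides/Intro.PPTX";
if (isOfficeFile(file)) {
  console.log(["unoconv", ...buildUnoconvArgs(file, "ocr-cache")].join(" "));
}
```

The argument list could be handed to something like Node's `child_process.execFile`. Note that unoconv itself relies on a LibreOffice or OpenOffice installation, so this route adds an external dependency for users.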

The second is to use a library that exposes the internal data of those file types, such as python-pptx. That would be closer to extending Obsidian's search function in general, which may be out of scope for this project. So I think the first approach is the more reasonable one here.

MohrJonas commented 1 year ago

Interesting idea. This will definitely involve a lot of work. The main problem I see at the moment is parsing the PPTX and DOCX files while only using JS.

khesed commented 1 year ago

Yeah, I can see how this can be challenging.

There are some pure-JS libraries for individual file formats, like js-pptx and js-ppt.

There are also libraries that try to handle everything, like any-text, but then you need to dig through their dependencies to check whether they are really pure JS.
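For a sense of what a dependency-free route might involve: PPTX and DOCX files are ZIP archives of XML parts, with slide text stored in `a:t` elements and Word body text in `w:t` elements. The sketch below pulls text out of one already-unzipped XML string using a regular expression. It is a simplification, not a full parser: unzipping is not shown and would still need a ZIP library, and the function name `extractRunText` is hypothetical.

```typescript
// Decodes the five predefined XML entities; "&amp;" goes last so that
// text like "&amp;lt;" is not decoded twice.
function decodeXmlEntities(text: string): string {
  return text
    .replace(/&lt;/g, "<")
    .replace(/&gt;/g, ">")
    .replace(/&quot;/g, '"')
    .replace(/&apos;/g, "'")
    .replace(/&amp;/g, "&");
}

// Collects the text content of every <a:t> (PPTX) or <w:t> (DOCX) element.
// Assumes the XML part has already been read out of the archive as a string.
function extractRunText(xml: string, tag: "a:t" | "w:t"): string[] {
  const pattern = new RegExp(`<${tag}(?:\\s[^>]*)?>([^<]*)</${tag}>`, "g");
  const runs: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(xml)) !== null) {
    runs.push(decodeXmlEntities(match[1]));
  }
  return runs;
}

// Minimal hand-written slide fragment, used only for illustration.
const slideXml =
  '<a:p><a:r><a:t>Hello</a:t></a:r>' +
  '<a:r><a:t xml:space="preserve"> R&amp;D</a:t></a:r></a:p>';
console.log(extractRunText(slideXml, "a:t").join(""));
```

This only recovers text that is stored as text. Anything embedded as a picture inside a slide or page would still need OCR.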