Is it possible to extend the algorithm to index other filetypes from Microsoft Office? For example, pptx and docx.
I think there are at least two possible approaches. The first would be to convert the pptx and docx files to one image per slide/page and then run OCR on those images. The conversion could be done with the unoconv tool.
The second would be to use an interface that exposes the internal data of those file types, like the python-pptx library. This would be more akin to extending Obsidian's search function in general, which may be out of scope for this project. So, I think the first approach might be more reasonable here.
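To make the second approach a bit more concrete: docx and pptx files are ZIP archives containing XML parts, so the text can be read without rendering anything. Below is a minimal sketch using only the Python standard library instead of python-pptx. The function name `extract_text` is my own, and it only looks at the main document body and the slide parts (no headers, footnotes, speaker notes, or embedded images), so treat it as an illustration rather than a complete extractor.

```python
import zipfile
import xml.etree.ElementTree as ET

# Qualified tag names of the text-run elements in WordprocessingML (docx)
# and DrawingML (pptx).
W_T = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}t"
A_T = "{http://schemas.openxmlformats.org/drawingml/2006/main}t"


def extract_text(file_obj):
    """Return the text runs of a .docx or .pptx file, keyed by archive part name.

    `file_obj` may be a path or a binary file-like object.
    Hypothetical helper for illustration only.
    """
    texts = {}
    with zipfile.ZipFile(file_obj) as archive:
        # Note: sorted() is lexicographic, so slide10.xml sorts before slide2.xml.
        for name in sorted(archive.namelist()):
            if name == "word/document.xml":
                tag = W_T
            elif name.startswith("ppt/slides/slide") and name.endswith(".xml"):
                tag = A_T
            else:
                continue  # skip styles, relationships, media, and so on
            root = ET.fromstring(archive.read(name))
            # Join the text of every run found in this part.
            texts[name] = " ".join(node.text or "" for node in root.iter(tag))
    return texts
```

Calling `extract_text("notes.docx")` (a hypothetical path) would return a dict with a single `word/document.xml` entry, while a pptx file would give one entry per slide. Text that only exists inside images would still need the OCR route from the first approach.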
Interesting idea.
This will definitely involve a lot of work.
The main problem I see at the moment is parsing the PPTX and DOCX files while only using JS.