freedmand / semantra

Multi-tool for semantic search
MIT License
2.48k stars 138 forks

Support Microsoft Office file formats #23

Open ellipticview opened 1 year ago

ellipticview commented 1 year ago

Most of the documents I would like to search are PowerPoint files (ppt or pptx). It would be nice if PowerPoint and Word documents could be indexed, even without a preview option.

caojinbo commented 1 year ago

This will be an excellent feature to add.

freedmand commented 1 year ago

Looking into Apache Tika for this via tika-python. It does require Java to be installed but seems robust and permissively licensed. Open to another solution that has fewer dependencies, but I haven't found a good one yet.
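
For anyone following along, here is a rough sketch of what text extraction through tika-python might look like. It is not code from semantra: `extract_office_text`, `is_office_file`, and the extension list are hypothetical names chosen for illustration. The only library call used is `tika.parser.from_file`, which returns a dict whose `"content"` entry holds the extracted text. It assumes `pip install tika` and a working Java runtime.

```python
from pathlib import Path

# Illustrative choice of extensions, not a list taken from semantra.
OFFICE_EXTENSIONS = {".doc", ".docx", ".ppt", ".pptx"}


def is_office_file(path):
    """Return True if the path has one of the Office extensions above."""
    return Path(path).suffix.lower() in OFFICE_EXTENSIONS


def extract_office_text(path):
    """Extract plain text from an Office file with Apache Tika.

    Hypothetical helper: assumes tika-python is installed and Java is available.
    """
    # Imported lazily so this module still loads when tika is not installed.
    from tika import parser

    parsed = parser.from_file(str(path))
    # "content" can be None when Tika finds no text in the file.
    return (parsed.get("content") or "").strip()
```

Calling `extract_office_text("slides.pptx")` would return the deck's text as a single string, which could then be chunked and embedded like any other plain-text document. No preview is produced, which matches the "index only" request above.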