dbashford / textract

node.js module for extracting text from html, pdf, doc, docx, xls, xlsx, csv, pptx, png, jpg, gif, rtf and more!
MIT License
1.64k stars, 185 forks

Support for .mobi extension? #107

Open josephrocca opened 7 years ago

josephrocca commented 7 years ago

I was going to mention this in #106, but I figured it'd be better to create a separate issue. .mobi is another very popular ebook format (Kindle uses it). It'd be very nice if this lib could handle this extension :)

jsinmotion commented 7 years ago

MOBI looks like binary soup: http://wiki.mobileread.com/wiki/MOBI

pandoc seems to lean on kindlegen: https://www.amazon.com/gp/feature.html?docId=1000765211 -- maybe we should just interface with that command in the same way that antiword is used?
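For illustration, here is a minimal sketch of what "interfacing with a command" could look like in Node. It is not textract's actual implementation: `extractWithCommand` is a hypothetical helper name, and nothing in it is specific to antiword or kindlegen. It only shows the general pattern of running an installed CLI tool and capturing its stdout.

```javascript
const { execFile } = require('child_process');

// Hypothetical helper: run an external command-line tool and hand its
// stdout to the callback. `command` and `args` are supplied by the caller.
function extractWithCommand(command, args, callback) {
  // Raise the buffer limit, since extracted text can be large.
  execFile(command, args, { maxBuffer: 16 * 1024 * 1024 }, (error, stdout) => {
    if (error) {
      // A binary that is not installed surfaces as error.code === 'ENOENT'.
      callback(error, null);
      return;
    }
    callback(null, stdout.trim());
  });
}
```

With antiword installed, a call such as `extractWithCommand('antiword', ['report.doc'], cb)` would pass the text antiword prints to `cb` (`report.doc` is a placeholder file name). Whether kindlegen can be driven the same way would need checking against its own documentation.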

josephrocca commented 7 years ago

Ah yeah, sounds like it would make things easier. What about licensing, though? https://www.amazon.com/gp/feature.html?docId=1000599251

jsinmotion commented 7 years ago

Is the license relevant if textract isn't shipping with kindlegen? It could use kindlegen if available and fall back to some other tool or a hand-rolled solution if the target doesn't have it. DOC support is provided by textutil on OS X and antiword on Linux, neither of which is installed by the textract npm module.
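The "use it if available, otherwise fall back" idea can be sketched roughly as below. This is an assumption-laden illustration, not code from textract: `isOnPath` and `chooseExtractor` are hypothetical names, and the check is a plain scan of the PATH directories for an executable file.

```javascript
const fs = require('fs');
const path = require('path');

// Hypothetical helper: true if an executable called `name` exists in one of `dirs`.
function isOnPath(name, dirs) {
  // On Windows, executables carry an extension listed in PATHEXT.
  const exts = process.platform === 'win32'
    ? (process.env.PATHEXT || '.EXE;.CMD;.BAT').split(';')
    : [''];
  return dirs.some((dir) => exts.some((ext) => {
    try {
      fs.accessSync(path.join(dir, name + ext), fs.constants.X_OK);
      return true;
    } catch (err) {
      return false;
    }
  }));
}

// Hypothetical helper: pick the first candidate whose command is installed,
// or return `fallback` when none of them is. `dirs` defaults to the
// directories listed in the PATH environment variable.
function chooseExtractor(candidates, fallback, dirs) {
  const searchDirs = dirs ||
    (process.env.PATH || '').split(path.delimiter).filter(Boolean);
  const found = candidates.find((c) => isOnPath(c.command, searchDirs));
  return found || fallback;
}
```

A caller might pass `[{ command: 'kindlegen' }]` as the candidates and a hand-rolled parser object as the fallback. Since the tool is looked up on the user's machine at runtime, nothing has to be bundled with the npm module.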

josephrocca commented 7 years ago

Ah yeah, good point! If it isn't shipping with it then that makes sense :+1: