deanmalmgren / textract

extract text from any document. no muss. no fuss.
http://textract.readthedocs.io
MIT License
3.86k stars 592 forks

FR: Make SpeechRecognition etc. large AI libs just "extra" dependencies. #451

Open kxrob opened 1 year ago

kxrob commented 1 year ago

SpeechRecognition is a massive dependency, as is pocketsphinx, and possibly others too. Making those dependencies "extra" would remove a lot of distribution load, burden and install errors.

Anyway, one wouldn't expect a tool called "textract" to run complex AI recognition tools so casually, given that they are unstable and non-deterministic. One wouldn't use those in serious projects. Usually such file types need to be filtered out before letting textract try.

So these massive AI libraries should all become "extra" dependencies at least.
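
For illustration, here is a minimal sketch of how such a split could be expressed with setuptools extras. It is not textract's actual packaging: the core package list, the extra name `audio`, and the helper `build_setup_kwargs` are assumptions made up for this example. Only the SpeechRecognition pin and the mention of pocketsphinx come from this thread.

```python
# Hypothetical requirement split; NOT textract's real dependency list.
# Core: lightweight parsers for common document formats (illustrative names).
CORE_REQUIRES = [
    "docx2txt",
    "pdfminer.six",
]

# Optional: heavy speech-recognition packages, installed only on request.
EXTRAS_REQUIRE = {
    "audio": [
        "SpeechRecognition~=3.8.1",
        "pocketsphinx",
    ],
}
# Convenience extra that pulls in every optional group.
EXTRAS_REQUIRE["all"] = sorted(
    {req for reqs in EXTRAS_REQUIRE.values() for req in reqs}
)


def build_setup_kwargs():
    """Return keyword arguments to pass to setuptools.setup() in setup.py."""
    return {
        "install_requires": CORE_REQUIRES,
        "extras_require": EXTRAS_REQUIRE,
    }
```

Under these assumptions, a plain `pip install textract` would skip the audio packages, and `pip install "textract[audio]"` would add them.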

pencil commented 1 year ago

I came here to suggest the same thing: It would be great if textract was more lightweight by default. I only need something to extract text from common document formats such as .pdf, .rtf, .docx. The dependency on SpeechRecognition is problematic because its massive size greatly increases the build time of our project and substantially increases the size of the resulting Docker image.

As @kxrob suggested, the dependency could be moved to "extra" and the tool could provide clear instructions if the package is unavailable when trying to extract text from an audio file, e.g. "Extracting text from audio files is an optional feature. Please run pip install SpeechRecognition~=3.8.1".
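
A rough sketch of that behaviour, assuming the audio parser imports the package lazily, might look like the following. The names `require_optional`, `MissingOptionalDependency`, and `extract_audio_text` are hypothetical and not part of textract. The `speech_recognition` calls are the SpeechRecognition package's own API, and they are only reached when that package is installed.

```python
import importlib

# Message text taken from the suggestion above.
AUDIO_HINT = (
    "Extracting text from audio files is an optional feature. "
    "Please run pip install SpeechRecognition~=3.8.1"
)


class MissingOptionalDependency(ImportError):
    """Raised when an optional feature's package is not installed."""


def require_optional(module_name, hint):
    """Import module_name on demand, or fail with an install hint."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise MissingOptionalDependency(hint) from exc


def extract_audio_text(path):
    """Hypothetical audio entry point; imports the heavy package only here."""
    sr = require_optional("speech_recognition", AUDIO_HINT)
    recognizer = sr.Recognizer()
    with sr.AudioFile(path) as source:
        audio = recognizer.record(source)  # read the whole file
    return recognizer.recognize_sphinx(audio)
```

Because the import happens inside the audio code path, users who only process .pdf, .rtf, or .docx files would never need the package, and anyone who passes an audio file without it would see the install hint instead of a bare ImportError.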