deanmalmgren / textract

extract text from any document. no muss. no fuss.
http://textract.readthedocs.io
MIT License

Is there any chance to install this in AWS Lambda? #317

Open adantart opened 4 years ago

adantart commented 4 years ago

My question is: which libraries and files need to be included in the zip for AWS Lambda?

jpweytjens commented 4 years ago

It depends on which file type you want to parse. Textract is just a wrapper for external parsers. Most are Python packages, but textract uses external CLI tools as well.
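Because the external CLI tools are not Python packages, they would have to be bundled separately or already be present in the Lambda runtime. Here is a minimal sketch for checking which of them are on the `PATH`. The extension-to-tool mapping is an illustrative assumption, not an exhaustive list, and `missing_tools` is a hypothetical helper rather than part of textract:

```python
import shutil

# Illustrative mapping (assumption): file extension -> external CLI tool
# that the corresponding parser shells out to.
CLI_TOOLS = {
    ".doc": "antiword",
    ".pdf": "pdftotext",
    ".rtf": "unrtf",
    ".png": "tesseract",
}


def missing_tools(extensions, tools=CLI_TOOLS, which=shutil.which):
    """Return the sorted CLI tool names needed for `extensions` that are not on PATH."""
    needed = {tools[ext] for ext in extensions if ext in tools}
    return sorted(name for name in needed if which(name) is None)


if __name__ == "__main__":
    # Prints the tools that would still have to be shipped for these file types.
    print(missing_tools([".pdf", ".doc"]))
```

The `which` parameter is only there so the lookup can be swapped out when testing; by default it uses `shutil.which`.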

skjortan23 commented 4 years ago

I think what the OP means is that the resulting zip file, with all dependencies installed, is about 55 MB. That is over Lambda's 50 MB limit for deployment packages, so it cannot be deployed to AWS Lambda.

This is a showstopper for me as well.

`An error occurred (RequestEntityTooLargeException) when calling the CreateFunction operation: Request must be smaller than 69905067 bytes for the CreateFunction operation`

`This is likely because the deployment package is 54.9 MB. Lambda only allows deployment packages that are 50.0 MB or less in size. To avoid this error, decrease the size of your chalice application by removing code or removing dependencies from your chalice application.`
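To check a package against that limit before uploading, something like the following could be used. It is a rough sketch: `zip_directory` and `fits_in_lambda` are hypothetical helpers, and the limit is assumed to be 50 MiB (50 × 1024 × 1024 bytes), based on the "50.0 MB" in the message above.

```python
import os
import zipfile

# Assumption: the "50.0 MB" limit from the error message, read as 50 MiB.
LAMBDA_ZIP_LIMIT_BYTES = 50 * 1024 * 1024


def zip_directory(src_dir, zip_path):
    """Zip `src_dir` recursively and return the archive size in bytes.

    `zip_path` should live outside `src_dir`, otherwise the archive
    would try to include itself.
    """
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(src_dir):
            for name in files:
                full = os.path.join(root, name)
                # Store paths relative to src_dir so the zip has no absolute paths.
                zf.write(full, os.path.relpath(full, src_dir))
    return os.path.getsize(zip_path)


def fits_in_lambda(size_bytes, limit=LAMBDA_ZIP_LIMIT_BYTES):
    """Return True if a zipped package of `size_bytes` is within the limit."""
    return size_bytes <= limit
```

For example, `fits_in_lambda(zip_directory("build", "/tmp/package.zip"))` would report whether the contents of a (hypothetical) `build` directory are small enough.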

jpweytjens commented 4 years ago

I have no experience with AWS Lambda, so thank you for pointing that out.

Most of the dependencies are very small, but SpeechRecognition takes up ~32 MB. I'll think about making some dependencies optional to reduce the file size.
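To see which installed dependencies take up the most space, the sizes of the entries in the directory being packaged can be listed. This is a minimal sketch with hypothetical helper names; `packages_dir` is assumed to be wherever the dependencies were installed (for example a `site-packages` folder):

```python
import os


def dir_size(path):
    """Total size in bytes of all regular files under `path`."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if not os.path.islink(full):  # skip symlinks to avoid double counting
                total += os.path.getsize(full)
    return total


def largest_entries(packages_dir, top=5):
    """Return the `top` largest entries in `packages_dir` as (name, bytes) pairs."""
    sizes = []
    for entry in os.listdir(packages_dir):
        full = os.path.join(packages_dir, entry)
        size = dir_size(full) if os.path.isdir(full) else os.path.getsize(full)
        sizes.append((entry, size))
    return sorted(sizes, key=lambda item: item[1], reverse=True)[:top]
```

Note that these are uncompressed sizes, so they only approximate each package's share of the final zip.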

skjortan23 commented 4 years ago

@jpweytjens Yes. To work around this, I made a fork yesterday, commented out the dependency on SpeechRecognition, and successfully deployed to Lambda.

So making the SpeechRecognition dependency optional would be great.
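One common way to make a heavy dependency optional is to import it lazily, only when the feature that needs it is used. The sketch below illustrates that general pattern; it is not how textract is implemented, and `optional_import` is a hypothetical helper:

```python
import importlib


def optional_import(module_name, feature):
    """Import `module_name` on demand, with a clear error if it is not installed."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{feature} requires the optional package '{module_name}', "
            "which is not installed."
        ) from exc
```

An audio parser could then call `optional_import("speech_recognition", "audio transcription")` at the point where it actually needs the library, so that installs without SpeechRecognition still work for every other file type.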