codelucas / newspaper

newspaper3k is a news, full-text, and article metadata extraction library for Python 3. Advanced docs:
https://goo.gl/VX41yK
MIT License
13.99k stars 2.11k forks

Running in Amazon Lambda #237

Open JamesChevalier opened 8 years ago

JamesChevalier commented 8 years ago

I got this running in an Amazon Lambda function, and I wanted to share how I did it just in case it was useful for others.

This gist covers the details, including the Lambda function itself. The one huge caveat is that this only applies to the Python 2.7 version, because that's what Amazon Lambda provides.

I wasn't sure where to put this, since it isn't really an Issue (in the problematic sense). I also didn't want to be the first person to add to the Wiki, especially with something so specific.

yprez commented 8 years ago

@JamesChevalier thanks, looks useful! Not sure where to put this either...

bisoldi commented 7 years ago

Lambda now supports Python 3.6. Any thoughts on how to get Newspaper 3 deployed to AWS Lambda? I've been trying to figure out how to build Newspaper 3 on an EC2 instance (with the same AMI as Lambda); however, Amazon Linux doesn't come with Python 3.6 and I can't get it installed. Unfortunately, that's the extent of my Python knowledge...

If you have any thoughts / suggestions, I'm happy to continue working on it as I would love to get it running as a standalone service.

Thanks!

JamesChevalier commented 7 years ago

What trouble(s) are you running into when attempting to install Python 3.6 in Amazon Linux?

Another approach that might work is to try doing the build process through LambCI's Lambda Docker image: https://hub.docker.com/r/lambci/lambda/

bisoldi commented 7 years ago

Thanks for responding. Well, the AMI doesn't have Python 3.6 in the yum repository and I haven't found any instructions on how to install it without yum. It has 3.4 and 3.5 but I wasn't sure if building against either would work in a 3.6 runtime.

vitaly-zdanevich commented 7 years ago

AWS Lambda can write only to /tmp, so in settings.py we need to change DATA_DIRECTORY from .newspaper_scraper to /tmp/.newspaper_scraper. Also, I do not know how to determine from Python that we are running inside AWS Lambda. Maybe check for an environment variable like AWS_LAMBDA_FUNCTION_NAME?
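A minimal sketch of the idea in this comment, assuming that the presence of AWS_LAMBDA_FUNCTION_NAME is a good enough signal that the code is running inside Lambda. The helper names `running_in_lambda` and `data_directory` are hypothetical and are not part of newspaper's settings.py; the non-Lambda branch simply keeps the `.newspaper_scraper` value quoted above.

```python
import os

DATA_DIRECTORY_NAME = ".newspaper_scraper"  # value quoted in the comment above
LAMBDA_WRITABLE_ROOT = "/tmp"               # the only writable path on Lambda, per the comment


def running_in_lambda(environ=os.environ):
    """Guess whether we are inside AWS Lambda (assumption: this variable is set there)."""
    return "AWS_LAMBDA_FUNCTION_NAME" in environ


def data_directory(environ=os.environ):
    """Return a data directory that is writable in the current environment."""
    if running_in_lambda(environ):
        return os.path.join(LAMBDA_WRITABLE_ROOT, DATA_DIRECTORY_NAME)
    return DATA_DIRECTORY_NAME
```

Passing the environment as a parameter keeps the check easy to exercise outside Lambda, for example `data_directory({"AWS_LAMBDA_FUNCTION_NAME": "my-fn"})`.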

bisoldi commented 7 years ago

I was finally able to deploy newspaper3k to AWS Lambda via CodeBuild --> CloudFormation; however, I can only get the download() and parse() functions to work. Calling nlp() throws an SQLite error from the NLTK library. I've done some searching and communicated with AWS about this, and it appears that SQLite is expected to be embedded within Python and the Python 3.6 runtime on Lambda does not have it. I've tried compiling and building SQLite into my app, but that didn't work. I've filed a request with AWS to both create an AMI with a Python 3.6 environment for CodeBuild and to embed SQLite into the Python 3.6 runtime.
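A rough sketch of a handler that matches the situation described here: download() and parse() always run, while nlp() is attempted only when sqlite3 can be imported. This is not the deployed function from this thread. The `event["url"]` input shape, the `article_cls` parameter (added so the handler can be exercised without the library), and the assumption that the SQLite failure shows up as an ImportError are all illustrative.

```python
import importlib


def module_available(name):
    """Return True if `name` can be imported in the current runtime."""
    try:
        importlib.import_module(name)
    except ImportError:
        return False
    return True


def lambda_handler(event, context, article_cls=None):
    # Assumed input shape: {"url": "<article url>"}.
    if article_cls is None:
        # Imported lazily so this module still loads where newspaper is not packaged.
        from newspaper import Article as article_cls

    article = article_cls(event["url"])
    article.download()
    article.parse()
    result = {"title": article.title, "text": article.text}

    # nlp() is the step reported to fail, so skip it when sqlite3 is missing.
    if module_available("sqlite3"):
        article.nlp()
        result["keywords"] = article.keywords
        result["summary"] = article.summary
    return result
```

Skipping nlp() instead of catching its error keeps the basic extraction usable on a runtime without SQLite, at the cost of returning no keywords or summary there.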

cdimitroulas commented 6 years ago

any update on this?

bisoldi commented 6 years ago

I don't have one, except to say that AWS just recently released an AMI that has Python 3.6 already installed and when I filed the request, they did indicate they already knew about the SQLite issue and were considering adding it in. I haven't checked it though....

byrro commented 6 years ago

Thank you @vitaly-zdanevich, it nailed the problem!

will3216 commented 6 years ago

For those who want to run newspaper3k on AWS Lambda: I got it working and published this template to hopefully save people some time! https://github.com/will3216/newspaper3k_lambda_template

The dependencies are pre-built and checked in, and it works with nltk and whatnot. Instructions for adding additional dependencies are included in the readme, but by default it should work out of the box.

bisoldi commented 6 years ago

@will3216 Dude....How did you get the NLTK stuff to work?

I spent far more time than I care to admit trying to get sqlite3 to work in Lambda and couldn't get it to work! AWS even confirmed it's a known issue!

will3216 commented 6 years ago

@bisoldi Ha! Yeah, that was a pain... I manually copied in a file that AWS's Python build was missing, taken from this project: https://github.com/Miserlou/lambda-packages

I just made some changes to the template I posted above which now allows you to modify the dependencies you are using by using docker to spin up an Amazon Linux AMI to dynamically build/package your lambda function along with its dependencies.

bisoldi commented 6 years ago

@will3216 I also got it to work by simply dropping the sqlite library in. I then integrated CircleCI and, in a different repo, implemented the modified newspaper library. If there is any interest, I might open source the modifications.

I assumed that would not be a change acceptable in a PR, unless @codelucas wants it?

vitaly-zdanevich commented 5 years ago

This issue is resolved. It looks like it is enough to use tempfile.gettempdir() in settings.py.

For the sqlite problem I have this:

import imp
import sys
sys.modules['sqlite'] = imp.new_module('sqlite')
sys.modules['sqlite3.dbapi2'] = imp.new_module('sqlite.dbapi2')

See https://stackoverflow.com/a/44532317/1879101
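Note that the `imp` module used in the snippet above was removed in Python 3.12. A minimal sketch of the same workaround on newer interpreters, assuming `types.ModuleType` is an acceptable substitute for `imp.new_module`. The keys and module names are copied unchanged from the snippet above; `install_sqlite_stubs` is a hypothetical helper name.

```python
import sys
import types


def install_sqlite_stubs(modules=None):
    """Register empty placeholder modules under the two names used above."""
    if modules is None:
        modules = sys.modules
    # The placeholders have no attributes: they only stop the import from
    # failing, so any code that really needs SQLite will still break.
    modules["sqlite"] = types.ModuleType("sqlite")
    modules["sqlite3.dbapi2"] = types.ModuleType("sqlite.dbapi2")
    return modules
```

As with the original snippet, this would have to be called before the import that triggers the SQLite error, for example `install_sqlite_stubs()` followed by `import nltk`.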

UPD: OK, I agree: it is better not to use sqlite when the client code does not use it either, to keep things simple for the user. I hope that will be implemented too.

palmerabollo commented 4 years ago

Does anyone know a Lambda Layer containing all the requirements to run newspaper on AWS Lambda?

bisoldi commented 4 years ago

I haven't open sourced it (yet??) but I did get it to work as a layer.

Aditya94A commented 4 years ago

Does anyone have a 2020 way of doing this with the latest library version?

Aditya94A commented 4 years ago

@bisoldi Please do open source your solution, it would be extremely helpful to everyone 😁

bisoldi commented 4 years ago

I think I might... though I have found a great many inaccuracies in the extraction of the article's publish date. Do you (does anyone) know if there are any improvements possible in that area?