codelucas / newspaper

newspaper3k is a news, full-text, and article metadata extraction library for Python 3. Advanced docs:
https://goo.gl/VX41yK
MIT License
14.15k stars 2.12k forks

Possibility to scrape Articles without installing image libs #453

Closed: KillerCodeMonkey closed this issue 7 years ago

KillerCodeMonkey commented 7 years ago

Heyho,

Thanks for the great URL article scraping tool. It would be nice to make the image downloading functionality optional, so we can use your piece of sweet (code) cake without installing all the image libs (PIL, Pillow, libpng, ...).

Background:

It would be nice if I could use your lib on Amazon's AWS Lambda without building a custom package with all the needed packages, because you are not able to install system packages like libpng there.

Thanks!

mlapierre commented 7 years ago

You can disable downloading of images via fetch_images. E.g.:

from newspaper import Article

article = Article(url='http://cnn.com', fetch_images=False)

More here. I'm able to use newspaper that way in a Docker container without installing any image libs, so hopefully that works for you too.

KillerCodeMonkey commented 7 years ago

I already set fetch_images to False.

But I get the error: Unable to import module 'newspaper3k': cannot import name '_imaging'
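The error above is raised while the module is being imported, which would explain why it shows up regardless of fetch_images. A minimal diagnostic sketch, assuming the failure comes from Pillow's compiled `PIL._imaging` extension being missing from the deployment package or built for a different platform. The helper name `can_import` is hypothetical and uses only the standard library:

```python
import importlib


def can_import(module_name):
    """Return (True, None) if the module imports, else (False, error text)."""
    try:
        importlib.import_module(module_name)
    except Exception as exc:  # broad on purpose: this is only a diagnostic
        return False, "{}: {}".format(type(exc).__name__, exc)
    return True, None


# Check Pillow from the compiled extension up to the public module.
for name in ("PIL._imaging", "PIL.Image"):
    ok, error = can_import(name)
    print(name, "OK" if ok else "FAILED ({})".format(error))
```

Running this inside the same environment as the function (rather than on the build machine) shows which layer is actually broken.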

KillerCodeMonkey commented 7 years ago

Okay, I got it: newspaper is now found, but it seems to create a data directory, .newspaper_scraper, at a path that does not exist on Lambda:

START RequestId: 4fc82e64-b3de-11e7-baae-7b689a397efc Version: $LATEST
module initialization error: [Errno 2] No such file or directory: '/home/sbx_user1062/.newspaper_scraper'
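One possible workaround, sketched under the assumption that newspaper builds this path from the user's home directory (as the '/home/sbx_user1062/...' path in the log suggests): point HOME at the system temp directory before newspaper is imported. The helper name `point_home_at_tmp` is hypothetical, and this is not a documented newspaper option.

```python
import os
import tempfile


def point_home_at_tmp():
    """Set HOME to the system temp directory and return that path."""
    tmp_dir = tempfile.gettempdir()  # typically /tmp
    os.environ["HOME"] = tmp_dir
    return tmp_dir


# This has to run before the first `import newspaper`, because the failure
# in the log happens during module initialization.
home = point_home_at_tmp()
print("HOME is now", home)
```

The import of newspaper would then follow these lines in the handler module, not precede them.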

KillerCodeMonkey commented 7 years ago

@codelucas maybe it is possible to add an environment variable to set the base .newspaper_scraper path?

Created #462.
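A rough sketch of what such an environment variable lookup could look like. Everything here is hypothetical: the variable name `NEWSPAPER_DATA_DIR` and the function `resolve_data_dir` are made up for illustration and are not part of newspaper.

```python
import os
import tempfile

# Hypothetical names: neither the variable nor the function exists in newspaper.
ENV_VAR = "NEWSPAPER_DATA_DIR"
DEFAULT_DIR_NAME = ".newspaper_scraper"


def resolve_data_dir(environ=None):
    """Return the data directory, creating it if it does not exist yet."""
    environ = os.environ if environ is None else environ
    base = environ.get(ENV_VAR)
    if not base:
        # No override given: fall back to the system temp directory,
        # which is usually writable.
        base = os.path.join(tempfile.gettempdir(), DEFAULT_DIR_NAME)
    os.makedirs(base, exist_ok=True)
    return base
```

With something like this, a Lambda function could set the variable to a directory under /tmp in its configuration, and other environments would keep working without any change.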