codelucas / newspaper

newspaper3k is a library for news, full-text, and article metadata extraction in Python 3. Advanced docs:
https://goo.gl/VX41yK
MIT License

Can't extract images from german newspapers #96

Closed by boop5 9 years ago

boop5 commented 9 years ago

Works fine for me with many English newspapers. Unfortunately, it works horribly with German newspapers. I've tried many of them...

I was following the how-to on readthedocs.

# ...
p = newspaper.build('...', language='de')
for a in p.articles:
    print a.top_image if a.top_image else 'nope'

I only get nopes ;-)

Is there a way I can help you debug this?

codelucas commented 9 years ago

Are you using newspaper3k (the default, Python 3 version) or the Python 2 version?

boop5 commented 9 years ago

2 - installed with pip

Python 2.7.6 (default, Mar 22 2014, 22:59:56)
[GCC 4.8.2] on linux2

from version.py

version_info = (0, 0, 8)

this random article is working on the newspaper demo but not in my script.

oh and something else. when I'm trying to build() the same page twice - it won't find anything on the second time. I'm probably doing something wrong here. Guess it's cache related


Well, let me update this again. Check this demo and then check the article itself. I think it should be obvious that the extracted top_image isn't correct, right? :relaxed: In this case it would be easy to use og:image - just thinking out loud right now.

codelucas commented 9 years ago

It should be pulling from og:image ... I'll look into this as soon as possible.
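For readers unfamiliar with og:image: it is an Open Graph `<meta>` tag in the page head that names the publisher's preferred lead image. The snippet below is a minimal standard-library sketch of reading that tag. It is not newspaper's own extraction code, and `OgImageParser`, `find_og_image`, and the sample HTML are hypothetical names and data used only for illustration.

```python
from html.parser import HTMLParser


class OgImageParser(HTMLParser):
    """Collects the first <meta property="og:image" content="..."> value."""

    def __init__(self):
        super().__init__()
        self.og_image = None

    def handle_starttag(self, tag, attrs):
        # Only <meta> tags matter, and only the first og:image is kept.
        if tag != "meta" or self.og_image is not None:
            return
        attr_map = dict(attrs)
        if attr_map.get("property") == "og:image" and attr_map.get("content"):
            self.og_image = attr_map["content"]


def find_og_image(html):
    """Return the og:image URL declared in `html`, or None if absent."""
    parser = OgImageParser()
    parser.feed(html)
    return parser.og_image


# Made-up page used only to exercise the helper.
sample_html = """
<html><head>
  <meta property="og:title" content="Example article">
  <meta property="og:image" content="http://example.com/lead.jpg">
</head><body><img src="http://example.com/icon.png"></body></html>
"""
print(find_og_image(sample_html))
```

When the tag is missing, `find_og_image` returns `None`, which is the case where a fallback such as scanning the `<img>` tags would be needed.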

boop5 commented 9 years ago

Okay, thanks so far :+1:

codelucas commented 9 years ago

Can't replicate your first example:

Python 2.7.3 (default, Sep 26 2013, 20:03:06) 
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> 
>>> url = "http://www.bild.de/sport/fussball/eintracht-frankfurt/zambrano-ich-wollte-okazaki-nicht-weh-tun-37972970.bild.html"
>>> from newspaper import Article
>>> a= Article(url)
>>> a.download()
>>> a.parse()
>>> 
>>> a.top_img
u'http://bilder.bild.de/fotos/eintrachtfrankfurt-1-fsvmainz05_41024887_mbqf-1412163090-37973010/Bild/2.bild.jpg'

The top image extraction worked as expected.

"when I'm trying to build() the same page twice - it won't find anything on the second time. I'm probably doing something wrong here. Guess it's cache related"

This is expected behavior: newspaper caches articles automatically. You can change this behavior if you'd like; read more in the docs.

Reference our docs: http://newspaper.readthedocs.org/en/latest/user_guide/quickstart.html#article-caching

^ So many people have asked about that now (the article caching) that I've just accepted that I fucked up the API, should probably open an issue to just have users handle caching ...
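To make the caching behavior above concrete, here is a small sketch of why a second build of the same source can come back empty: URLs seen in an earlier run are remembered and filtered out. This is an illustration of the idea only, not newspaper's implementation; `SeenUrlCache` and `build_article_list` are hypothetical names. The linked quickstart section describes turning the real feature off by passing `memoize_articles=False` to `newspaper.build`; check the docs for the version you have installed.

```python
class SeenUrlCache:
    """Remembers article URLs that have already been handed out."""

    def __init__(self):
        self._seen = set()

    def filter_new(self, urls):
        """Return only URLs not seen before, then remember them."""
        fresh = [u for u in urls if u not in self._seen]
        self._seen.update(fresh)
        return fresh


def build_article_list(urls, cache, memoize=True):
    """Hypothetical stand-in for a build step with optional memoization."""
    if not memoize:
        # Caching disabled: every URL is returned on every call.
        return list(urls)
    return cache.filter_new(urls)


cache = SeenUrlCache()
found = ["http://example.com/a", "http://example.com/b"]

first = build_article_list(found, cache)
second = build_article_list(found, cache)
uncached = build_article_list(found, cache, memoize=False)
print(len(first), len(second), len(uncached))
```

The second call returns nothing because both URLs were recorded during the first call, which matches the "won't find anything on the second time" symptom reported earlier in the thread.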

codelucas commented 9 years ago

For your second example, I believe you've caught a great bug.

The images extracted from og: and <meta> tags are actually being filtered out by this particular line of code.

https://github.com/codelucas/newspaper/blob/python-2-head/newspaper/images.py#L227

The original reasoning was that we should always filter out images that are too small or have skewed dimensions. On second thought, an image specified via og: or <meta> really should not be filtered. The filter should only apply when no "top image" is specified in the HTML and we need to check image by image for the largest one. (newspaper falls back to largest-image selection for top_img only after it can't find a <meta>/og: tag.)

This will be fixed in both branches shortly!
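The selection order described in this comment can be sketched roughly as follows: a declared og:/`<meta>` image is returned as-is, and the size and aspect-ratio filter applies only on the fallback path that picks the largest image. This is a sketch under assumptions, not the code in images.py; `Candidate`, `pick_top_image`, and both threshold values are hypothetical and chosen only for illustration.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    """An <img> found in the page body, with its pixel dimensions."""
    url: str
    width: int
    height: int


# Hypothetical thresholds; the real values in newspaper may differ.
MIN_AREA = 5000
MAX_ASPECT_RATIO = 3.0


def passes_size_filter(img):
    """Reject images that are too small or too skewed to be a lead image."""
    if img.width <= 0 or img.height <= 0:
        return False
    ratio = max(img.width, img.height) / min(img.width, img.height)
    return img.width * img.height >= MIN_AREA and ratio <= MAX_ASPECT_RATIO


def pick_top_image(declared_url, candidates):
    """Prefer the declared og:/<meta> image; otherwise pick the largest usable one."""
    if declared_url:
        # A declared image is trusted and never size-filtered.
        return declared_url
    usable = [c for c in candidates if passes_size_filter(c)]
    if not usable:
        return None
    return max(usable, key=lambda c: c.width * c.height).url


images = [
    Candidate("http://example.com/icon.png", 16, 16),
    Candidate("http://example.com/banner.jpg", 1000, 50),
    Candidate("http://example.com/photo.jpg", 640, 480),
]
print(pick_top_image("http://example.com/og.jpg", images))
print(pick_top_image(None, images))
```

With a declared image the candidates are ignored entirely; without one, the icon is dropped for being too small and the banner for being too skewed, leaving the photo.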

codelucas commented 9 years ago

Your second example has been fixed on the python2 branch:

The fix was two-fold. The first bug was that meta/og images were being filtered by size and dimensions. The second bug (unfortunately a really core one) came from newspaper stripping GET params from article URLs, which is bad because some news sites use GET params to identify articles.

https://github.com/codelucas/newspaper/commit/d8f836de956a991099a2e14d60940962c6c7d4a2
https://github.com/codelucas/newspaper/commit/2a4912fd490c32f414704edd266bb6b959d0730a

Result:

Python 2.7.3 (default, Sep 26 2013, 20:03:06) 
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from newspaper import Article
>>> url = "http://www.hr-online.de/website/rubriken/nachrichten/indexhessen34938.jsp?rubrik=36094&key=standard_document_53891717"
>>> a = Article(url, language='de')
>>> a.download()
>>> a.parse()
>>> a.top_img
u'http://www.hr-online.de/servlet/de.hr.cms.servlet.IMS?enc=d3M9aHJteXNxbCZibG9iSWQ9MjEyNDYwNTMmaWQ9NTM4OTI3MDM_'
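The second bug mentioned above (dropping GET params) is easy to see with the hr-online.de URL from this session, where the query string carries the document key. The sketch below uses only `urllib.parse` from the standard library. `strip_query` and `keep_query` are hypothetical helpers written for illustration, not functions from newspaper, and the second URL's `key` value is made up.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_query(url):
    """Drop query string and fragment (the behavior that caused the bug)."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def keep_query(url):
    """Drop only the fragment, preserving GET params that identify the article."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, parts.query, ""))


base = "http://www.hr-online.de/website/rubriken/nachrichten/indexhessen34938.jsp"
article_a = base + "?rubrik=36094&key=standard_document_53891717"
# Hypothetical second article that differs only in its key parameter.
article_b = base + "?rubrik=36094&key=standard_document_00000000"

print(strip_query(article_a))
print(keep_query(article_a + "#top"))
```

After `strip_query`, both articles collapse to the same address, so one of them can no longer be fetched or told apart; `keep_query` keeps them distinct.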