codelucas / newspaper

newspaper3k is a news, full-text, and article metadata extraction library for Python 3. Advanced docs:
https://goo.gl/VX41yK
MIT License

Library is unable to extract authors from washingtonpost.com articles #519

Open WalterGR opened 6 years ago

WalterGR commented 6 years ago

I've tried these 3 URLs:

Transcript:

Python 3.6.4 (default, Feb 12 2018, 10:12:43) 
[GCC 4.2.1 Compatible Apple LLVM 7.0.2 (clang-700.1.81)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from newspaper import Article
>>> article = Article('https://www.washingtonpost.com/news/arts-and-entertainment/wp/2018/02/10/kim-cattrall-rejects-sarah-jessica-parkers-condolences-stop-exploiting-our-tragedy/')
>>> article.download()
>>> article.parse()
>>> article.authors
[]
>>> article.publish_date
datetime.datetime(2018, 2, 10, 0, 0)
>>> article = Article('https://www.washingtonpost.com/world/national-security/fbi-director-to-face-questions-on-security-clearances-and-agents-independence/2018/02/13/f3e4c706-105f-11e8-9570-29c9830535e5_story.html')
>>> article.download()
>>> article.parse()
>>> article.authors
[]
>>> article.publish_date
datetime.datetime(2018, 2, 13, 0, 0)
>>> article = Article('https://www.washingtonpost.com/news/fact-checker/wp/2018/02/13/whats-the-immigration-status-of-melania-trumps-parents/')
>>> article.download()
>>> article.parse()
>>> article.authors
[]
>>> 

I'm a software engineer so I can potentially fix this, if you could provide a few pointers on where to look in the code.

Thanks.

WalterGR commented 6 years ago

The Washington Post must have changed their HTML. Of the URLs I mention above, author information for the first 2 is now successfully? extracted, but it is still missing for the 3rd URL.

Trying just now, I get:

>>> article = Article('https://www.washingtonpost.com/news/arts-and-entertainment/wp/2018/02/10/kim-cattrall-rejects-sarah-jessica-parkers-condolences-stop-exploiting-our-tragedy/')
>>> article.download()
>>> article.parse()
>>> article.authors
['Amy B Wang Is A General Assignment Reporter Covering National', 'Breaking News For The Washington Post. She Joined The Post In After Seven Years With The Arizona Republic.']
>>> article.publish_date
datetime.datetime(2018, 2, 10, 0, 0)

>>> article = Article('https://www.washingtonpost.com/world/national-security/fbi-director-to-face-questions-on-security-clearances-and-agents-independence/2018/02/13/f3e4c706-105f-11e8-9570-29c9830535e5_story.html')
>>> article.download()
>>> article.parse()
>>> article.authors
['Ellen Nakashima Is A National Security Reporter For The Washington Post. She Covers Cybersecurity', 'Surveillance', 'Counterterrorism', 'Intelligence Issues. She Has Also Served As A Southeast Asia Correspondent', 'Covered The White House', 'Virginia State Politics. She Joined The Post In', 'Shane Harris Covers Intelligence', 'National Security For The Post.']
>>> article.publish_date
datetime.datetime(2018, 2, 13, 0, 0)

>>> article = Article('https://www.washingtonpost.com/news/fact-checker/wp/2018/02/13/whats-the-immigration-status-of-melania-trumps-parents/')
>>> article.download()
>>> article.parse()
>>> article.authors
[]
>>> article.publish_date
datetime.datetime(2018, 2, 13, 0, 0)
>>> 

The reason I put a question mark above - "successfully? extracted" - is that the authors property for the first 2 URLs is a bio rather than just an author's name. I don't know if that's expected behavior.

Cheers.
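
For anyone wondering how a bio could end up in that shape: the fragments above look as if a whole bio paragraph was split on commas and the word "and", stripped of digits, and title-cased. Below is a minimal sketch of that guess. `split_byline` is a hypothetical helper, not the library's code, and the input sentence is my reconstruction from the output above, not text taken from the page.

```python
import re


def split_byline(text):
    """Hypothetical helper: split byline-like text into author candidates."""
    # Drop digits (this would explain the missing year in "She Joined The Post In").
    text = re.sub(r"\d+", "", text)
    # Split on commas and on the standalone word "and".
    parts = re.split(r",|\band\b", text, flags=re.IGNORECASE)
    # Collapse whitespace and title-case every non-empty fragment.
    return [" ".join(part.split()).title() for part in parts if part.strip()]


# Assumed input, reconstructed from the fragments in the transcript above.
bio = "Shane Harris covers intelligence and national security for The Post."
print(split_byline(bio))
# ['Shane Harris Covers Intelligence', 'National Security For The Post.']
```

If the extractor does something along these lines, then the splitting is working as designed and the real problem is that the matched element holds the bio rather than the byline.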

Gabriel-Chen commented 6 years ago

I have the same issue when trying to extract authors from NYT. Some articles work properly, but many others still do not return any authors.

After reading through the code, I suspect the problem is around lines 138 and 139 of extractors.py. As far as I understand, the authors extractor loops over a set of keywords and matches elements in the HTML source whose attributes follow those patterns. Is that right? I also saw a "method 2" there, which seems more intuitive (to me, at least). Could you explain a little more so that I can maybe fix this?
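
To make the question concrete, here is a rough sketch of the keyword matching as I understand it, written with only the standard library. The `ATTRS` and `VALS` lists, the class name, and `extract_author_candidates` are my own assumptions for illustration; the real lists and logic in extractors.py may well differ.

```python
from html.parser import HTMLParser

# Assumed keyword lists, for illustration only; the library's lists may differ.
ATTRS = ("name", "rel", "itemprop", "class", "id")
VALS = ("author", "byline")

# Void elements never get a closing tag, so they must not change nesting depth.
VOID_TAGS = {"meta", "link", "br", "img", "input", "hr"}


class AuthorCandidateParser(HTMLParser):
    """Collects text from elements whose attributes match an author keyword."""

    def __init__(self):
        super().__init__()
        self.candidates = []
        self._depth = 0      # greater than 0 while inside a matching element
        self._buffer = []    # text pieces seen inside the matching element

    def _matches(self, attrs):
        # True if any watched attribute holds one of the keywords as a token.
        for key, value in attrs:
            if key in ATTRS and value:
                if any(token in VALS for token in value.lower().split()):
                    return True
        return False

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            # <meta name="author" content="..."> carries the name in `content`.
            if tag == "meta" and self._matches(attrs):
                content = dict(attrs).get("content")
                if content:
                    self.candidates.append(content.strip())
            return
        if self._depth:
            self._depth += 1
        elif self._matches(attrs):
            self._depth = 1
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._depth:
            return
        self._depth -= 1
        if self._depth == 0:
            text = " ".join("".join(self._buffer).split())
            if text:
                self.candidates.append(text)

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)


def extract_author_candidates(html):
    parser = AuthorCandidateParser()
    parser.feed(html)
    return parser.candidates
```

With this kind of matching, an element such as `<div class="author">` that wraps a whole bio paragraph is captured in full, which would fit the bio text reported earlier in this thread. A page that marks its byline with attribute values outside the keyword list yields nothing at all, which would fit the empty lists.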