AndyTheFactory / newspaper4k

📰 Newspaper4k a fork of the beloved Newspaper3k. Extraction of articles, titles, and metadata from news websites.
MIT License
499 stars, 51 forks

Library is unable to extract authors from washingtonpost.com articles #174

Open AndyTheFactory opened 1 year ago

AndyTheFactory commented 1 year ago

Issue by WalterGR Tue Feb 13 17:52:01 2018 Originally opened as https://github.com/codelucas/newspaper/issues/519


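I've tried these 3 URLs:

https://www.washingtonpost.com/news/arts-and-entertainment/wp/2018/02/10/kim-cattrall-rejects-sarah-jessica-parkers-condolences-stop-exploiting-our-tragedy/
https://www.washingtonpost.com/world/national-security/fbi-director-to-face-questions-on-security-clearances-and-agents-independence/2018/02/13/f3e4c706-105f-11e8-9570-29c9830535e5_story.html
https://www.washingtonpost.com/news/fact-checker/wp/2018/02/13/whats-the-immigration-status-of-melania-trumps-parents/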

Transcript:

Python 3.6.4 (default, Feb 12 2018, 10:12:43) 
[GCC 4.2.1 Compatible Apple LLVM 7.0.2 (clang-700.1.81)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from newspaper import Article
>>> article = Article('https://www.washingtonpost.com/news/arts-and-entertainment/wp/2018/02/10/kim-cattrall-rejects-sarah-jessica-parkers-condolences-stop-exploiting-our-tragedy/')
>>> article.download()
>>> article.parse()
>>> article.authors
[]
>>> article.publish_date
datetime.datetime(2018, 2, 10, 0, 0)
>>> article = Article('https://www.washingtonpost.com/world/national-security/fbi-director-to-face-questions-on-security-clearances-and-agents-independence/2018/02/13/f3e4c706-105f-11e8-9570-29c9830535e5_story.html')
>>> article.download()
>>> article.parse()
>>> article.authors
[]
>>> article.publish_date
datetime.datetime(2018, 2, 13, 0, 0)
>>> article = Article('https://www.washingtonpost.com/news/fact-checker/wp/2018/02/13/whats-the-immigration-status-of-melania-trumps-parents/')
>>> article.download()
>>> article.parse()
>>> article.authors
[]
>>> 

I'm a software engineer so I can potentially fix this, if you could provide a few pointers on where to look in the code.

Thanks.

AndyTheFactory commented 1 year ago

Comment by WalterGR Mon Apr 30 04:34:45 2018


The Washington Post must have changed their HTML. Of the URLs I mentioned above, author information for the first 2 is now successfully? extracted, but still not for the 3rd URL.

Trying just now, I get:

>>> article = Article('https://www.washingtonpost.com/news/arts-and-entertainment/wp/2018/02/10/kim-cattrall-rejects-sarah-jessica-parkers-condolences-stop-exploiting-our-tragedy/')
>>> article.download()
>>> article.parse()
>>> article.authors
['Amy B Wang Is A General Assignment Reporter Covering National', 'Breaking News For The Washington Post. She Joined The Post In After Seven Years With The Arizona Republic.']
>>> article.publish_date
datetime.datetime(2018, 2, 10, 0, 0)

>>> article = Article('https://www.washingtonpost.com/world/national-security/fbi-director-to-face-questions-on-security-clearances-and-agents-independence/2018/02/13/f3e4c706-105f-11e8-9570-29c9830535e5_story.html')
>>> article.download()
>>> article.parse()
>>> article.authors
['Ellen Nakashima Is A National Security Reporter For The Washington Post. She Covers Cybersecurity', 'Surveillance', 'Counterterrorism', 'Intelligence Issues. She Has Also Served As A Southeast Asia Correspondent', 'Covered The White House', 'Virginia State Politics. She Joined The Post In', 'Shane Harris Covers Intelligence', 'National Security For The Post.']
>>> article.publish_date
datetime.datetime(2018, 2, 13, 0, 0)

>>> article = Article('https://www.washingtonpost.com/news/fact-checker/wp/2018/02/13/whats-the-immigration-status-of-melania-trumps-parents/')
>>> article.download()
>>> article.parse()
>>> article.authors
[]
>>> article.publish_date
datetime.datetime(2018, 2, 13, 0, 0)
>>> 

The reason I put a question mark above ("successfully? extracted") is that the authors property for the first 2 URLs contains fragments of the authors' bios rather than just their names. I don't know if that's expected behavior.

Cheers.
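
As a possible stopgap for the bio-like values shown above, the list in article.authors could be post-processed on the caller's side. The sketch below is only a heuristic under stated assumptions: clean_authors and name_from_candidate are hypothetical helpers (not part of newspaper), and the verb list, stopword list, and the 2 to 4 word limit are guesses tuned to the two outputs in this comment, not a general solution.

```python
import re

# Assumption: a bio sentence starts with the name, followed by one of these verbs.
BIO_VERBS = re.compile(r"\s+(?:is|covers|writes|reports)\b", re.IGNORECASE)
# Assumption: real names do not contain these common words.
STOPWORDS = {"the", "a", "an", "for", "in", "of", "and", "she", "he"}


def name_from_candidate(candidate):
    """Return the leading name of a bio-like string, or None if none is found."""
    # Keep only the text before the first bio verb ("Amy B Wang Is A ..." -> "Amy B Wang").
    head = BIO_VERBS.split(candidate, maxsplit=1)[0].strip()
    words = head.split()
    if not 2 <= len(words) <= 4:
        return None
    for word in words:
        bare = word.replace("-", "").replace("'", "")
        if word.lower() in STOPWORDS or not bare.isalpha():
            return None
    return head


def clean_authors(raw_authors):
    """Hypothetical post-processing step for article.authors; keeps input order."""
    cleaned = []
    for candidate in raw_authors:
        name = name_from_candidate(candidate)
        if name and name not in cleaned:
            cleaned.append(name)
    return cleaned
```

Applied to the two non-empty lists above, this keeps 'Amy B Wang' for the first URL and 'Ellen Nakashima' and 'Shane Harris' for the second. It does nothing for the third URL, where the list is empty to begin with.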

AndyTheFactory commented 1 year ago

Comment by Gabriel-Chen Tue Jul 10 22:26:09 2018


I have the same issue when trying to extract authors from NYT. Some of the articles work properly, but there are still a lot of articles that do not return any authors.

After reading through the code, I guess it has something to do with lines 138 and 139 in extractors.py. As far as I understand, the authors extractor loops through all of these keywords in the HTML source and matches elements that are written in one of these patterns, is that right? I also saw that there is a "method 2" there, which is more intuitive (for me?). Can you explain a little bit more so that maybe I can fix this?
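
To make the question concrete, here is a minimal sketch of the keyword-matching approach as I understand it from the description above. It is not the code in extractors.py: the ATTRS and VALS lists, the substring matching, and the names BylineFinder, split_byline and extract_authors are all assumptions for illustration, and it uses only the standard library.

```python
import re
from html.parser import HTMLParser

# Assumed attribute names and keywords; the real lists may differ.
ATTRS = ("name", "rel", "itemprop", "class", "id")
VALS = ("author", "byline", "dc.creator", "byl")
VOID_TAGS = {"meta", "br", "img", "hr", "input", "link"}


def _matches(attrs):
    """True if any inspected attribute contains one of the keywords."""
    return any(val in (attrs.get(attr) or "").lower()
               for attr in ATTRS for val in VALS)


class BylineFinder(HTMLParser):
    """Collects the text of every element whose attributes match a keyword."""

    def __init__(self):
        super().__init__()
        self.candidates = []
        self._depth = 0      # > 0 while inside a matching element
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            # <meta name="author" content="..."> carries the value in "content".
            if _matches(attrs) and attrs.get("content"):
                self.candidates.append(attrs["content"])
            return
        if tag in VOID_TAGS:
            return
        if self._depth:
            self._depth += 1
        elif _matches(attrs):
            self._depth = 1
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._depth:
            return
        self._depth -= 1
        if self._depth == 0:
            self.candidates.append(" ".join("".join(self._buffer).split()))

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)


def split_byline(text):
    """Drop a leading 'By' and split on commas, '&' and the word 'and'."""
    text = re.sub(r"^\s*by\b[:\s]*", "", text, flags=re.IGNORECASE)
    parts = re.split(r"\s*(?:,|&|\band\b)\s*", text)
    return [part.strip() for part in parts if part.strip()]


def extract_authors(html):
    finder = BylineFinder()
    finder.feed(html)
    finder.close()
    authors = []
    for candidate in finder.candidates:
        for name in split_byline(candidate):
            if name not in authors:
                authors.append(name)
    return authors
```

With this sketch, '<div class="byline">By Jane Doe and John Roe</div>' gives ['Jane Doe', 'John Roe'], and a page with no matching attribute gives an empty list. An element such as '<p class="author-bio">' also matches the "author" keyword, so its whole bio sentence gets split on commas and "and". That would be consistent with the bio fragments reported earlier in this thread, though I have not confirmed that this is what actually happens in the library.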