Open AndyTheFactory opened 1 year ago
Comment by WalterGR Mon Apr 30 04:34:45 2018
The Washington Post must have changed their HTML. Of the URLs I mention above, author information is successfully? extracted for the first 2, but still not for the 3rd.
Trying just now, I get:
>>> article = Article('https://www.washingtonpost.com/news/arts-and-entertainment/wp/2018/02/10/kim-cattrall-rejects-sarah-jessica-parkers-condolences-stop-exploiting-our-tragedy/')
>>> article.download()
>>> article.parse()
>>> article.authors
['Amy B Wang Is A General Assignment Reporter Covering National', 'Breaking News For The Washington Post. She Joined The Post In After Seven Years With The Arizona Republic.']
>>> article.publish_date
datetime.datetime(2018, 2, 10, 0, 0)
>>> article = Article('https://www.washingtonpost.com/world/national-security/fbi-director-to-face-questions-on-security-clearances-and-agents-independence/2018/02/13/f3e4c706-105f-11e8-9570-29c9830535e5_story.html')
>>> article.download()
>>> article.parse()
>>> article.authors
['Ellen Nakashima Is A National Security Reporter For The Washington Post. She Covers Cybersecurity', 'Surveillance', 'Counterterrorism', 'Intelligence Issues. She Has Also Served As A Southeast Asia Correspondent', 'Covered The White House', 'Virginia State Politics. She Joined The Post In', 'Shane Harris Covers Intelligence', 'National Security For The Post.']
>>> article.publish_date
datetime.datetime(2018, 2, 13, 0, 0)
>>> article = Article('https://www.washingtonpost.com/news/fact-checker/wp/2018/02/13/whats-the-immigration-status-of-melania-trumps-parents/')
>>> article.download()
>>> article.parse()
>>> article.authors
[]
>>> article.publish_date
datetime.datetime(2018, 2, 13, 0, 0)
>>>
The reason I put a question mark above - "successfully? extracted" - is that the authors property for the first 2 URLs is a bio rather than just an author's name. I don't know if that's expected behavior.
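As a possible stopgap, here is a minimal sketch of post-processing the authors list to keep only the leading name from each bio-like entry. Everything here is hypothetical and not part of newspaper: the function names, the BIO_MARKERS pattern, and the NON_NAME_WORDS set are assumptions tuned to the two outputs above, and the heuristic will probably misfire on other sites.

```python
import re

# Hypothetical marker words that tend to start a bio sentence right after a name.
BIO_MARKERS = re.compile(r"\s+(?:Is|Was|Covers|Writes|Reports|Joined|Has)\b")
# Hypothetical words that rarely appear inside a personal name.
NON_NAME_WORDS = {"A", "An", "The", "For", "In", "Of", "She", "He", "They"}
# A name token: a single initial (optionally with a period) or a capitalized word.
NAME_TOKEN = re.compile(r"[A-Z]\.?|[A-Z][A-Za-z'\-]+")


def looks_like_name(text):
    """Return True if text is 2 to 4 capitalized tokens with no filler words."""
    tokens = text.split()
    if not 2 <= len(tokens) <= 4:
        return False
    return all(
        tok not in NON_NAME_WORDS and NAME_TOKEN.fullmatch(tok) is not None
        for tok in tokens
    )


def clean_authors(raw_authors):
    """Keep the part before the first bio marker, if it looks like a name."""
    names = []
    for entry in raw_authors:
        candidate = BIO_MARKERS.split(entry, maxsplit=1)[0].strip()
        if looks_like_name(candidate) and candidate not in names:
            names.append(candidate)
    return names
```

Calling clean_authors(article.authors) on the two results above should leave just the reporters' names. It does nothing for the 3rd URL, where the list is empty to begin with.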
Cheers.
Comment by Gabriel-Chen Tue Jul 10 22:26:09 2018
I have the same issue when trying to extract authors from NYT. Some of the articles work properly, but there are still a lot of articles that do not return any authors.
After reading through the code, I guess it has something to do with lines 138 and 139 in extractors.py. As far as I understand, the authors extractor loops through all these keywords in the HTML source code and matches elements that are written in these patterns, is that right? I also saw there is a method 2 there, which is more intuitive (for me?). Can you explain a little bit more so maybe I can fix this?
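To make my understanding concrete, here is a rough sketch of that kind of keyword matching, written with the standard library only. It is not the code in extractors.py: the ATTRS and VALS lists are my assumption of what those lines contain, the class and function names are made up, and the real extractor may match differently.

```python
from html.parser import HTMLParser

# Assumed attribute names and values; the real lists in extractors.py may differ.
ATTRS = ("name", "rel", "itemprop", "class", "id")
VALS = ("author", "byline", "dc.creator", "byl")
# Elements without a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"meta", "br", "img", "input", "link", "hr"}


class AuthorCandidateFinder(HTMLParser):
    """Collect text from elements whose attribute value exactly equals a keyword."""

    def __init__(self):
        super().__init__()
        self.candidates = []
        self._depth = 0   # greater than 0 while inside a matching element
        self._text = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        matched = any(attr_map.get(a) in VALS for a in ATTRS)
        if tag in VOID_TAGS:
            # <meta name="author" content="..."> carries the name in "content".
            if tag == "meta" and matched and attr_map.get("content"):
                self.candidates.append(attr_map["content"].strip())
            return
        if self._depth or matched:
            self._depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._depth:
            return
        self._depth -= 1
        if self._depth == 0:
            # Collapse whitespace in the text gathered from the matching element.
            text = " ".join("".join(self._text).split())
            if text:
                self.candidates.append(text)
            self._text = []

    def handle_data(self, data):
        if self._depth:
            self._text.append(data)


def find_author_candidates(html):
    finder = AuthorCandidateFinder()
    finder.feed(html)
    finder.close()
    return finder.candidates
```

With exact matching like this, an element such as class="pb-author" or class="byline name" would be skipped, which might explain why some pages return no authors, and a matching element that wraps a whole bio would return the bio text rather than just the name.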
Issue by WalterGR Tue Feb 13 17:52:01 2018 Originally opened as https://github.com/codelucas/newspaper/issues/519
I've tried these 3 URLs:
Transcript:
I'm a software engineer so I can potentially fix this, if you could provide a few pointers on where to look in the code.
Thanks.