LuChang-CS / news-crawler

A news crawler for BBC News, Reuters and New York Times.

unable to crawl news on Reuters and NYT #13

Closed chz816 closed 3 years ago

chz816 commented 3 years ago

Hi, thank you for your great contribution to this interesting project! I noticed that I am not able to crawl news from Reuters and NYT (BBC works great for me).

For NYT:

USER$ python nytimes_crawler.py settings/nytimes.cfg
fetching new years links
Traceback (most recent call last):
  File "nytimes_crawler.py", line 65, in <module>
    nytime_article_fetcher.fetch()
  File "/Users/rachelzheng/Documents/news-crawler/article/darticle.py", line 132, in fetch
    api_url, date = self.download_link_fetcher.next()
  File "/Users/rachelzheng/Documents/news-crawler/link/nytimes_link.py", line 44, in next
    api_url = self._next_api(self.base_api_url, self.current_date)
  File "/Users/rachelzheng/Documents/news-crawler/link/nytimes_link.py", line 39, in _next_api
    return self.month_links[current_date.month - 1]
IndexError: list index out of range

For Reuters:

USER$ python reuters_crawler.py settings/reuters.cfg
2016-12-31 1 in 1 dates                  
fetching download links: https://uk.reuters.com/resources/archive/uk/20161231.html
api https://uk.reuters.com/resources/archive/uk/20161231.html  failed
2016-12-31 date 1 finished

LuChang-CS commented 3 years ago

Hi Rachel, thanks for your interest. NYT has updated its website structure, so the previous version of the code no longer works. I have made two commits to adapt to their latest website. Please feel free to re-clone the repository.
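
For anyone hitting the same IndexError on an older checkout: the traceback indicates that month_links held fewer entries than the month being requested, which is consistent with the page structure having changed. Below is a minimal sketch of a defensive lookup, assuming month_links is a plain list with one link per month. The month_link helper and the sample data are hypothetical and not part of the repository.

```python
from datetime import date
from typing import List, Optional


def month_link(month_links: List[str], current_date: date) -> Optional[str]:
    """Return the link for current_date's month, or None if it is missing."""
    index = current_date.month - 1  # same indexing as in the traceback
    if index >= len(month_links):
        # Fewer links than expected were collected, e.g. because the
        # page markup changed and the parser matched nothing.
        return None
    return month_links[index]


# Hypothetical data: an empty list mimics a page that yielded no links.
print(month_link([], date(2016, 12, 31)))

# Hypothetical data: twelve placeholder links, one per month.
links = [f"link-{m:02d}" for m in range(1, 13)]
print(month_link(links, date(2016, 12, 31)))
```

Returning None lets the caller report that no links were found instead of failing with an unexplained IndexError.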

As for Reuters, they have disabled their original archive website. The new website https://www.reuters.com/news/archive has only a limited number of historical articles, so I have not updated the Reuters code.

chz816 commented 3 years ago

Thank you for your contribution! The NYT problem is fixed.