LuChang-CS / news-crawler

A news crawler for BBC News, Reuters and New York Times.
IndexError: string index out of range #9

Closed penguinwang96825 closed 3 years ago

penguinwang96825 commented 3 years ago

I get "IndexError: string index out of range" when trying to fetch BBC news.

Settings:

  1. start_date=2015-07-01
  2. end_date=2021-05-16
  3. path=./dataset/bbc/

Output:

2015-07-01 1 in 2146 dates
fetching download links...
Traceback (most recent call last):
  File "bbc_crawler.py", line 15, in <module>
    bbc_article_fetcher.fetch()
  File "C:\Users\Yang\Desktop\Dissertation\crawler\news-crawler\article\darticle.py", line 139, in fetch
    links = self.download_link_fetcher.fetch(api_url)
  File "C:\Users\Yang\Desktop\Dissertation\crawler\news-crawler\link\dlink.py", line 59, in fetch
    links = self._html_to_links(html)
  File "C:\Users\Yang\Desktop\Dissertation\crawler\news-crawler\link\bbc_link.py", line 35, in _html_to_links
    link = self._format_link(element['href'])
  File "C:\Users\Yang\Desktop\Dissertation\crawler\news-crawler\link\dlink.py", line 22, in _format_link
    if link[-1] == '/':
IndexError: string index out of range

LuChang-CS commented 3 years ago

I think this is because the BBC News archive site has changed its page structure. I have updated bbc_link.py to fetch BBC news links accordingly:

elements = soup.table.find_all('a', class_='title-link')
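
The traceback above fails at link[-1], which raises IndexError whenever link is an empty string, so presumably the old selector matched an anchor whose href was empty. As a minimal sketch of a defensive check (the helper name strip_trailing_slash is hypothetical, and it assumes the check in _format_link is only meant to trim a trailing slash, since the rest of that function is not shown in this thread):

```python
def strip_trailing_slash(link: str) -> str:
    """Hypothetical helper: trim one trailing '/' without indexing into the string."""
    # str.endswith is safe on an empty string, unlike link[-1].
    if link.endswith('/'):
        return link[:-1]
    return link


# Indexing an empty string reproduces the error type from the traceback.
try:
    ''[-1]
except IndexError as exc:
    print(type(exc).__name__, exc)

# The guarded version returns the empty string unchanged instead of raising.
print(repr(strip_trailing_slash('')))
# The path below is invented for illustration only.
print(strip_trailing_slash('/news/example-path/'))
```

This only avoids the crash; whether an empty href should be skipped entirely is a separate decision for the caller.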

You can clone this project again. I also added a filter for video_and_audio to the BBC links, because links containing it returned a 404 response code. Please make sure this is what you expect. If not, just modify this file in your local repo.
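
For anyone adapting the filter locally, here is a minimal sketch of the idea: drop empty hrefs and any link containing video_and_audio before the links are formatted. The function name filter_bbc_links and the sample paths are made up for illustration; this is not the code in bbc_link.py.

```python
from typing import Iterable, List


def filter_bbc_links(hrefs: Iterable[str], blocked: str = 'video_and_audio') -> List[str]:
    """Hypothetical filter: keep non-empty hrefs that do not contain `blocked`."""
    # `href and ...` skips empty strings, which would otherwise break link[-1].
    return [href for href in hrefs if href and blocked not in href]


# Sample paths are invented for illustration only.
sample = ['/news/world-1', '', '/news/video_and_audio/headlines/2']
print(filter_bbc_links(sample))
```

Changing the blocked argument, or removing the substring check, is one way to get the other behaviour if you do want those links.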