fhamborg / news-please

news-please - an integrated web crawler and information extractor for news that just works
Apache License 2.0

AttributeError: 'NoneType' object has no attribute 'source_domain' in CommonCrawl example #63

Closed: mauhai closed this issue 6 years ago

mauhai commented 6 years ago
ERROR:newsplease.crawler.commoncrawl_extractor:Document is empty
2018-06-16 14:23:49 [newsplease.crawler.commoncrawl_extractor] ERROR: Document is empty
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/ubuntu/anaconda3/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/ubuntu/news-please/newsplease/examples/commoncrawl_hai.py", line 160, in <module>
    continue_process=True)
  File "/home/ubuntu/news-please/newsplease/crawler/commoncrawl_crawler.py", line 227, in crawl_from_commoncrawl
    log_pathname_fully_extracted_warcs=__log_pathname_fully_extracted_warcs)
  File "/home/ubuntu/news-please/newsplease/crawler/commoncrawl_crawler.py", line 147, in __start_commoncrawl_extractor
    log_pathname_fully_extracted_warcs=__log_pathname_fully_extracted_warcs)
  File "/home/ubuntu/news-please/newsplease/crawler/commoncrawl_extractor.py", line 334, in extract_from_commoncrawl
    self.__run()
  File "/home/ubuntu/news-please/newsplease/crawler/commoncrawl_extractor.py", line 292, in __run
    self.__process_warc_gz_file(local_path_name)
  File "/home/ubuntu/news-please/newsplease/crawler/commoncrawl_extractor.py", line 247, in __process_warc_gz_file
    self.__logger.info('article pass (%s; %s; %s)', article.source_domain, article.date_publish,
AttributeError: 'NoneType' object has no attribute 'source_domain'

fhamborg commented 6 years ago

Closed, since the required information for issues is missing (https://github.com/fhamborg/news-please/blob/master/README.md#issues). You may reopen the issue if you provide the required info. Thanks.
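
For readers who hit the same traceback: the log suggests the extractor reported "Document is empty" and the next logging call then dereferenced an `article` that was `None`. Below is a minimal sketch of a guard for that pattern. It is not the news-please implementation. `Article`, `extract_article`, and `process_records` are hypothetical stand-ins, and the third logged field (`title`) is an assumption, since the original log call is cut off in the traceback.

```python
import logging
from dataclasses import dataclass
from typing import List, Optional

logger = logging.getLogger("commoncrawl_sketch")


@dataclass
class Article:
    # Hypothetical container; only the first two field names appear in the traceback.
    source_domain: str
    date_publish: Optional[str]
    title: Optional[str]


def extract_article(html: str) -> Optional[Article]:
    """Hypothetical stand-in for an extractor that returns None for an empty document."""
    if not html.strip():
        logger.error("Document is empty")
        return None
    return Article(source_domain="example.com", date_publish=None, title=html.strip()[:40])


def process_records(records: List[str]) -> List[Article]:
    """Collect extracted articles, skipping records whose extraction returned None."""
    articles = []
    for html in records:
        article = extract_article(html)
        if article is None:
            # Without this check, article.source_domain raises
            # AttributeError: 'NoneType' object has no attribute 'source_domain'.
            logger.warning("skipping record: extraction returned no article")
            continue
        logger.info("article pass (%s; %s; %s)",
                    article.source_domain, article.date_publish, article.title)
        articles.append(article)
    return articles
```

Under these assumptions, calling `process_records(["", "<html>Hello</html>"])` skips the empty record instead of aborting the whole run.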