ERROR:newsplease.crawler.commoncrawl_extractor:Document is empty
2018-06-16 14:23:49 [newsplease.crawler.commoncrawl_extractor] ERROR: Document is empty
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/ubuntu/anaconda3/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/ubuntu/news-please/newsplease/examples/commoncrawl_hai.py", line 160, in <module>
    continue_process=True)
  File "/home/ubuntu/news-please/newsplease/crawler/commoncrawl_crawler.py", line 227, in crawl_from_commoncrawl
    log_pathname_fully_extracted_warcs=__log_pathname_fully_extracted_warcs)
  File "/home/ubuntu/news-please/newsplease/crawler/commoncrawl_crawler.py", line 147, in __start_commoncrawl_extractor
    log_pathname_fully_extracted_warcs=__log_pathname_fully_extracted_warcs)
  File "/home/ubuntu/news-please/newsplease/crawler/commoncrawl_extractor.py", line 334, in extract_from_commoncrawl
    self.__run()
  File "/home/ubuntu/news-please/newsplease/crawler/commoncrawl_extractor.py", line 292, in __run
    self.__process_warc_gz_file(local_path_name)
  File "/home/ubuntu/news-please/newsplease/crawler/commoncrawl_extractor.py", line 247, in __process_warc_gz_file
    self.__logger.info('article pass (%s; %s; %s)', article.source_domain, article.date_publish,
AttributeError: 'NoneType' object has no attribute 'source_domain'
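
The last frame fails because `article` is `None` at the point where the log call reads `article.source_domain`, presumably because the preceding "Document is empty" error left nothing to extract. Below is a minimal sketch of a guard, assuming the extraction step can yield `None` for an empty document. `format_article_pass` and the `SimpleNamespace` stand-in are hypothetical names used only for illustration and are not part of news-please; the third format argument is cut off in the traceback, so it is left out here.

```python
import logging
from types import SimpleNamespace

logger = logging.getLogger("commoncrawl_guard_example")


def format_article_pass(article):
    """Return a log message for an article, or None if there is no article.

    Hypothetical helper: it only illustrates checking for None before
    reading attributes such as source_domain.
    """
    if article is None:
        # Nothing was extracted (e.g. an empty document), so skip the record.
        logger.warning("skipping record: no article could be extracted")
        return None
    return "article pass (%s; %s)" % (article.source_domain, article.date_publish)


# Stand-in object carrying the two attributes named in the traceback.
example = SimpleNamespace(source_domain="example.com", date_publish=None)
print(format_article_pass(example))
print(format_article_pass(None))
```

Checking for `None` before the log call lets the loop skip the empty record and continue with the next one instead of aborting the whole WARC file.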