fhamborg / news-please

news-please - an integrated web crawler and information extractor for news that just works
Apache License 2.0

Error running commoncrawl.py #26

Closed IclickButtons closed 7 years ago

IclickButtons commented 7 years ago

After running commoncrawl.py for about 15 minutes, it throws the following error:

DEBUG:urllib3.connectionpool:Starting new HTTP connection (1): ads.civitasmedia.com
DEBUG:urllib3.connectionpool:http://ads.civitasmedia.com:80 "GET /w/1.0/ai?auid=465268&cs=517002e209b24&cb=18879 HTTP/1.1" 302 0
DEBUG:urllib3.connectionpool:http://ads.civitasmedia.com:80 "GET /w/1.0/ai?cc=1&auid=465268&cs=517002e209b24&cb=18879 HTTP/1.1" 302 0
DEBUG:urllib3.connectionpool:Starting new HTTP connection (1): u.openx.net
DEBUG:urllib3.connectionpool:http://u.openx.net:80 "GET /w/1.0/sc?r=http%3A%2F%2Fads.civitasmedia.com%2Fw%2F1.0%2Fai%3Fcc%3D1%26auid%3D465268%26cs%3D517002e209b24%26cb%3D18879 HTTP/1.1" 302 0
DEBUG:urllib3.connectionpool:http://u.openx.net:80 "GET /w/1.0/sc?cc=1&r=http%3A%2F%2Fads.civitasmedia.com%2Fw%2F1.0%2Fai%3Fcc%3D1%26auid%3D465268%26cs%3D517002e209b24%26cb%3D18879 HTTP/1.1" 302 0
DEBUG:urllib3.connectionpool:http://ads.civitasmedia.com:80 "GET /w/1.0/ai?mi=1bbd358a-0aa6-45e6-927b-7cc5cdbeab95&ma=1497001411&mr=1498211012&mn=1&mc=1&cc=1&auid=465268&cs=517002e209b24&cb=18879 HTTP/1.1" 200 43
DEBUG:urllib3.connectionpool:Starting new HTTP connection (1): 11jo8z152kaa38lham19pzzv.wpengine.netdna-cdn.com
DEBUG:urllib3.connectionpool:http://11jo8z152kaa38lham19pzzv.wpengine.netdna-cdn.com:80 "GET /images/civitasreverse.png HTTP/1.1" 200 16957
INFO:__main__:article discard (sunburynews.com; None; Sunbury News)
INFO:__main__:statistics
INFO:__main__:pass = 0, discard = 160, total = 160
INFO:__main__:extraction from current WARC file started 10 minutes, 41 seconds ago; 4.012312 s/article
INFO:__main__:article discard (istoe.com.br; 2016-08-08 09:53:00; Olimp\xc3\xadada tem quebra de sete recordes mundiais)
INFO:__main__:article discard (brejo.com; 2013-12-22 00:00:00; FOTOS: Col\xc3\xa9gio da Luz realiza a 10\xc2\xaa edi\xc3\xa7\xc3\xa3o do Auto do Natal Luz)
Traceback (most recent call last):
  File "commoncrawl.py", line 271, in <module>
    common_crawl.run()
  File "commoncrawl.py", line 237, in run
    self.__process_warc_gz_file(local_path_name)
  File "commoncrawl.py", line 199, in __process_warc_gz_file
    article = NewsPlease.from_warc(record)
  File "/home/andreas/Envs/newsi/lib/python3.5/site-packages/newsplease/__init__.py", line 34, in from_warc
    article = NewsPlease.from_html(html, url)
  File "/home/andreas/Envs/newsi/lib/python3.5/site-packages/newsplease/__init__.py", line 68, in from_html
    item = extractor.extract(item)
  File "/home/andreas/Envs/newsi/lib/python3.5/site-packages/newsplease/pipeline/extractor/article_extractor.py", line 53, in extract
    article_candidates.append(extractor.extract(item))
  File "/home/andreas/Envs/newsi/lib/python3.5/site-packages/newsplease/pipeline/extractor/extractors/newspaper_extractor.py", line 30, in extract
    article.parse()
  File "/home/andreas/Envs/newsi/lib/python3.5/site-packages/newspaper/article.py", line 219, in parse
    meta_data = self.extractor.get_meta_data(self.clean_doc)
  File "/home/andreas/Envs/newsi/lib/python3.5/site-packages/newspaper/extractors.py", line 514, in get_meta_data
    ref[part] = value
TypeError: 'int' object does not support item assignment
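
The last frame suggests that `ref` was an int when `ref[part] = value` ran. The following is a minimal sketch, not newspaper's actual code, that reproduces the same error message under the assumption that a digit-only meta value was stored as an int where a dict was expected. The names `assign_part` and `meta` are hypothetical.

```python
def assign_part(ref, part, value):
    """Try ref[part] = value; return the error message, or None on success."""
    try:
        ref[part] = value
    except TypeError as exc:
        return str(exc)
    return None


# Hypothetical meta data: "og" holds an int instead of a nested dict.
meta = {"og": 1200}

# Assigning into the int fails the same way as in the traceback.
print(assign_part(meta["og"], "width", 600))

# Assigning into a dict succeeds and returns None.
print(assign_part({}, "width", 600))
```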

fhamborg commented 7 years ago

I opened an issue in the newspaper repo: https://github.com/codelucas/newspaper/issues/379

I will add a mode to commoncrawl.py that ignores any error that occurs while processing the WARC files.

fhamborg commented 7 years ago

Commit 3c81c4392032f1715c2c03f98e3a1f3d95601107 adds an option that lets you continue after any error: just set continue_after_error = True in commoncrawl.py. This should help until the issue is resolved in newspaper.
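
For readers who want the idea without reading the commit, here is a rough sketch of how a continue-after-error mode could work. It is not the code from the commit. `process_records` and the `extract` callable are hypothetical stand-ins; only the flag name `continue_after_error` comes from this thread.

```python
import logging

logger = logging.getLogger(__name__)


def process_records(records, extract, continue_after_error=True):
    """Run extract on every record.

    If continue_after_error is True, a failing record is logged and
    skipped; otherwise the exception is re-raised and processing stops.
    Returns the extracted articles and the number of skipped records.
    """
    articles = []
    failed = 0
    for record in records:
        try:
            articles.append(extract(record))
        except Exception:
            if not continue_after_error:
                raise
            failed += 1
            # Log the full traceback so the failure can still be investigated.
            logger.exception("skipping record after extraction error")
    return articles, failed
```

With the flag set, a single record that triggers the TypeError above would be skipped while the rest of the WARC file is still processed.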

codelucas commented 7 years ago

Thanks for creating this issue - I'll investigate on newspaper's end