ArchiveTeam / grab-site

The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

Crash on EOFError: Compressed file ended before the end-of-stream marker was reached #151

Open ivan opened 5 years ago

ivan commented 5 years ago
grab-site https://fractalforums.org/ --igsets=forums

results in a crash while scraping the sitemap:

https://fractalforums.org/index.php?action=sitemap;b=32 ...
https://fractalforums.org/index.php?action=sitemap;b=67 ...
https://fractalforums.org/index.php?action=sitemap;b=31 ...
https://fractalforums.org/index.php?action=sitemap;b=18 ...
https://fractalforums.org/index.php?action=sitemap;b=20 ...
https://fractalforums.org/index.php?action=sitemap;b=41 ...
https://fractalforums.org/index.php?action=sitemap;b=65 ...
https://fractalforums.org/index.php?action=sitemap;b=10 ...
https://fractalforums.org/index.php?action=sitemap;b=13 ...
https://fractalforums.org/index.php?action=sitemap;b=46 ...
https://fractalforums.org/index.php?action=sitemap;b=59 ...
https://fractalforums.org/index.php?action=sitemap;b=56 ...
https://fractalforums.org/index.php?action=sitemap;b=2 ...
https://fractalforums.org/index.php?action=sitemap;b=60 ...
https://fractalforums.org/index.php?action=sitemap;b=42 ...
https://fractalforums.org/index.php?action=sitemap;b=63 ...
https://fractalforums.org/index.php?action=sitemap;b=64 ...
https://fractalforums.org/index.php?action=sitemap;b=48 ...
https://fractalforums.org/index.php?action=sitemap;b=4 ...
https://fractalforums.org/index.php?action=sitemap;b=53 ...
https://fractalforums.org/index.php?action=sitemap;b=54 ...
https://fractalforums.org/index.php?action=sitemap;b=21 ...
https://fractalforums.org/index.php?action=sitemap;b=25 ...
https://fractalforums.org/index.php?action=sitemap;b=61 ...
https://fractalforums.org/index.php?action=sitemap;b=29 ...
https://fractalforums.org/index.php?action=sitemap;b=19 ...
https://fractalforums.org/index.php?action=sitemap;b=73 ...
https://fractalforums.org/index.php?action=sitemap;b=28 ...
https://fractalforums.org/index.php?action=sitemap;b=78 ...
https://fractalforums.org/index.php?action=sitemap;b=52 ...
https://fractalforums.org/index.php?action=sitemap;xml ...
ERROR Fatal exception.
Traceback (most recent call last):
  File "/nix/store/2hh33a0dfn4isr3a2bw30zfrhs4diq1a-python3.7-ludios_wpull-3.0.7/lib/python3.7/site-packages/wpull/application/app.py", line 157, in run
    yield from pipeline.process()
  File "/nix/store/2hh33a0dfn4isr3a2bw30zfrhs4diq1a-python3.7-ludios_wpull-3.0.7/lib/python3.7/site-packages/wpull/pipeline/pipeline.py", line 194, in process
    yield from self._process_one_worker()
  File "/nix/store/2hh33a0dfn4isr3a2bw30zfrhs4diq1a-python3.7-ludios_wpull-3.0.7/lib/python3.7/site-packages/wpull/pipeline/pipeline.py", line 215, in _process_one_worker
    task.result()
  File "/nix/store/2hh33a0dfn4isr3a2bw30zfrhs4diq1a-python3.7-ludios_wpull-3.0.7/lib/python3.7/site-packages/wpull/pipeline/pipeline.py", line 119, in process
    item = yield from self.process_one(_worker_id=worker_id)
  File "/nix/store/2hh33a0dfn4isr3a2bw30zfrhs4diq1a-python3.7-ludios_wpull-3.0.7/lib/python3.7/site-packages/wpull/pipeline/pipeline.py", line 103, in process_one
    yield from task.process(item)
  File "/nix/store/2hh33a0dfn4isr3a2bw30zfrhs4diq1a-python3.7-ludios_wpull-3.0.7/lib/python3.7/site-packages/wpull/application/tasks/download.py", line 421, in process
    yield from session.app_session.factory['Processor'].process(session)
  File "/nix/store/2hh33a0dfn4isr3a2bw30zfrhs4diq1a-python3.7-ludios_wpull-3.0.7/lib/python3.7/site-packages/wpull/processor/delegate.py", line 29, in process
    return (yield from processor.process(item_session))
  File "/nix/store/2hh33a0dfn4isr3a2bw30zfrhs4diq1a-python3.7-ludios_wpull-3.0.7/lib/python3.7/site-packages/wpull/processor/web.py", line 91, in process
    return (yield from session.process())
  File "/nix/store/2hh33a0dfn4isr3a2bw30zfrhs4diq1a-python3.7-ludios_wpull-3.0.7/lib/python3.7/site-packages/wpull/processor/web.py", line 185, in process
    yield from self._process_loop()
  File "/nix/store/2hh33a0dfn4isr3a2bw30zfrhs4diq1a-python3.7-ludios_wpull-3.0.7/lib/python3.7/site-packages/wpull/processor/web.py", line 244, in _process_loop
    exit_early, wait_time = yield from self._fetch_one(cast(Request, self._item_session.request))
  File "/nix/store/2hh33a0dfn4isr3a2bw30zfrhs4diq1a-python3.7-ludios_wpull-3.0.7/lib/python3.7/site-packages/wpull/processor/web.py", line 308, in _fetch_one
    action = self._handle_response(request, response)
  File "/nix/store/2hh33a0dfn4isr3a2bw30zfrhs4diq1a-python3.7-ludios_wpull-3.0.7/lib/python3.7/site-packages/wpull/processor/web.py", line 423, in _handle_response
    self._processing_rule.scrape_document(self._item_session)
  File "/nix/store/mp2zsaazz2l1mrvkp7pzygn3xwbdr96s-grab-site-2.1.15/lib/python3.7/site-packages/libgrabsite/wpull_tweaks.py", line 55, in scrape_document
    super().scrape_document(item_session)
  File "/nix/store/2hh33a0dfn4isr3a2bw30zfrhs4diq1a-python3.7-ludios_wpull-3.0.7/lib/python3.7/site-packages/wpull/processor/rule.py", line 527, in scrape_document
    item_session.url_record.link_type
  File "/nix/store/2hh33a0dfn4isr3a2bw30zfrhs4diq1a-python3.7-ludios_wpull-3.0.7/lib/python3.7/site-packages/wpull/scraper/base.py", line 186, in scrape_info
    scrape_result = scraper.scrape(request, response, link_type)
  File "/nix/store/2hh33a0dfn4isr3a2bw30zfrhs4diq1a-python3.7-ludios_wpull-3.0.7/lib/python3.7/site-packages/wpull/scraper/sitemap.py", line 37, in scrape
    for link in link_iter:
  File "/nix/store/2hh33a0dfn4isr3a2bw30zfrhs4diq1a-python3.7-ludios_wpull-3.0.7/lib/python3.7/site-packages/wpull/scraper/base.py", line 150, in iter_processed_links
    for link in self.iter_links(file, encoding):
  File "/nix/store/2hh33a0dfn4isr3a2bw30zfrhs4diq1a-python3.7-ludios_wpull-3.0.7/lib/python3.7/site-packages/wpull/document/sitemap.py", line 69, in iter_links
    for html_obj in self._html_parser.parse(file, encoding):
  File "/nix/store/2hh33a0dfn4isr3a2bw30zfrhs4diq1a-python3.7-ludios_wpull-3.0.7/lib/python3.7/site-packages/wpull/document/htmlparse/lxml_.py", line 25, in parse
    parser_type=parser_type):
  File "/nix/store/2hh33a0dfn4isr3a2bw30zfrhs4diq1a-python3.7-ludios_wpull-3.0.7/lib/python3.7/site-packages/wpull/document/htmlparse/lxml_.py", line 53, in parse_html
    tree = lxml.etree.parse(file, parser=parser)
  File "src/lxml/etree.pyx", line 3424, in lxml.etree.parse
  File "src/lxml/parser.pxi", line 1861, in lxml.etree._parseDocument
  File "src/lxml/parser.pxi", line 1881, in lxml.etree._parseFilelikeDocument
  File "src/lxml/parser.pxi", line 1776, in lxml.etree._parseDocFromFilelike
  File "src/lxml/parser.pxi", line 1187, in lxml.etree._BaseParser._parseDocFromFilelike
  File "src/lxml/parser.pxi", line 601, in lxml.etree._ParserContext._handleParseResultDoc
  File "src/lxml/parser.pxi", line 707, in lxml.etree._handleParseResult
  File "src/lxml/etree.pyx", line 314, in lxml.etree._ExceptionContext._raise_if_stored
  File "src/lxml/parser.pxi", line 370, in lxml.etree._FileReaderContext.copyToBuffer
  File "/nix/store/kimimvhimclyfzlncpg36zjni3wn70nq-python3-3.7.3/lib/python3.7/gzip.py", line 276, in read
    return self._buffer.read(size)
  File "/nix/store/kimimvhimclyfzlncpg36zjni3wn70nq-python3-3.7.3/lib/python3.7/_compression.py", line 68, in readinto
    data = self.read(len(byte_view))
  File "/nix/store/kimimvhimclyfzlncpg36zjni3wn70nq-python3-3.7.3/lib/python3.7/gzip.py", line 454, in read
    self._read_eof()
  File "/nix/store/kimimvhimclyfzlncpg36zjni3wn70nq-python3-3.7.3/lib/python3.7/gzip.py", line 498, in _read_eof
    crc32, isize = struct.unpack("<II", self._read_exact(8))
  File "/nix/store/kimimvhimclyfzlncpg36zjni3wn70nq-python3-3.7.3/lib/python3.7/gzip.py", line 400, in _read_exact
    raise EOFError("Compressed file ended before the "
EOFError: Compressed file ended before the end-of-stream marker was reached
CRITICAL Sorry, Wpull unexpectedly crashed.
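
The last frames of the traceback are in the standard library's `gzip.py`: `_read_eof` tries to read the 8-byte trailer (CRC32 and size) and `_read_exact` raises `EOFError` because the compressed body ends early. The following is a minimal sketch that reproduces that exception with the standard library alone and shows one way a caller could keep whatever was decompressed before the failure. It is not wpull's or grab-site's code and not a proposed patch; `read_gzip_tolerant` and the sample payload are hypothetical names made up for illustration, and the assumption that the server's response was missing its trailer is inferred only from the traceback.

```python
import gzip
import io
from typing import Tuple


def read_gzip_tolerant(data: bytes, chunk_size: int = 1024) -> Tuple[bytes, bool]:
    """Hypothetical helper: decompress as much of `data` as possible.

    Returns (decompressed_bytes, truncated).
    """
    out = bytearray()
    with gzip.GzipFile(fileobj=io.BytesIO(data)) as f:
        try:
            while True:
                # read1() performs at most one raw read per call, so bytes
                # returned by earlier calls are already in `out` if a later
                # call raises.
                chunk = f.read1(chunk_size)
                if not chunk:
                    return bytes(out), False
                out.extend(chunk)
        except EOFError:
            # The stream ended before the gzip trailer could be read.
            return bytes(out), True


if __name__ == "__main__":
    # Made-up sitemap-like payload, for illustration only.
    payload = b"<urlset><url><loc>https://example.invalid/</loc></url></urlset>"
    complete = gzip.compress(payload)
    cut = complete[:-8]  # drop the 8-byte CRC32/size trailer

    try:
        gzip.decompress(cut)
    except EOFError as exc:
        print("EOFError:", exc)

    print(read_gzip_tolerant(cut))
```

The sketch uses `read1()` rather than `read()` because a buffered `read()` can discard already-decompressed bytes when a later raw read raises. Whether a crawler should parse a partial sitemap or skip it is a separate design decision; the point here is only that the exception can be caught around the read instead of ending the whole crawl.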