ArchiveTeam / grab-site

The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

WSL: lmdb.CorruptedError: mdb_get: MDB_CORRUPTED: Located page was wrong type #158

Closed: menmob closed this issue 5 years ago

menmob commented 5 years ago

Running Ubuntu on WSL, I am getting this error for any website I try.

Imported /mnt/c/Windows/system32/scp-wiki.net-2019-09-03-1177b379/ignores
Connected to ws://127.0.0.1:29000
Imported /mnt/c/Windows/system32/scp-wiki.net-2019-09-03-1177b379/max_content_length
302 Found http://scp-wiki.net/
Imported /mnt/c/Windows/system32/scp-wiki.net-2019-09-03-1177b379/delay
Imported /mnt/c/Windows/system32/scp-wiki.net-2019-09-03-1177b379/concurrency
200 OK http://www.scp-wiki.net/
/home/britmob/gs-venv/lib/python3.7/site-packages/wpull/protocol/http/client.py:185: UserWarning: HTTP session did not complete.
  warnings.warn(_('HTTP session did not complete.'))
302 Found http://scp-wiki.net/robots.txt
ERROR Fatal exception.
Traceback (most recent call last):
  File "/home/britmob/gs-venv/lib/python3.7/site-packages/wpull/application/app.py", line 157, in run
    yield from pipeline.process()
  File "/home/britmob/gs-venv/lib/python3.7/site-packages/wpull/pipeline/pipeline.py", line 194, in process
    yield from self._process_one_worker()
  File "/home/britmob/gs-venv/lib/python3.7/site-packages/wpull/pipeline/pipeline.py", line 215, in _process_one_worker
    task.result()
  File "/home/britmob/gs-venv/lib/python3.7/site-packages/wpull/pipeline/pipeline.py", line 119, in process
    item = yield from self.process_one(_worker_id=worker_id)
  File "/home/britmob/gs-venv/lib/python3.7/site-packages/wpull/pipeline/pipeline.py", line 103, in process_one
    yield from task.process(item)
  File "/home/britmob/gs-venv/lib/python3.7/site-packages/wpull/application/tasks/download.py", line 421, in process
    yield from session.app_session.factory['Processor'].process(session)
  File "/home/britmob/gs-venv/lib/python3.7/site-packages/wpull/processor/delegate.py", line 29, in process
    return (yield from processor.process(item_session))
  File "/home/britmob/gs-venv/lib/python3.7/site-packages/wpull/processor/web.py", line 91, in process
    return (yield from session.process())
  File "/home/britmob/gs-venv/lib/python3.7/site-packages/wpull/processor/web.py", line 185, in process
    yield from self._process_loop()
  File "/home/britmob/gs-venv/lib/python3.7/site-packages/wpull/processor/web.py", line 244, in _process_loop
    exit_early, wait_time = yield from self._fetch_one(cast(Request, self._item_session.request))
  File "/home/britmob/gs-venv/lib/python3.7/site-packages/wpull/processor/web.py", line 308, in _fetch_one
    action = self._handle_response(request, response)
  File "/home/britmob/gs-venv/lib/python3.7/site-packages/wpull/processor/web.py", line 423, in _handle_response
    self._processing_rule.scrape_document(self._item_session)
  File "/home/britmob/gs-venv/lib/python3.7/site-packages/libgrabsite/wpull_tweaks.py", line 43, in scrape_document
    dupe_of = dupes_db.get_old_url(digest)
  File "/home/britmob/gs-venv/lib/python3.7/site-packages/libgrabsite/dupes.py", line 37, in get_old_url
    maybe_url = txn.get(digest)
lmdb.CorruptedError: mdb_get: MDB_CORRUPTED: Located page was wrong type
CRITICAL Sorry, Wpull unexpectedly crashed.

https://i.imgur.com/2pmKXvL.png

I assume this is due to WSL. Is there anything I should try?

ivan commented 5 years ago

Yeah, WSL has some problems and will get replaced with WSL2 soon.

You could run grab-site with --no-dupespotter, which will avoid using lmdb.
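
As a rough sketch of that workaround (not part of grab-site itself), the snippet below builds a grab-site command line and adds --no-dupespotter when the host looks like WSL. The helper names running_under_wsl and grab_site_command are hypothetical, and the /proc/version check is only a heuristic: it is an assumption that the kernel version string contains "microsoft", and it will also match WSL2.

```python
import shlex
from pathlib import Path


def running_under_wsl(version_text=None):
    """Heuristic (assumption): WSL kernels mention 'Microsoft' in /proc/version."""
    if version_text is None:
        try:
            version_text = Path("/proc/version").read_text()
        except OSError:
            # No /proc/version (e.g. not Linux): treat as not WSL.
            return False
    return "microsoft" in version_text.lower()


def grab_site_command(url, avoid_lmdb):
    """Build the grab-site argv; --no-dupespotter avoids the lmdb-backed dupes check."""
    cmd = ["grab-site"]
    if avoid_lmdb:
        cmd.append("--no-dupespotter")
    cmd.append(url)
    return cmd


if __name__ == "__main__":
    cmd = grab_site_command("http://www.scp-wiki.net/", avoid_lmdb=running_under_wsl())
    # Print the command instead of running it, so it can be reviewed first.
    print(shlex.join(cmd))
```

Passing the flag by hand, as suggested above, does the same thing; the detection step only saves remembering it on WSL machines.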

menmob commented 5 years ago

That worked, thank you.
