ArchiveTeam / grab-site

The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

wpull crash when http_proxy is set #148

Open yi opened 5 years ago

yi commented 5 years ago

grab-site suddenly stopped working and has not worked since.

I've tried uninstalling and reinstalling wpull and grab-site, but it still doesn't work.

Please help!

Traceback (most recent call last):
  File "/Users/aaa/gs-venv/lib/python3.7/site-packages/wpull/application/app.py", line 157, in run
    yield from pipeline.process()
  File "/Users/aaa/gs-venv/lib/python3.7/site-packages/wpull/pipeline/pipeline.py", line 194, in process
    yield from self._process_one_worker()
  File "/Users/aaa/gs-venv/lib/python3.7/site-packages/wpull/pipeline/pipeline.py", line 215, in _process_one_worker
    task.result()
  File "/Users/aaa/gs-venv/lib/python3.7/site-packages/wpull/pipeline/pipeline.py", line 119, in process
    item = yield from self.process_one(_worker_id=worker_id)
  File "/Users/aaa/gs-venv/lib/python3.7/site-packages/wpull/pipeline/pipeline.py", line 103, in process_one
    yield from task.process(item)
  File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/asyncio/coroutines.py", line 120, in coro
    res = func(*args, **kw)
  File "/Users/aaa/gs-venv/lib/python3.7/site-packages/wpull/application/tasks/network.py", line 21, in process
    self._build_connection_pool(session)
  File "/Users/aaa/gs-venv/lib/python3.7/site-packages/wpull/application/tasks/network.py", line 85, in _build_connection_pool
    http_proxy = session.args.http_proxy.split(':', 1)
AttributeError: 'NoneType' object has no attribute 'split'
CRITICAL Sorry, Wpull unexpectedly crashed.
Disconnected from ws:// server: RuntimeError('Event loop is closed')
Exception ignored in: <coroutine object sender at 0x10e9d7ac8>
Traceback (most recent call last):
  File "/Users/aaa/gs-venv/lib/python3.7/site-packages/libgrabsite/dashboard_client.py", line 54, in sender
    await asyncio.sleep(delay)
  File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/asyncio/tasks.py", line 566, in sleep
    future, result)
  File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/asyncio/base_events.py", line 657, in call_later
    context=context)
  File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/asyncio/base_events.py", line 667, in call_at
    self._check_closed()
  File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/asyncio/base_events.py", line 480, in _check_closed
    raise RuntimeError('Event loop is closed')

ivan commented 5 years ago

That http_proxy = session.args.http_proxy.split(':', 1) line makes me think something set an environment variable to use an HTTP proxy.

Try env | grep -i proxy and maybe unset the variable?
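If it's easier to check from inside the same Python environment that runs grab-site, here is a rough equivalent of env | grep -i proxy. It is only a sketch: the function name proxy_env_vars is made up for this example and is not part of grab-site or wpull. Like grep, it matches "proxy" anywhere in the NAME=value line, so a variable whose value mentions a proxy also shows up.

```python
import os


def proxy_env_vars(environ=None):
    """Return variables whose NAME=value line contains 'proxy', ignoring case.

    Hypothetical helper for illustration; mirrors `env | grep -i proxy`.
    """
    environ = os.environ if environ is None else environ
    return {
        name: value
        for name, value in environ.items()
        if "proxy" in f"{name}={value}".lower()
    }


if __name__ == "__main__":
    # Print matches in a stable order, one NAME=value per line.
    for name, value in sorted(proxy_env_vars().items()):
        print(f"{name}={value}")
```

Anything it prints (for example http_proxy, HTTPS_PROXY or no_proxy) is a candidate to unset before starting the crawl again.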

Please let me know if it's not that.

codsane commented 4 years ago

I'm receiving an identical error after setting wpull's proxy using --wpull-args="--http-proxy=0.0.0.0:16379"
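For reference, the line at the bottom of the traceback splits the proxy value on the first colon, which fails with this AttributeError whenever the value is None. The sketch below shows that split with a None guard added. It is not wpull's code: parse_http_proxy is a hypothetical name, and returning None for "no proxy configured" is an assumption made for this example.

```python
from typing import Optional, Tuple


def parse_http_proxy(value: Optional[str]) -> Optional[Tuple[str, int]]:
    """Split 'HOST:PORT' into (host, port); return None if no proxy is set.

    Hypothetical helper for illustration only, not taken from wpull.
    """
    if value is None:
        # Calling None.split(...) is what raises the AttributeError above.
        return None
    parts = value.split(":", 1)  # same split as the failing line
    if len(parts) != 2 or not parts[0] or not parts[1].isdigit():
        raise ValueError(f"expected HOST:PORT, got {value!r}")
    return parts[0], int(parts[1])


if __name__ == "__main__":
    print(parse_http_proxy("0.0.0.0:16379"))  # ('0.0.0.0', 16379)
    print(parse_http_proxy(None))             # None
```

So the value I passed is well-formed for this kind of split; the crash suggests the code reached that line with no proxy value at all.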

Unfortunately env | grep -i proxy doesn't seem to return anything, and I've even made sure to run it within the container that grab-site is running in.

Even after removing --wpull-args, grab-site seems to be crashing with the same event loop error when attempting to crawl. In my case I was able to reinstall grab-site to fix this. I've even switched to dockerized grab-site, to make it easier to spin up fresh environments for testing.

As I'd like to eventually bring full onion archive capabilities to grab-site, I have decided to first make sure my wget onion archive configuration can be ported to wpull.

I've opened an issue to address my own problems using proxies in wpull. Assuming I can get those cleared up, I will take another look at the proxy issues we're seeing in grab-site. grab-site appears to run a fork of wpull, so I'm wondering whether our proxy issue is specific to that fork or to the plugins that grab-site introduces.