ArchiveTeam / grab-site

The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

Log gets spammed with UTF-8 encode errors regarding unpaired surrogates #73

Open ethus3h opened 8 years ago

ethus3h commented 8 years ago

Hi, I'm getting a lot of this error in my log, apparently because the status messages an FTP server returns contain unpaired Unicode surrogates. It makes the log hard to read. Could this error be muted? Thanks :)

--- Logging error ---
Traceback (most recent call last):
  File "/usr/lib/python3.4/logging/__init__.py", line 980, in emit
    stream.write(msg)
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcf0' in position 104: surrogates not allowed
Call stack:
  File "/home/grabbot/.local/bin/grab-site", line 4, in <module>
    main.main()
  File "/home/grabbot/.local/lib/python3.4/site-packages/click/core.py", line 716, in __call__
    return self.main(*args, **kwargs)
  File "/home/grabbot/.local/lib/python3.4/site-packages/click/core.py", line 696, in main
    rv = self.invoke(ctx)
  File "/home/grabbot/.local/lib/python3.4/site-packages/click/core.py", line 889, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/grabbot/.local/lib/python3.4/site-packages/click/core.py", line 534, in invoke
    return callback(*args, **kwargs)
  File "/home/grabbot/.local/lib/python3.4/site-packages/libgrabsite/main.py", line 307, in main
    wpull.__main__.main()
  File "/home/grabbot/.local/lib/python3.4/site-packages/wpull/__main__.py", line 40, in main
    exit_code = application.run_sync()
  File "/home/grabbot/.local/lib/python3.4/site-packages/wpull/app.py", line 118, in run_sync
    return self._event_loop.run_until_complete(self.run())
  File "/home/grabbot/.local/lib/python3.4/site-packages/trollius/base_events.py", line 338, in run_until_complete
    self.run_forever()
  File "/home/grabbot/.local/lib/python3.4/site-packages/trollius/base_events.py", line 309, in run_forever
    self._run_once()
  File "/home/grabbot/.local/lib/python3.4/site-packages/trollius/base_events.py", line 1217, in _run_once
    handle._run()
  File "/home/grabbot/.local/lib/python3.4/site-packages/trollius/events.py", line 136, in _run
    self._callback(*self._args)
  File "/home/grabbot/.local/lib/python3.4/site-packages/trollius/tasks.py", line 338, in _wakeup
    self._step(value, None)
  File "/home/grabbot/.local/lib/python3.4/site-packages/trollius/tasks.py", line 252, in _step
    result = coro.send(value)
  File "/home/grabbot/.local/lib/python3.4/site-packages/wpull/processor/ftp.py", line 333, in _fetch
    self._log_response(request, response)
  File "/home/grabbot/.local/lib/python3.4/site-packages/wpull/processor/ftp.py", line 403, in _log_response
    content_length=response.body.size(),
Message: <wpull.backport.logging.BraceMessage object at 0x7f05f4391780>
Arguments: ()
ivan commented 8 years ago

This should be filed on wpull. The FTP URL would be helpful as well.

ethus3h commented 8 years ago

Filed as https://github.com/chfoo/wpull/issues/310. The FTP server is the same as in #74: ftp://91.193.237.1/ :)
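
For readers who hit the same traceback, here is a minimal sketch of how the error can arise and one way it might be muted. It is not code from grab-site or wpull. It assumes the `'\udcf0'` in the log came from a non-UTF-8 byte (0xF0) in the FTP reply that was decoded with Python's `surrogateescape` error handler, which maps such bytes into the U+DC80 to U+DCFF range. The names `strip_surrogates` and `SurrogateSafeFilter` are hypothetical and exist only for this illustration.

```python
import io
import logging


def strip_surrogates(text):
    """Escape code points UTF-8 cannot encode (e.g. lone surrogates)."""
    return text.encode("utf-8", errors="backslashreplace").decode("utf-8")


class SurrogateSafeFilter(logging.Filter):
    """Hypothetical filter: render the message early and escape lone surrogates."""

    def filter(self, record):
        # getMessage() calls str() on the message object and applies any args,
        # so this also works for non-string message objects.
        record.msg = strip_surrogates(record.getMessage())
        record.args = ()
        return True


# Assumed origin of the bad character: a raw 0xF0 byte decoded with
# surrogateescape becomes the lone surrogate U+DCF0.
reply = b"230 Welcome \xf0".decode("ascii", errors="surrogateescape")

# A strict UTF-8 text stream, standing in for the real log destination.
buffer = io.BytesIO()
stream = io.TextIOWrapper(buffer, encoding="utf-8")

handler = logging.StreamHandler(stream)
handler.addFilter(SurrogateSafeFilter())

logger = logging.getLogger("surrogate-demo")
logger.propagate = False
logger.setLevel(logging.INFO)
logger.addHandler(handler)

# Without the filter, this call would trigger "--- Logging error ---".
logger.info("Server said: %s", reply)
stream.flush()
output = buffer.getvalue().decode("utf-8")
```

With the filter attached, the unencodable character is written as the literal text `\udcf0` instead of raising `UnicodeEncodeError` inside the handler. Whether a filter like this is the right fix for wpull itself is a separate question for the upstream issue.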