digitalmethodsinitiative / dmi-scheduler

A lightweight job queue & scheduler
MIT License
1 stars 0 forks source link

`self.work()` exception clause being raised to `threading.py` instead of being handled #1

Open dale-wahl opened 3 years ago

dale-wahl commented 3 years ago

Theory: custom exceptions not being caught properly in the program itself? If that's the case, then is it a dmi-scheduler issue? The real problem is that any sort of logging is not handled and the job is not properly closed out by the dmi-scheduler.

Possible solution: at the very least we could ensure that our base logger also handles the root log so that logs are not just send to stderr/stdout and stored in the log file. This would not resolve the error handling/job closure.

Two examples where exceptions are being raised to _bootstrap_inner in threading.py.

Exception in thread crawler:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/opt/middleman-python/dmi_scheduler/worker.py", line 57, in run
    self.work()
  File "/opt/middleman-python/workers/crawl.py", line 46, in work
    network_pending.iterate()
  File "/opt/middleman-python/lib/network.py", line 222, in iterate
    new_corpus.add_LRUs(starting_points)
  File "/opt/middleman-python/lib/corpus.py", line 283, in add_LRUs
    self.add_URLs([hyphe.LRU_to_URL(lru) for lru in lrus])
  File "/opt/middleman-python/lib/corpus.py", line 241, in add_URLs
    self.add_URL(url)
  File "/opt/middleman-python/lib/corpus.py", line 255, in add_URL
    response = hyphe.call("store.declare_webentity_by_lrus_as_urls", [url], None, None, [], False, self.corpus_id)
  File "/opt/middleman-python/lib/hyphe.py", line 90, in call
    raise HypheException("Unexpected response format\n\n%s" % repr(response))
lib.exceptions.HypheException: Unexpected response format

{'jsonrpc': '2.0', 'id': '0dbec3f9-ed8e-4afc-a262-90057e3a6cb3', 'error': {'message': '', 'code': 8002, 'data': ''}}

or

Exception in thread crawler:                                                                                                                                                               
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/threading.py", line 926, in _bootstrap_inner                                                                                                                  
  self.run()
  File "/opt/middleman-python/dmi_scheduler/worker.py", line 57, in run                                                                                                                        
  self.work()
File "/opt/middleman-python/workers/crawl.py", line 46, in work
  network_pending.iterate()
File "/opt/middleman-python/lib/network.py", line 216, in iterate
  new_corpus = lib.corpus.Corpus.create(self.get_corpus_ID(next_iteration), self.local_db, self.remote_db, detail_level, self.depth)
File "/opt/middleman-python/lib/corpus.py", line 656, in create
  raise CorpusException("Could not create corpus %s: %s" % (name, repr(created)))
lib.exceptions.CorpusException: Could not create corpus ic-2-356267: {'message': 'There already is a corpus named "ic-2-356267".', 'code': 'fail'}
dale-wahl commented 3 years ago

In [middleman-python](), I changed our base exception class from

class MiddlemanException(BaseException):
    pass

to

class MiddlemanException(Exception):
    pass

and a CorpusException error was caught.

2021-11-03 12:00:55,960 [ERROR]  Worker crawler raised exception CorpusException and will abort: Could not create corpus ic-2-356269: {'message': 'There already is a corpus named "ic-2-356269".', 'code': 'fail'} at worker.py:57->crawl.py:46->network.py:216->corpus.py:656

But it has always been BaseException so there must be another reason. At some point I switched self.log.error() to self.log.exception() in the Exception block after self.run() from the dmi-scheduler worker and then back again. Perhaps that was the culprit. I switched back to BaseException to test that, but it failed to catch the error again (while using self.log.error() per normal).

This seems fixed, but I do not know why it started to begin with.