Better user-facing error messages when a Rodan job fails

timothydereuse commented 1 year ago

Currently, when a job fails, the only feedback the user gets is a Python error traceback. In addition to this traceback, the job should be able to pass a message back to the client that the user can see.

For example, if the staff-finding job can fail because it is given a blank image, the job itself should check for that, and when it throws an exception, the exception's message should be delivered to the user in a friendlier, easier-to-read form than just the Python stack trace. Hopefully this would let people figure out on their own why a job went wrong, even if they do not know how to interpret the traceback.
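One way the idea above could look, as a minimal sketch only: the exception class `UserFacingJobError`, the helper `check_not_blank`, and the wrapper `run_job` are all hypothetical names, not part of Rodan, and the image is modelled as a plain list of pixel rows so the example is self-contained.

```python
class UserFacingJobError(Exception):
    """Hypothetical exception type carrying a message meant for the end user."""

    def __init__(self, user_message, hint=None):
        super().__init__(user_message)
        self.user_message = user_message
        self.hint = hint


def check_not_blank(pixels):
    """Raise a friendly error if a grayscale image (rows of ints) has no variation."""
    values = {value for row in pixels for value in row}
    if len(values) <= 1:
        raise UserFacingJobError(
            "The input image appears to be blank, so no staves can be found.",
            hint="Check that the correct image was assigned to this input port.",
        )


def run_job(job, pixels):
    """Run a job and return a dict the client could show alongside the traceback."""
    try:
        check_not_blank(pixels)
        return {"status": "finished", "result": job(pixels)}
    except UserFacingJobError as err:
        # Only the readable message and hint are surfaced here; the full
        # traceback would still be recorded separately for developers.
        return {"status": "failed", "message": err.user_message, "hint": err.hint}
```

The point of a dedicated exception type is that the client can tell "a check the job author anticipated" apart from an unexpected crash, and show the former prominently.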

malajvan commented 1 year ago

I am looking into creating checks for specific jobs and would like to know which errors you commonly run into when working with Rodan. As people who frequently work on E2E Rodan workflows, can you list some common issues that you would like to see specific messages for (e.g. wrong file input, empty pages, etc.)? @JoyfulGen @martha-thomae

martha-thomae commented 1 year ago

I am trying my best to remember some of these issues. Right now I can only think of one:

[x] At the very beginning, in the Image Resize job: the resize ratio should not be greater than 1. In one case, a user tried to resize by a ratio greater than 1 to achieve an inter-staff distance of 64 pixels. In that case they didn't need to resize. Resizing by a ratio greater than 1 would cause that job to never end (so it didn't fail, it just never finished). But maybe this can be handled in a different way, such as adding that information to the settings (value lower than or equal to 1) and making the job fail when the ratio is greater than 1.
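A rough sketch of the "fail when the ratio is greater than 1" suggestion, assuming the ratio arrives as a number or numeric string from the job settings. The function name `validate_resize_ratio` is made up for illustration and is not an existing Rodan function.

```python
def validate_resize_ratio(ratio):
    """Return the ratio as a float, or raise ValueError with a readable message."""
    try:
        value = float(ratio)
    except (TypeError, ValueError):
        raise ValueError(f"Resize ratio must be a number, got {ratio!r}.")
    # Reject zero, negatives, NaN, and anything above 1 before any work starts,
    # so the job fails immediately instead of running indefinitely.
    if not 0 < value <= 1:
        raise ValueError(
            f"Resize ratio must be greater than 0 and at most 1, got {value}. "
            "If the image already has the target inter-staff distance, "
            "no resizing is needed."
        )
    return value
```

Running a check like this before the resize starts would turn the never-ending job into an immediate failure with an explanation.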

I think that when I get to test the end-to-end OMR workflow again, I will think of more things besides this and what Tim already said about the staff processing.

JoyfulGen commented 1 year ago

I've been noodling around with simple mistakes I can imagine new or distracted users might make, and here's what I've got so far:

Mistake: Assigning the same model to two different input ports in the Fast Pixelwise Analysis job. Result: MEI_encoding job fails, with the following traceback error:

  File "/usr/local/lib/python3.7/site-packages/celery/app/trace.py", line 412, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/celery/app/trace.py", line 704, in __protected_call__
    return self.run(*args, **kwargs)
  File "/code/Rodan/rodan/jobs/base.py", line 773, in run
    retval = self.run_my_task(inputs, settings, arg_outputs)
  File "/code/Rodan/rodan/jobs/MEI_encoding/MEI_encoding.py", line 85, in run_my_task
    mei_string = bm.process(jsomr, syls, classifier_table, width_mult, width_container)
  File "/code/Rodan/rodan/jobs/MEI_encoding/build_mei_file.py", line 718, in process
    meiDoc = build_mei(pairs, classifier, width_container, jsomr['staves'], jsomr['page'])
  File "/code/Rodan/rodan/jobs/MEI_encoding/build_mei_file.py", line 503, in build_mei
    bb = staves[0]['bounding_box']
IndexError: list index out of range

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/celery/app/trace.py", line 429, in trace_task
    I, R, state, retval = on_error(task_request, exc, uuid)
  File "/usr/local/lib/python3.7/site-packages/celery/app/trace.py", line 366, in on_error
    task, request, eager=eager, call_errbacks=call_errbacks,
  File "/usr/local/lib/python3.7/site-packages/celery/app/trace.py", line 173, in handle_error_state
    call_errbacks=call_errbacks)
  File "/usr/local/lib/python3.7/site-packages/celery/app/trace.py", line 221, in handle_failure
    task.on_failure(exc, req.id, req.args, req.kwargs, einfo)
  File "/code/Rodan/rodan/jobs/base.py", line 1015, in on_failure
    and user.user_preference.sned_email
AttributeError: 'UserPreference' object has no attribute 'sned_email'

Mistake: Assigning a previously generated symbol layer as the original folio image. Result: MEI_encoding job fails, with the following traceback error:

    R = retval = fun(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/celery/app/trace.py", line 704, in __protected_call__
    return self.run(*args, **kwargs)
  File "/code/Rodan/rodan/jobs/base.py", line 773, in run
    retval = self.run_my_task(inputs, settings, arg_outputs)
  File "/code/Rodan/rodan/jobs/MEI_encoding/MEI_encoding.py", line 85, in run_my_task
    mei_string = bm.process(jsomr, syls, classifier_table, width_mult, width_container)
  File "/code/Rodan/rodan/jobs/MEI_encoding/build_mei_file.py", line 718, in process
    meiDoc = build_mei(pairs, classifier, width_container, jsomr['staves'], jsomr['page'])
  File "/code/Rodan/rodan/jobs/MEI_encoding/build_mei_file.py", line 503, in build_mei
    bb = staves[0]['bounding_box']
IndexError: list index out of range

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/celery/app/trace.py", line 429, in trace_task
    I, R, state, retval = on_error(task_request, exc, uuid)
  File "/usr/local/lib/python3.7/site-packages/celery/app/trace.py", line 366, in on_error
    task, request, eager=eager, call_errbacks=call_errbacks,
  File "/usr/local/lib/python3.7/site-packages/celery/app/trace.py", line 173, in handle_error_state
    call_errbacks=call_errbacks)
  File "/usr/local/lib/python3.7/site-packages/celery/app/trace.py", line 221, in handle_failure
    task.on_failure(exc, req.id, req.args, req.kwargs, einfo)
  File "/code/Rodan/rodan/jobs/base.py", line 1015, in on_failure
    and user.user_preference.sned_email
AttributeError: 'UserPreference' object has no attribute 'sned_email'
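Both tracebacks above end at `bb = staves[0]['bounding_box']`, which raises `IndexError` when the list of staves is empty. A guard along the following lines could replace the bare index with a message the user can act on. This is only a sketch: the helper name `first_staff_bounding_box` and the wording of the message are assumptions, not code from `build_mei_file.py`.

```python
def first_staff_bounding_box(staves):
    """Return the first staff's bounding box, or raise a readable error if none exist."""
    if not staves:
        raise ValueError(
            "MEI encoding received no staves. This can happen when the wrong "
            "layer or model was assigned to an input port earlier in the "
            "workflow; check the inputs of the preceding jobs."
        )
    return staves[0]["bounding_box"]
```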

Mistake: In the NIC job, assigning the split_features file to the training data port and vice versa. Result: NIC failed, with the following traceback error:

    R = retval = fun(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/celery/app/trace.py", line 704, in __protected_call__
    return self.run(*args, **kwargs)
  File "/code/Rodan/rodan/jobs/base.py", line 773, in run
    retval = self.run_my_task(inputs, settings, arg_outputs)
  File "/code/Rodan/rodan/jobs/gamera_rodan/wrappers/classification.py", line 91, in run_my_task
    cknn = gamera.knn.kNNNonInteractive(tempPath)
  File "/usr/local/lib/python3.7/site-packages/gamera/knn.py", line 686, in __init__
    classify.NonInteractiveClassifier.__init__(self, database, perform_splits)
  File "/usr/local/lib/python3.7/site-packages/gamera/classify.py", line 515, in __init__
    self.from_xml_filename(database)
  File "/usr/local/lib/python3.7/site-packages/gamera/classify.py", line 427, in from_xml_filename
    self._from_xml(stream)
  File "/usr/local/lib/python3.7/site-packages/gamera/classify.py", line 432, in _from_xml
    self.set_glyphs(database)
  File "/usr/local/lib/python3.7/site-packages/gamera/classify.py", line 558, in set_glyphs
    self.instantiate_from_images(self.database, self.normalize)
ValueError: Initial database of a non-interactive kNN classifier must have at least one element.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/celery/app/trace.py", line 429, in trace_task
    I, R, state, retval = on_error(task_request, exc, uuid)
  File "/usr/local/lib/python3.7/site-packages/celery/app/trace.py", line 366, in on_error
    task, request, eager=eager, call_errbacks=call_errbacks,
  File "/usr/local/lib/python3.7/site-packages/celery/app/trace.py", line 173, in handle_error_state
    call_errbacks=call_errbacks)
  File "/usr/local/lib/python3.7/site-packages/celery/app/trace.py", line 221, in handle_failure
    task.on_failure(exc, req.id, req.args, req.kwargs, einfo)
  File "/code/Rodan/rodan/jobs/base.py", line 1015, in on_failure
    and user.user_preference.sned_email
AttributeError: 'UserPreference' object has no attribute 'sned_email'

There are also mistakes the user can make that don't provoke a failure of the workflow, but mess up the final results. Let me know if those would be useful to know also!

sabrina0822 commented 1 year ago

This is mildly unrelated but I'm assuming the sned_email is a typo?

(in rodan-main/code/rodan/jobs/base.py)
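If it is a typo, it has a side effect worth noting: the `AttributeError` is raised inside the failure handler, so it is stacked on top of the original error in every traceback above. A defensive sketch, assuming the intended attribute is `send_email` (an assumption, not confirmed here) and using hypothetical function names and a hypothetical `notify` callback:

```python
import logging

logger = logging.getLogger(__name__)


def should_send_failure_email(user):
    """Return True only if the user's preferences ask for failure emails."""
    preference = getattr(user, "user_preference", None)
    # "send_email" is the assumed correct spelling of the attribute.
    return bool(getattr(preference, "send_email", False))


def on_failure(exc, user, notify):
    """Hypothetical failure hook: notification problems are logged, not re-raised."""
    try:
        if should_send_failure_email(user):
            notify(user, str(exc))
    except Exception:
        # Keep the original job error as the only error the user sees.
        logger.exception("Could not send failure notification")
```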

sabrina0822 commented 1 year ago

@JoyfulGen if this is too much work do not worry about it at all, but for the first two errors, do you happen to have the MEI encoding inputs that cause the error outputs? Otherwise I can re-create it!

JoyfulGen commented 1 year ago

@sabrina0822 here they are:

Assigned the music_symbol model to both layer 1 and layer 2 input ports of the Fast_pixelwise job: 129r_same_model_twice_PF.json.zip, 129r_same_model_twice_TA.json.zip

Used a previously generated symbol layer as the original image: using_symbol_layer_as_image_PF.json.zip, using_symbol_layer_as_image_TA.json.zip

DDMAL / Rodan

Better user-facing error messages when a Rodan job fails #1041