allegroai / clearml-server

ClearML - Auto-Magical CI/CD to streamline your AI workload. Experiment Management, Data Management, Pipeline, Orchestration, Scheduling & Serving in one MLOps/LLMOps solution
https://clear.ml/docs

Feature request: Display errors encountered by the experiment in the UI #54

Open ranrubin opened 4 years ago

ranrubin commented 4 years ago

Feature request

When running experiment code, the UI displayed a server error (500) with no details about the cause. After exploring the logs, I found that the fileserver had crashed because a file name was too long.

I would love a way for the UI to clearly display all kinds of errors encountered by the experiment (including, but not limited to, file names being too long...)

The solution I would like

I would rather get a message in the UI saying that the file name is too long than have to look for the issue in the logs.

Additional context

Logs from /opt/trains/logs/fileserver.log

[2020-07-12 14:22:04,977] [7] [ERROR] [fileserver] Exception on / [POST]
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/flask/app.py", line 2447, in wsgi_app
    response = self.full_dispatch_request()
  File "/usr/local/lib/python3.6/site-packages/flask/app.py", line 1952, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/usr/local/lib/python3.6/site-packages/flask_cors/extension.py", line 161, in wrapped_function
    return cors_after_request(app.make_response(f(*args, **kwargs)))
  File "/usr/local/lib/python3.6/site-packages/flask/app.py", line 1821, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/usr/local/lib/python3.6/site-packages/flask/_compat.py", line 39, in reraise
    raise value
  File "/usr/local/lib/python3.6/site-packages/flask/app.py", line 1950, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/local/lib/python3.6/site-packages/flask/app.py", line 1936, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "fileserver.py", line 32, in upload
    file.save(str(target))
  File "/usr/local/lib/python3.6/site-packages/werkzeug/datastructures.py", line 3066, in save
    dst = open(dst, "wb")
OSError: [Errno 36] File name too long: '/mnt/fileserver/trains/14-trains.bd72a5a2afdhsy2aa0acc3dca21b9b5f/metrics/Evaluator CV_no_my_real_name_no_my_real_name_no_my_real_name_len_512_Jul12_14-20-42_merge__no_my_real_nameh_sz_8__no_my_real_name_sz_8_lr_1e-06_w_decay_0.0_warm_up_50_in_sz_768_hid_sz_256_word_aug_p_0.0_no_my_real_name_1__no_my_real_name/_no_my_real_name _no_my_real_name_no_my_real_namele layer__no_my_real_namelen_512_Jul12_14-20-42__no_my_real_name_batch_sz_8_test_batch_sz_8_lr_1e-06_w_d_no_my_real_name0_in_sz_768_hid__no_my_real_name_0.0_word_aug_min_1_imba_no_my_real_name__no_my_real_name_00000000.jpeg'

bmartinn commented 4 years ago

Hi @ranrubin

The original bug (#49) is actually a bug in Trains (even though it manifests in trains-server). Trains will try to create links that the file storage might not support (basically, there is a file name length limit; e.g. S3 object storage has its own limits, as do shared filesystems). A fix will be deployed in the next RC (due in a few days).
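To illustrate the file name length limit described above, here is a minimal Python sketch of one way a client could shorten an over-long path component before uploading. It is not the fix that went into Trains: the helper name `shorten_component` and the 255-byte default are assumptions for illustration (255 bytes is the usual per-component limit on common Linux filesystems; other storage backends differ).

```python
import hashlib
import os

# Assumption: 255 bytes per path component, typical for Linux filesystems.
MAX_COMPONENT_BYTES = 255


def shorten_component(name: str, limit: int = MAX_COMPONENT_BYTES) -> str:
    """Hypothetical helper: return `name` if it fits within `limit` bytes,
    otherwise a truncated stem plus a short hash and the original extension."""
    encoded = name.encode("utf-8")
    if len(encoded) <= limit:
        return name
    stem, ext = os.path.splitext(name)
    # A short digest of the full name keeps distinct long names distinct.
    digest = hashlib.sha1(encoded).hexdigest()[:8]
    suffix = f"_{digest}{ext}"
    # Bytes left for the stem; an extension longer than the limit is not handled.
    budget = max(limit - len(suffix.encode("utf-8")), 0)
    # Truncate on bytes and drop any partial multi-byte character at the cut.
    short_stem = stem.encode("utf-8")[:budget].decode("utf-8", errors="ignore")
    return short_stem + suffix
```

Appending a hash of the full name, rather than truncating alone, avoids collisions between two long names that share the same prefix.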

But regardless of the original bug, are you suggesting a per-Task section that captures stderr, for easier readability? Or are you saying it would be nice to see the trains-server log in the UI?

ranrubin commented 4 years ago

Hi @bmartinn, thanks for commenting. I'm not familiar enough with Trains to define what I mean as precisely as you described the situation in your comment. Put simply: as a user of the UI, when an error occurs, I want to see the exact reason for the failure rather than a generic "500" message.
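As a rough illustration of that request, here is a minimal Python sketch that translates a server-side exception into a status code and a readable message instead of a bare 500. The function name `describe_upload_error`, the chosen status codes, and the message wording are all assumptions, not the trains-server implementation; a function like this could be called from something such as a Flask error handler.

```python
import errno


def describe_upload_error(exc: Exception) -> tuple:
    """Hypothetical helper: map an exception to (HTTP status, user-facing message)."""
    if isinstance(exc, OSError):
        if exc.errno == errno.ENAMETOOLONG:
            # The case from the log above: "File name too long".
            return 400, "Upload rejected: file name too long for the server's storage"
        if exc.errno == errno.ENOSPC:
            return 507, "Upload failed: the server's storage is full"
    # Anything unrecognized stays a generic error; details remain in the logs.
    return 500, "Internal server error; see the server logs for details"
```

With this mapping, the failure in the log above would reach the client as a 400 with a "file name too long" message, which the UI could display directly.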