chovanecm / sacredboard

Dashboard for sacred. Monitor and access your past machine learning experiments.
MIT License

Sorting issues / DataTables warning: table id=runs - Ajax error. #70

Open pinae opened 7 years ago

pinae commented 7 years ago

Sacredboard shows this error if I try to edit the filters:

DataTables warning: table id=runs - Ajax error. For more information about this error, please see http://datatables.net/tn/7

This seems to be a problem with missing indexes in MongoDB (as far as I know). I originally got this error when starting Sacredboard, but after I created an index for the start and end dates it only shows up when I change the filters.

I'm using the default settings for a mongodb installation on Ubuntu 17.04. Memory usage for sorting without an index seems to be limited to 32MB in this configuration.

If this is not a bug in Sacredboard, please add some documentation for the correct settings.

chovanecm commented 7 years ago

Hi Johannes, thanks for your report. I've never encountered such a problem, so just to make sure:

Thanks a lot.

pinae commented 7 years ago

Hi, before I created the indices the error occurred every time. But I did not test that systematically because I thought I had done something wrong during the installation.

I can reproduce the error every time I change the sorting by clicking on "Experiment name", "Command" or "Hostname". I remember having the error when deactivating some of the statuses on Friday but I could not reproduce that today.

I tested with Firefox 54.0 and Chromium 59.0.3071.109 on Ubuntu 17.04.

There are only errors for missing files and a server error in the JavaScript console. Here are some screenshots:

[Screenshots: 2017-07-24 11-20-25, 2017-07-24 11-22-23, 2017-07-24 11-23-21]

chovanecm commented 7 years ago

Thanks to your observation, I discovered another minor issue, but that was probably not what caused your problem, which I was unable to reproduce. If you upgrade to the latest Sacredboard version (0.3.1), I think the issue will persist. Nevertheless, when you run sacredboard -m your_db, the program should produce some output on the console, and I'm pretty sure it contains a stack trace describing the cause of the problem. Could you please copy it for me?

Sorry for the inconvenience.

black-puppydog commented 7 years ago

Might this be related to stdout/stderr logging? For me, this happened when I had only a handful of runs stored, but with long outputs.

I get something like this:

pymongo.errors.OperationFailure: database error: Runner error: Overflow sort stage buffered data usage of 33836427 bytes exceeds internal limit of 33554432 bytes

Which looks to be related to this

pinae commented 7 years ago

Sorry for my late reply. I get this error on the console:

[2017-08-24 15:49:10,529] ERROR in app: Exception on /api/run [GET]
Traceback (most recent call last):
  File "/home/jme/Code/LSTM-Classification-CPU/env/lib/python3.5/site-packages/flask/app.py", line 1982, in wsgi_app
    response = self.full_dispatch_request()
  File "/home/jme/Code/LSTM-Classification-CPU/env/lib/python3.5/site-packages/flask/app.py", line 1614, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/home/jme/Code/LSTM-Classification-CPU/env/lib/python3.5/site-packages/flask/app.py", line 1517, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/home/jme/Code/LSTM-Classification-CPU/env/lib/python3.5/site-packages/flask/_compat.py", line 33, in reraise
    raise value
  File "/home/jme/Code/LSTM-Classification-CPU/env/lib/python3.5/site-packages/flask/app.py", line 1612, in full_dispatch_request
    rv = self.dispatch_request()
  File "/home/jme/Code/LSTM-Classification-CPU/env/lib/python3.5/site-packages/flask/app.py", line 1598, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/home/jme/Code/LSTM-Classification-CPU/env/lib/python3.5/site-packages/sacredboard/app/webapi/routes.py", line 41, in api_runs
    return get_runs()
  File "/home/jme/Code/LSTM-Classification-CPU/env/lib/python3.5/site-packages/sacredboard/app/webapi/runs.py", line 53, in get_runs
    recordsFiltered=records_filtered),
  File "/home/jme/Code/LSTM-Classification-CPU/env/lib/python3.5/site-packages/flask/templating.py", line 134, in render_template
    context, ctx.app)
  File "/home/jme/Code/LSTM-Classification-CPU/env/lib/python3.5/site-packages/flask/templating.py", line 116, in _render 
    rv = template.render(context)
  File "/home/jme/Code/LSTM-Classification-CPU/env/lib/python3.5/site-packages/jinja2/environment.py", line 1008, in render
    return self.environment.handle_exception(exc_info, True)
  File "/home/jme/Code/LSTM-Classification-CPU/env/lib/python3.5/site-packages/jinja2/environment.py", line 780, in handle_exception
    reraise(exc_type, exc_value, tb)
  File "/home/jme/Code/LSTM-Classification-CPU/env/lib/python3.5/site-packages/jinja2/_compat.py", line 37, in reraise
    raise value.with_traceback(tb)
  File "/home/jme/Code/LSTM-Classification-CPU/env/lib/python3.5/site-packages/sacredboard/templates/api/runs.js", line 7, in top-level template code
    {%- for run in runs -%}
  File "/home/jme/Code/LSTM-Classification-CPU/env/lib/python3.5/site-packages/jinja2/runtime.py", line 410, in __init__  
    self._after = self._safe_next()
  File "/home/jme/Code/LSTM-Classification-CPU/env/lib/python3.5/site-packages/jinja2/runtime.py", line 430, in _safe_next
    return next(self._iterator)
  File "/home/jme/Code/LSTM-Classification-CPU/env/lib/python3.5/site-packages/pymongo/cursor.py", line 1132, in next
    if len(self.__data) or self._refresh():
  File "/home/jme/Code/LSTM-Classification-CPU/env/lib/python3.5/site-packages/pymongo/cursor.py", line 1055, in _refresh 
    self.__collation))
  File "/home/jme/Code/LSTM-Classification-CPU/env/lib/python3.5/site-packages/pymongo/cursor.py", line 947, in __send_message
    helpers._check_command_response(doc['data'][0])
  File "/home/jme/Code/LSTM-Classification-CPU/env/lib/python3.5/site-packages/pymongo/helpers.py", line 210, in _check_command_response
    raise OperationFailure(msg % errmsg, code, response)
pymongo.errors.OperationFailure: Executor error during find command: OperationFailed: Sort operation used more than the maximum 33554432 bytes of RAM. Add an index, or specify a smaller limit.

My outputs are also pretty long because my code displays progress bars.

chovanecm commented 7 years ago

Thanks for posting the output! It really does seem to be related to the issue that @black-puppydog posted. This needs further analysis to decide how to handle the problem: whether Sacredboard should automatically add indices on the columns in the table up front, or only after the exception is thrown. I'll have a look at it after I finish the feature I have been working on recently (deleting experiments). I'm sorry for the inconvenience until then.

trickmeyer commented 6 years ago

@pinae I'm pretty sure it's a CORS error. Check out the datatables.net link they provide and read more about it for different solutions. A quick test to see if this is the case is to open the web app from a different computer on the same network (swapping 127.0.0.1 for xxx.x.x.xx, whatever your local IP is).

enricoschroeder commented 6 years ago

Hey there, I'm frequently stumbling upon this issue as well. I assume that it happens when the stored log output of the experiment is very long (e.g. training a model with Tensorflow for a couple of days).

Any updates on a potential fix?

pinae commented 6 years ago

@schroederen Try adding some indices. I added some and it fixed the problem. I failed to write down what exactly I did, and only realized after I reported the issue that it would have been useful.

enricoschroeder commented 6 years ago

@pinae Thanks for the reply. For what key did you create the indices? i.e. which parameters did you use for db.collections.createIndex()?

enricoschroeder commented 6 years ago

In the meantime, one can increase the limit of the sort buffer, as described here. I increased it to 50MB (from the default of 32MB) and this fixes the issues I'm having. However, I expect to run out of buffer again eventually, so adding indices might be the more elegant solution.
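
For readers who cannot follow the link, here is a minimal sketch of raising the limit from Python. It rests on assumptions that are not stated in this thread: that the server is a MongoDB version older than 4.4, where the in-memory sort limit is controlled by the server parameter `internalQueryExecMaxBlockingSortBytes`, and that pymongo is used to connect. The helper names `sort_limit_command` and `raise_sort_limit` are made up for illustration.

```python
def sort_limit_command(megabytes):
    """Build the admin command that sets the in-memory sort limit.

    Assumption: MongoDB < 4.4, where the server parameter is called
    internalQueryExecMaxBlockingSortBytes.
    """
    if megabytes <= 0:
        raise ValueError("limit must be positive")
    return {
        "setParameter": 1,  # the command name has to be the first key
        "internalQueryExecMaxBlockingSortBytes": megabytes * 1024 * 1024,
    }


def raise_sort_limit(client, megabytes=50):
    """Send the command to the admin database of a pymongo MongoClient."""
    return client.admin.command(sort_limit_command(megabytes))


# Usage against a real server (needs pymongo and admin rights):
#   from pymongo import MongoClient
#   raise_sort_limit(MongoClient("mongodb://localhost:27017"), 50)
```

A parameter set this way at runtime is normally lost when the server restarts, so this is a stopgap rather than a permanent fix.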

thomwolf commented 6 years ago

I created an index for each column in the board and it fixed the problem for me:

To create an index see https://docs.mongodb.com/manual/indexes/

enricoschroeder commented 6 years ago

Did this and it fixed the issue for a while, but it has returned. Additionally, I'm getting other weird issues now: some experiments no longer show up, and sorting by ID does not work correctly anymore. Maybe it wasn't such a good idea to create indices for all columns, or I didn't do it correctly? :D

@chovanecm Is there any "official" fix incoming? Would be greatly appreciated!

anibali commented 6 years ago

For those that don't know how to create an index (like I didn't), you can use createIndex in the Mongo CLI. So to add a heartbeat index I did the following:

> use sacred
switched to db sacred
> db.runs.createIndex({ "heartbeat": -1 });
{
    "createdCollectionAutomatically" : false,
    "numIndexesBefore" : 1,
    "numIndexesAfter" : 2,
    "ok" : 1
}

chovanecm commented 6 years ago

I am considering letting Sacredboard automatically create indices for the displayed columns. I am just afraid of what happens if I implement #24 (adding custom columns).
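
A minimal sketch of what such automatic index creation could look like; this is not Sacredboard's actual code. The field list is taken from the index commands posted in this thread, and whether these are exactly the columns Sacredboard sorts on is an assumption. `ensure_sort_indexes` and `SORTABLE_FIELDS` are hypothetical names; `collection` is expected to behave like a pymongo `Collection`.

```python
# Fields taken from the index commands posted in this thread; the exact
# set of sortable columns in Sacredboard is an assumption.
SORTABLE_FIELDS = (
    "result",
    "experiment.name",
    "command",
    "host.hostname",
    "start_time",
    "heartbeat",
)

DESCENDING = -1  # same value as pymongo.DESCENDING


def ensure_sort_indexes(collection, fields=SORTABLE_FIELDS):
    """Create one single-field descending index per field.

    `collection` should be a pymongo Collection, for example
    MongoClient()["sacred"]["runs"]. Creating an index that already
    exists with the same definition is a no-op, so calling this on
    every startup should be harmless.
    """
    return [collection.create_index([(field, DESCENDING)]) for field in fields]
```

Custom columns (#24) would not be covered by a fixed list like this; they would need either an index created on demand or a fallback that reports the failed sort to the user.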

enricoschroeder commented 6 years ago

I added indices for all possible entries, but I'm still getting this error on some columns. I've managed to get a trace from the console:

[2018-06-08 09:20:09,991] ERROR in app: Exception on /api/run [GET]
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/flask/app.py", line 2292, in wsgi_app
    response = self.full_dispatch_request()
  File "/usr/local/lib/python3.5/dist-packages/flask/app.py", line 1815, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/usr/local/lib/python3.5/dist-packages/flask/app.py", line 1718, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/usr/local/lib/python3.5/dist-packages/flask/_compat.py", line 35, in reraise
    raise value
  File "/usr/local/lib/python3.5/dist-packages/flask/app.py", line 1813, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/local/lib/python3.5/dist-packages/flask/app.py", line 1799, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/usr/local/lib/python3.5/dist-packages/sacredboard/app/webapi/runs.py", line 16, in api_runs
    return get_runs()
  File "/usr/local/lib/python3.5/dist-packages/sacredboard/app/webapi/runs.py", line 94, in get_runs
    recordsFiltered=records_filtered),
  File "/usr/local/lib/python3.5/dist-packages/flask/templating.py", line 135, in render_template
    context, ctx.app)
  File "/usr/local/lib/python3.5/dist-packages/flask/templating.py", line 117, in _render
    rv = template.render(context)
  File "/usr/local/lib/python3.5/dist-packages/jinja2/environment.py", line 1008, in render
    return self.environment.handle_exception(exc_info, True)
  File "/usr/local/lib/python3.5/dist-packages/jinja2/environment.py", line 780, in handle_exception
    reraise(exc_type, exc_value, tb)
  File "/usr/local/lib/python3.5/dist-packages/jinja2/_compat.py", line 37, in reraise
    raise value.with_traceback(tb)
  File "/usr/local/lib/python3.5/dist-packages/sacredboard/templates/api/runs.js", line 13, in top-level template code
    "is_alive": {{run.heartbeat | default | timediff | detect_alive_experiment | tojson }},
  File "/usr/local/lib/python3.5/dist-packages/sacredboard/app/config/jinja_filters.py", line 28, in timediff
    diff = now - time
TypeError: unsupported operand type(s) for -: 'datetime.datetime' and 'NoneType'

This looks like a different error, but it happens when trying to sort entries by some of the columns.
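
The TypeError says that `timediff` received `None`, which suggests that at least one run has no usable `heartbeat` value. A hedged sketch of a filter that tolerates this follows. It is not the code from `jinja_filters.py`: the name `safe_timediff`, the return value in seconds and the choice to pass `None` through are all assumptions made for illustration.

```python
from datetime import datetime, timezone


def safe_timediff(time, now=None):
    """Return the seconds elapsed since `time`, or None if `time` is missing.

    Assumption: `time` is a naive UTC datetime, which is how the
    heartbeat appears to be stored. The real filter may return a
    different type.
    """
    if time is None:
        return None  # a run without a heartbeat has no meaningful age
    if now is None:
        now = datetime.now(timezone.utc).replace(tzinfo=None)
    return (now - time).total_seconds()
```

Any filter applied afterwards (here `detect_alive_experiment`) would then also have to accept `None`, for example by treating such a run as not alive.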

JarnoRFB commented 6 years ago

@anibali I had the same problem. Adding the index you described immediately resolved the problem. So probably letting sacredboard do this automatically is a good idea.

Leinadj commented 6 years ago

Same here, adding the indices worked like magic! So just to make it easier for copy pasting:

  1. Open mongo shell (if added to PATH, just type: mongo into the shell)
  2. Switch to your database: use <databasename>
  3. Issue the following commands to create the indices:
       db.runs.createIndex({ "result": -1 });
       db.runs.createIndex({ "experiment.name": -1 });
       db.runs.createIndex({ "command": -1 });
       db.runs.createIndex({ "host.hostname": -1 });
       db.runs.createIndex({ "start_time": -1 });
       db.runs.createIndex({ "heartbeat": -1 });

SumNeuron commented 5 years ago

Hi, I am getting this issue when using the FileObserver...

me:~/Projects/sacred_test$ sacredboard -F experiments/tests/
[2019-03-01 18:57:28,825] ERROR in app: Exception on /api/run [GET]
Traceback (most recent call last):
  File "/anaconda3/lib/python3.6/site-packages/flask/app.py", line 2292, in wsgi_app
    response = self.full_dispatch_request()
  File "/anaconda3/lib/python3.6/site-packages/flask/app.py", line 1815, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/anaconda3/lib/python3.6/site-packages/flask/app.py", line 1718, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/anaconda3/lib/python3.6/site-packages/flask/_compat.py", line 35, in reraise
    raise value
  File "/anaconda3/lib/python3.6/site-packages/flask/app.py", line 1813, in full_dispatch_request
    rv = self.dispatch_request()
  File "/anaconda3/lib/python3.6/site-packages/flask/app.py", line 1799, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/anaconda3/lib/python3.6/site-packages/sacredboard/app/webapi/runs.py", line 16, in api_runs
    return get_runs()
  File "/anaconda3/lib/python3.6/site-packages/sacredboard/app/webapi/runs.py", line 94, in get_runs
    recordsFiltered=records_filtered),
  File "/anaconda3/lib/python3.6/site-packages/flask/templating.py", line 135, in render_template
    context, ctx.app)
  File "/anaconda3/lib/python3.6/site-packages/flask/templating.py", line 117, in _render
    rv = template.render(context)
  File "/anaconda3/lib/python3.6/site-packages/jinja2/asyncsupport.py", line 76, in render
    return original_render(self, *args, **kwargs)
  File "/anaconda3/lib/python3.6/site-packages/jinja2/environment.py", line 1008, in render
    return self.environment.handle_exception(exc_info, True)
  File "/anaconda3/lib/python3.6/site-packages/jinja2/environment.py", line 780, in handle_exception
    reraise(exc_type, exc_value, tb)
  File "/anaconda3/lib/python3.6/site-packages/jinja2/_compat.py", line 37, in reraise
    raise value.with_traceback(tb)
  File "/anaconda3/lib/python3.6/site-packages/sacredboard/templates/api/runs.js", line 7, in top-level template code
    {%- for run in runs -%}
  File "/anaconda3/lib/python3.6/site-packages/jinja2/runtime.py", line 435, in __init__
    self._after = self._safe_next()
  File "/anaconda3/lib/python3.6/site-packages/jinja2/runtime.py", line 455, in _safe_next
    return next(self._iterator)
  File "/anaconda3/lib/python3.6/site-packages/sacredboard/app/data/filestorage/rundao.py", line 41, in run_iterator
    yield self.get(id)
  File "/anaconda3/lib/python3.6/site-packages/sacredboard/app/data/filestorage/rundao.py", line 60, in get
    run = _read_json(_path_to_run(self.directory, run_id))
  File "/anaconda3/lib/python3.6/site-packages/sacredboard/app/data/filestorage/rundao.py", line 101, in _read_json
    return json.load(f)
  File "/anaconda3/lib/python3.6/json/__init__.py", line 299, in load
    parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
  File "/anaconda3/lib/python3.6/json/__init__.py", line 354, in loads
    return _default_decoder.decode(s)
  File "/anaconda3/lib/python3.6/json/decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/anaconda3/lib/python3.6/json/decoder.py", line 357, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

SacredBoard shows:

DataTables warning: table id=runs - Ajax error. For more information about this error, please see http://datatables.net/tn/7
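
The message "Expecting value: line 1 column 1 (char 0)" means the JSON parser found no value at the very first character, which is what an empty file produces. A small sketch for locating such files is given below. It assumes a directory layout of one sub-directory per run with JSON files inside it (for example `experiments/tests/<run_id>/run.json`); that layout and the helper name `find_unreadable_json` are assumptions, not something confirmed in this thread.

```python
import json
from pathlib import Path


def find_unreadable_json(base_dir):
    """Return (path, reason) pairs for JSON files that fail to parse.

    Scans base_dir/<run_id>/*.json, the layout assumed here for the
    file storage directory passed to `sacredboard -F`.
    """
    problems = []
    for path in sorted(Path(base_dir).glob("*/*.json")):
        try:
            with path.open() as f:
                json.load(f)
        except (OSError, ValueError) as exc:  # JSONDecodeError is a ValueError
            problems.append((path, str(exc)))
    return problems


# Usage:
#   for path, reason in find_unreadable_json("experiments/tests"):
#       print(path, "->", reason)
```

If this lists a file belonging to a run that is still in progress or was interrupted, that run is a likely candidate for the failure above.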