fossasia / visdom

A flexible tool for creating, organizing, and sharing visualizations of live, rich data. Supports Torch and Numpy.

Window not created when initial values are NaN / Inf #732

Open jramapuram opened 4 years ago

jramapuram commented 4 years ago

Bug Description

When using NVIDIA Apex for fp16 training, certain values that are initially Inf/NaN can later become finite through the dynamic quantization employed.

Eg @ initial training:

train-0[Epoch 1][182528 samples][1361.00 sec]:   Loss: 16012.5879       -ELBO: 16017.5146       NLL: 15677.4766 KLD: 340.0453   MI: 0.0000
test-0[Epoch 1][19968 samples][7.28 sec]:        Loss: inf      -ELBO: inf      NLL: 13485.9990 KLD: inf        MI: 0.0000

Eg @ later training:

train-0[Epoch 21][182656 samples][1364.02 sec]:  Loss: 198.0622 -ELBO: 224.9671 NLL: 173.3995   KLD: 51.5676    MI: 0.0000
test-0[Epoch 21][19968 samples][7.32 sec]:       Loss: 167.9006 -ELBO: 203.4540 NLL: 146.3381   KLD: 57.1159    MI: 0.0000

Reproduction Steps

  1. Create a new window (vis.line) and initially send NaN / Inf values
  2. Try updating the values with scalars later (update="append")
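
The two steps above can be sketched as a small script. This is a minimal sketch rather than the reporter's actual code: the helper name `reproduce`, the window title, and the sample values are assumptions chosen for illustration, and it assumes `vis` behaves like a `visdom.Visdom` client connected to a running server.

```python
import numpy as np


def reproduce(vis, title="test_loss"):
    """Create a line plot whose first point is Inf, then append a finite point.

    `vis` is expected to behave like a visdom.Visdom client; only its
    `line` method is used here.
    """
    # Step 1: create the window with a non-finite initial value.
    win = vis.line(X=np.array([1.0]), Y=np.array([np.inf]), opts=dict(title=title))
    # Step 2: append an ordinary scalar to the same window.
    vis.line(X=np.array([2.0]), Y=np.array([167.9006]), win=win, update="append")
    return win


# Usage (needs a running visdom server):
#   import visdom
#   reproduce(visdom.Visdom())
```

Passing the client in as an argument keeps the sketch independent of any particular server address or environment name.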

Expected behavior

The plot is created correctly and updated. Currently the window fails to get created.

JackUrb commented 4 years ago

When the window fails to get created, is anything dumped on either the python server running visdom or the javascript console trying to view your results?

jramapuram commented 4 years ago

Will collect some data and post back!

jramapuram commented 4 years ago

Nothing weird in the server logs (I didn't try verbose yet, but can if needed), but cat-ing the JSON might give a clue: some of the fields below contain e.g. "y": [Infinity]. Can visdom handle this?

"caption": null}], "selected": 0, "type": "image_history", "show_slider": true, "i": 9}, "window_386d6be8938340": {"command": "window", "id": "window_386d6be8938340", "title": "test_loss", "inflate": true, "width": null, "height": null, "contentID": "b41bebb5-a6b9-427a-9281-a77d014fb633", "content": {"data": [{"x": [1.0], "y": [Infinity], "name": "test_loss", "type": "scatter", "mode": "lines", "textposition": "right", "line": {}, "marker": {"size": 10, "symbol": "dot", "line": {"color": "#000000", "width": 0.5}}}], "layout": {"showlegend": false, "title": "test_loss", "margin": {"l": 60, "r": 60, "t": 60, "b": 60}, "xaxis": {"title": "epoch"}, "yaxis": {"title": "loss"}}}, "type": "plot", "i": 10}, "window_386d6be89d7660": {"command": "window", "id": "window_386d6be89d7660", "title": "test_elbo", "inflate": true, "width": null, "height": null, "contentID": "623afba0-7c13-4976-96f5-ff36aaca63c0", "content": {"data": [{"x": [1.0], "y": [Infinity], "name": "test_elbo", "type": "scatter", "mode": "lines", "textposition": "right", "line": {}, "marker": {"size": 10, "symbol": "dot", "line": {"color": "#000000", "width": 0.5}}}], "layout": {"showlegend": false, "title": "test_elbo", "margin": {"l": 60, "r": 60, "t": 60, "b": 60}, "xaxis": {"title": "epoch"}, "yaxis": {"title": "elbo"}}}, "type": "plot", "i": 11}, "window_386d6be8ab8484": {"command": "window", "id": "window_386d6be8ab8484", "title": "test_nll", "inflate": true, "width": null, "height": null, "contentID": "1882d0f6-28e3-4361-a906-4f76f289bd0a", "content": {"data": [{"x": [1.0], "y": [2775.734375], "name": "test_nll", "type": "scatter", "mode": "lines", "textposition": "right", "line": {}, "marker": {"size": 10, "symbol": "dot", "line": {"color": "#000000", "width": 0.5}}}], "layout": {"showlegend": false, "title": "test_nll", "margin": {"l": 60, "r": 60, "t": 60, "b": 60}, "xaxis": {"title": "epoch"}, "yaxis": {"title": "nll"}}}, "type": "plot", "i": 12}, "window_386d6be8b57370": {"command": "window", 
"id": "window_386d6be8b57370", "title": "test_kld", "inflate": true, "width": null, "height": null, "contentID": "95dd04dc-3d3c-4839-8f12-a6a43701614f", "content": {"data": [{"x": [1.0], "y": [Infinity], "name": "test_kld", "type": "scatter", "mode": "lines", "textposition": "right", "line": {}, "marker": {"size": 10, "symbol": "dot", "line": {"color": "#000000", "width": 0.5}}}], "layout": {"showlegend": false, "title": "test_kld", "margin": {"l": 60, "r": 60, "t": 60, "b": 60}, "xaxis": {"title": "epoch"}, "yaxis": {"title": "kld"}}}, "type": "plot", "i": 13}, "window_386d6be8bf5aec": {"command": "window", "id": "window_386d6be8bf5aec", "title": "test_kl-beta", "inflate": true, "width": null, "height": null, "contentID": "3e1c987d-411c-426c-a619-b966eb8e4e06", "content": {"data": [{"x": [1.0], "y": [0.9568005932552583], "name": "test_kl-beta", "type": "scatter", "mode": "lines", "textposition": "right", "line": {}, "marker": {"size": 10, "symbol": "dot", "line": {"color": "#000000", "width": 0.5}}}], "layout": {"showlegend": false, "title": "test_kl-beta", "margin": {"l": 60, "r": 60, "t": 60, "b": 60}, "xaxis": {"title": "epoch"}, "yaxis": {"title": "kl-beta"}}}, "type": "plot", "i": 14}, "window_386d6be8cb43d2": {"command": "window", "id": "window_386d6be8cb43d2", "title": "test_proxy", "inflate": true, "width": null, "height": null, "contentID": "94f93b0e-9d8d-4e20-9009-2ca9d8cd64cc", "content": {"data": [{"x": [1.0], "y": [0.0], "name": "test_proxy", "type": "scatter", "mode": "lines", "textposition": "right", "line": {}, "marker": {"size": 10, "symbol": "dot", "line": {"color": "#000000", "width": 0.5}}}], "layout": {"showlegend": false, "title": "test_proxy", "margin": {"l": 60, "r": 60, "t": 60, "b": 60}, "xaxis": {"title": "epoch"}, "yaxis": {"title": "proxy"}}}, "type": "plot", "i": 15}, "window_386d6be8d4763a": {"command": "window", "id": "window_386d6be8d4763a", "title": "test_mut-info", "inflate": true, "width": null, "height": null, "contentID": 
"ec7198d6-2ba8-46c1-a70e-6a5eb490de8b", "content": {"data": [{"x": [1.0], "y": [0.0], "name": "test_mut-info", "type": "scatter", "mode": "lines", "textposition": "right", "line": {}, "marker": {"size": 10, "symbol": "dot", "line": {"color": "#000000", "width": 0.5}}}], "layout": {"showlegend": false, "title": "test_mut-info", "margin": {"l": 60, "r": 60, "t": 60, "b": 60}, "xaxis": {"title": "epoch"}, "yaxis": {"title": "mut-info"}}}, "type": "plot", "i": 16}, "test_input": {"command": "window", "id": "test_input", "title": "test_input", "inflate": true, "width": 274, "height": 274, "contentID":

JackUrb commented 4 years ago

We use plotly as our underlying graph renderer, so if it's getting this far but not displaying I'd expect them to throw an error in the js console.

jramapuram commented 4 years ago

@JackUrb : yup you were right:

Uncaught SyntaxError: Unexpected number
:8097/extensions/MathZoom.js?V=2.7.1:2 Uncaught SyntaxError: Unexpected number
2neuralnetworkart.com/:1 Uncaught SyntaxError: Unexpected token I in JSON at position 216
    at JSON.parse (<anonymous>)
    at WebSocket.e._handleMessage (main.js?v=6c8a353abc101b0e2c1cf327726f1caa:48)
neuralnetworkart.com/:1 Uncaught SyntaxError: Unexpected token I in JSON at position 215
    at JSON.parse (<anonymous>)
    at WebSocket.e._handleMessage (main.js?v=6c8a353abc101b0e2c1cf327726f1caa:48)
3neuralnetworkart.com/:1 Uncaught SyntaxError: Unexpected token I in JSON at position 750
    at JSON.parse (<anonymous>)
    at WebSocket.e._handleMessage (main.js?v=6c8a353abc101b0e2c1cf327726f1caa:48)
2neuralnetworkart.com/:1 Uncaught SyntaxError: Unexpected token I in JSON at position 221
    at JSON.parse (<anonymous>)
    at WebSocket.e._handleMessage (main.js?v=6c8a353abc101b0e2c1cf327726f1caa:48)
neuralnetworkart.com/:1 Uncaught SyntaxError: Unexpected token I in JSON at position 220
    at JSON.parse (<anonymous>)
    at WebSocket.e._handleMessage (main.js?v=6c8a353abc101b0e2c1cf327726f1caa:48)
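
The "Unexpected token I" errors are consistent with a bare `Infinity` token reaching a strict JSON parser. The sketch below illustrates this using only Python's standard `json` module. It does not use visdom's own serialization code, and the helper names `reject_constant` and `strict_loads` are hypothetical. Python's encoder writes `Infinity` by default, and a parser that refuses that token fails on it, much as the browser's `JSON.parse` does.

```python
import json


def reject_constant(name):
    # json.loads calls this hook for the tokens "NaN", "Infinity" and "-Infinity".
    raise ValueError(f"{name} is not valid JSON")


def strict_loads(text):
    """Parse JSON while refusing the non-standard NaN/Infinity tokens."""
    return json.loads(text, parse_constant=reject_constant)


payload = {"x": [1.0], "y": [float("inf")]}
# With the default allow_nan=True, non-finite floats are written as bare tokens.
encoded = json.dumps(payload)
print(encoded)  # {"x": [1.0], "y": [Infinity]}

try:
    strict_loads(encoded)
except ValueError as err:
    print("strict parse failed:", err)
```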

JackUrb commented 4 years ago

Seems there needs to be a layer on either the JSON encoder from the server (probably easier) or the JSON decoder in the javascript to be able to handle the infinity conversion.
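
One way such a server-side layer could look is sketched below. This is an illustration under stated assumptions, not visdom's actual fix: the function names `replace_non_finite` and `dumps_strict` are hypothetical, and replacing non-finite values with `null` assumes the plotting front end treats `null` as a missing point.

```python
import json
import math


def replace_non_finite(obj):
    """Return a copy of obj with NaN/Inf floats replaced by None."""
    if isinstance(obj, float) and not math.isfinite(obj):
        return None
    if isinstance(obj, dict):
        return {key: replace_non_finite(value) for key, value in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [replace_non_finite(value) for value in obj]
    return obj


def dumps_strict(obj):
    # allow_nan=False makes json.dumps raise if a non-finite float slips through.
    return json.dumps(replace_non_finite(obj), allow_nan=False)


print(dumps_strict({"x": [1.0, 2.0], "y": [float("inf"), 167.9006]}))
```

The trade-off of this approach is that Inf and NaN become indistinguishable once encoded. A layer in the JavaScript decoder could preserve the distinction, but it would have to parse the non-standard tokens itself.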