fossasia / visdom

A flexible tool for creating, organizing, and sharing visualizations of live, rich data. Supports Torch and Numpy.
Apache License 2.0

How can I setup Visdom on a remote server using slurm? #828

Open neuronphysics opened 2 years ago

neuronphysics commented 2 years ago

I want to use Visdom to visualize the results of a deep learning model that I have been training on a remote cluster. First, do I need a special command line to connect to the cluster via ssh in order to see the Visdom plots?

In my Slurm script I run the following command: python -u script.py --visdom_server "http://ncc1.clients.dur.ac.uk" --visdom_port 8098, and my Python script contains:

#Plotting on remote server
import visdom
cfg = {"server": "ncc1.clients.dur.ac.uk",
       "port": 8098}
vis = visdom.Visdom('http://' + cfg["server"], port = cfg["port"])

win = None

def update_viz(epoch, loss, title):
    global win

    if win is None:
        title = title

        win = viz.line(
            X=np.array([epoch]),
            Y=np.array([loss]),
            win=title,
            opts=dict(
                title=title,
                fillarea=True
            )
        )
    else:
        viz.line(
            X=np.array([epoch]),
            Y=np.array([loss]),
            win=win,
            update='append'
        )

I got this error:

requests.exceptions.InvalidURL: Failed to parse: http://http::8098/env/main
Port could not be cast to integer value as ':8098'
on_close() takes 1 positional argument but 3 were given
Port could not be cast to integer value as ':8098'
on_close() takes 1 positional argument but 3 were given
Port could not be cast to integer value as ':8098'
on_close() takes 1 positional argument but 3 were given
Visdom python client failed to establish socket to get messages from the server. This feature is optional and can be disabled by initializing Visdom with `use_incoming_socket=False`, which will prevent waiting for this request to timeout.
Port could not be cast to integer value as ':8098'
on_close() takes 1 positional argument but 3 were given
Port could not be cast to integer value as ':8098'
on_close() takes 1 positional argument but 3 were given
Port could not be cast to integer value as ':8098'
on_close() takes 1 positional argument but 3 were given
Port could not be cast to integer value as ':8098'
on_close() takes 1 positional argument but 3 were given
script.py:41: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  params['w'].append(nn.Parameter(torch.tensor(Normal(torch.zeros(n_in, n_out), std * torch.ones(n_in, n_out)).rsample(), requires_grad=True, device=device)))
Port could not be cast to integer value as ':8098'
on_close() takes 1 positional argument but 3 were given
script.py:42: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  params['b'].append(nn.Parameter(torch.tensor(torch.mul(bias_init, torch.ones([n_out,])), requires_grad=True, device=device)))
script.py:292: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  return torch.exp(torch.lgamma(torch.tensor(a, dtype=torch.float, requires_grad=True).to(device=local_device)) + torch.lgamma(torch.tensor(b, dtype=torch.float, requires_grad=True).to(device=local_device)) - torch.lgamma(torch.tensor(a+b, dtype=torch.float, requires_grad=True).to(device=local_device)))
script.py:679: UserWarning: This overload of add_ is deprecated:
    add_(Number alpha, Tensor other)
Consider using one of the following signatures instead:
    add_(Tensor other, *, Number alpha) (Triggered internally at  /opt/conda/conda-bld/pytorch_1631630815121/work/torch/csrc/utils/python_arg_parser.cpp:1025.)
  exp_avg.mul_(beta1).add_(1 - beta1, grad)
Port could not be cast to integer value as ':8098'
on_close() takes 1 positional argument but 3 were given
Traceback (most recent call last):
  File "script.py", line 871, in <module>
    update_viz(epoch, elbo2.item(),' Loss by Epoch')
  File "script.py", line 736, in update_viz
    win = viz.line(
NameError: name 'viz' is not defined

How can I run my plotting script on a remote server? Is there any way to do this? Thanks.

JackUrb commented 2 years ago

Hi @neuronphysics, one way to manage this kind of setup is with an ssh tunnel, so that you can still log to localhost at the port you tunnel. This isn't required to get a remote server working, but it does make the semantics equivalent to running the server and the plotting script on the same machine.
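As a rough illustration of the tunnel idea, the sketch below only builds the ssh command line for a local port forward. The helper name `tunnel_command` and the `user@` login are hypothetical, and it assumes the Visdom server is listening on port 8098 on the machine you ssh into; on a Slurm cluster the server may instead live on a compute node, in which case `remote_host` would need to be that node's name.

```python
def tunnel_command(login_host, remote_host="localhost",
                   remote_port=8098, local_port=8098):
    """Build an ssh command that forwards local_port to remote_host:remote_port.

    Hypothetical helper for illustration only.
    -N: do not run a remote command, just hold the tunnel open.
    -L: forward a local port through the ssh connection.
    """
    forward = f"{local_port}:{remote_host}:{remote_port}"
    return ["ssh", "-N", "-L", forward, login_host]


if __name__ == "__main__":
    # "user" is a placeholder login; the hostname is the one from the question.
    cmd = tunnel_command("user@ncc1.clients.dur.ac.uk")
    print(" ".join(cmd))
```

With a tunnel like this open on your own machine, pointing a browser at http://localhost:8098 should reach the remote Visdom server, assuming the server is actually running there.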

That being said, it seems something isn't quite right with your underlying setup:

Failed to parse: http://http::8098/env/main

You can see here how we parse the incoming domain and configuration details: https://github.com/fossasia/visdom/blob/026958a66ce743f59e8f5232e974138c76b31675/py/visdom/__init__.py#L392-L405

It might be worthwhile to add some print statements to understand why we end up with http://http::8098/env/main as the final address, rather than the http://ncc1.clients.dur.ac.uk:8098/env/main you may expect.
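One way to do that kind of check outside of Visdom is to run candidate addresses through the standard library's URL parser. This is a minimal sketch, not Visdom's own parsing code: `describe_url` is a hypothetical helper, and it only shows that a URL shaped like the one in the error has a port that cannot be read as an integer, while the expected URL parses cleanly.

```python
from urllib.parse import urlsplit


def describe_url(url):
    """Return (hostname, port) for url.

    Hypothetical diagnostic helper. Accessing .port raises ValueError
    when the text after the first ':' in the host part is not an integer.
    """
    parts = urlsplit(url)
    return parts.hostname, parts.port


if __name__ == "__main__":
    candidates = [
        "http://ncc1.clients.dur.ac.uk:8098/env/main",  # the address you expect
        "http://http::8098/env/main",                   # the address from the error
    ]
    for url in candidates:
        try:
            print(url, "->", describe_url(url))
        except ValueError as err:
            print(url, "-> invalid:", err)
```

Printing the server and port values right before they are passed to visdom.Visdom(...) and feeding the combined string to a check like this should show where the stray "http:" comes from. Separately, the traceback ends in a NameError because the client in the posted snippet is bound to `vis` while `update_viz` calls `viz.line`; using a single name for the client would remove that error.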