TorchStudio / torchstudio

IDE for PyTorch and its ecosystem
https://torchstudio.ai
MIT License

Issues with remote server connection #4

Closed cwoolfo1 closed 2 years ago

cwoolfo1 commented 2 years ago

I have added a server to my remote servers list and it was successful. However, when I go to train a model I get an error telling me it could not connect to the server.

I am connecting to this remote server through a VPN. Could that be the thing causing this issue?

divideconcept commented 2 years ago

I don't think so, since TorchStudio is able to detect it. Please do the following:

  1. Go to the servers list (Menu > Settings), select your remote server, click Refresh, and paste its final status here (green or red text). Then click OK to return to the main interface.
  2. In the Dataset tab, select torchvision.datasets and MNIST at the top. At the bottom, select Local and click Load.
  3. Create a Model tab, then select torchstudio.models and MNISTClassifier at the top. At the bottom, click Build.
  4. At the bottom of the Hyperparameters panel, select a device from your remote server and click Train. Please paste any error status here.

cwoolfo1 commented 2 years ago

Previously it was fine, but now it gets stuck on "Listing devices..." when I try to refresh the server.

divideconcept commented 2 years ago

OK, do you know what hardware this remote server has? Specifically, do you know if there's an NVIDIA GPU on it? Is it a custom remote server, or a server from AWS/Azure/Google Cloud?

Here's what happens on the remote server when it says "Listing devices...":

    import sys
    import torch

    print("Listing devices...\n", file=sys.stderr)

    # The CPU is always listed; each CUDA device is then queried by index.
    devices = {}
    devices['cpu'] = {'name': 'CPU', 'pin_memory': False}
    for i in range(torch.cuda.device_count()):
        devices['cuda:'+str(i)] = {'name': torch.cuda.get_device_name(i), 'pin_memory': True}

As you can see it tries to scan CUDA devices. Your remote server is likely stuck at this point.
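
One way to check whether CUDA enumeration is what hangs is to run it in a child process with a timeout, directly on the remote server. The sketch below is not part of TorchStudio: the `probe` helper and the 30-second limit are assumptions chosen for illustration, and it assumes PyTorch is installed in the Python environment that runs it.

```python
import subprocess
import sys

# Hypothetical helper (not TorchStudio code): run a Python snippet in a child
# process so that a hang in the snippet cannot block the calling script.
def probe(snippet, timeout_s=30.0):
    """Return (status, output), where status is 'ok', 'error' or 'timeout'."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", snippet],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # The child was killed after timeout_s seconds without finishing.
        return ("timeout", "")
    if result.returncode != 0:
        return ("error", result.stderr.strip())
    return ("ok", result.stdout.strip())

# The same call the device listing relies on.
CUDA_SNIPPET = "import torch; print(torch.cuda.device_count())"

if __name__ == "__main__":
    status, output = probe(CUDA_SNIPPET)
    print(status, output)
```

A `timeout` status would suggest the hang is in the CUDA driver or PyTorch setup on the server rather than in the SSH link.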

cwoolfo1 commented 2 years ago

Yes it has two nvidia gpus. Initially it wasn't giving me this issue.

cwoolfo1 commented 2 years ago

I have resolved that problem, but I am still having my original issue.

I tried training the model on my remote server and got the following message: Error: could not connect to remote server

divideconcept commented 2 years ago

Here's what happens when it fails. It tries to connect to the SSH server, and fails if it cannot connect within 5 seconds (or earlier if it cannot connect at all):

    print("Connecting to remote server...", file=sys.stderr)
    try:
        sshclient.connect(hostname=args.sshaddress, port=args.sshport, username=args.username, password=args.password, pkey=paramiko.RSAKey.from_private_key_file(args.keyfile) if args.keyfile else None, timeout=5)
    except:
        print("Error: could not connect to remote server", file=sys.stderr)
        exit()

The thing is, TorchStudio uses the exact same procedure when you click "Refresh", which seems to work. So maybe the server or network isn't responsive enough, and sometimes it takes longer than 5 seconds to establish the SSH connection? What you can do is open ~/TorchStudio/torchstudio/sshtunnel.py in a text editor and change

    sshclient.connect(hostname=args.sshaddress, port=args.sshport, username=args.username, password=args.password, pkey=paramiko.RSAKey.from_private_key_file(args.keyfile) if args.keyfile else None, timeout=5)

to:

    sshclient.connect(hostname=args.sshaddress, port=args.sshport, username=args.username, password=args.password, pkey=paramiko.RSAKey.from_private_key_file(args.keyfile) if args.keyfile else None, timeout=15)

(I only changed the timeout setting at the end from 5 to 15.) Then save the file, run TorchStudio again and see if it helps with remote training.
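
Before editing the timeout, it may help to measure how long the server actually takes to answer on its SSH port through the VPN. This is a minimal standard-library sketch, not TorchStudio code: `time_ssh_banner` is a hypothetical helper, and it assumes the server sends its SSH identification line right after the TCP connection opens, as OpenSSH does. It only times the TCP connection and first reply, so treat the result as a rough lower bound on the full SSH connection time.

```python
import socket
import time

# Hypothetical helper (not TorchStudio code): time how long it takes to open a
# TCP connection to the SSH port and receive the server's first line.
def time_ssh_banner(host, port=22, timeout_s=15.0):
    """Return (elapsed_seconds, banner). Raises OSError if the connection fails or times out."""
    start = time.monotonic()
    with socket.create_connection((host, port), timeout=timeout_s) as sock:
        sock.settimeout(timeout_s)
        banner = sock.recv(256).decode("ascii", errors="replace").strip()
    return time.monotonic() - start, banner

# Example usage, with a placeholder address to replace with your own server:
#   elapsed, banner = time_ssh_banner("my-server.example", 22)
#   print(f"{elapsed:.2f}s {banner}")
```

If the measured time regularly approaches or exceeds 5 seconds, the longer timeout is worth trying; if it is consistently well below that, the timeout is probably not the cause.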

divideconcept commented 2 years ago

PS: if your remote server is accessible from the internet and you're able to generate a temporary key file to access it, I can give it a try from my side. Let me know.

cwoolfo1 commented 2 years ago

I commented out the try and except blocks around the sshclient.connect line of the code and got the following error:

    paramiko/client.py", line 767, in _auth
        raise SSHException("No authentication methods available")
    paramiko.ssh_exception.SSHException: No authentication methods available

It seems that the password is not being stored in the arg parser.

divideconcept commented 2 years ago

@cwoolfo1 You're right, looking at the GUI C++ source I indeed see that the password is not sent when training starts (only the keyfile, if any). Fixing this and uploading a new build today.
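
A guard like the following could surface a missing password before any connection attempt, instead of letting it show up later as an authentication failure. It is a hypothetical sketch, not the TorchStudio fix: the attribute names mirror the `args.*` fields in the snippet quoted earlier, but the command-line flag spellings and the `parse_ssh_args` helper are assumptions.

```python
import argparse

# Hypothetical sketch (not the actual TorchStudio parser): reject an argument
# set that provides neither a password nor a key file.
def parse_ssh_args(argv):
    parser = argparse.ArgumentParser(description="SSH argument check (illustrative)")
    parser.add_argument("--sshaddress", required=True)
    parser.add_argument("--sshport", type=int, default=22)
    parser.add_argument("--username", required=True)
    parser.add_argument("--password", default=None)
    parser.add_argument("--keyfile", default=None)
    args = parser.parse_args(argv)
    if not args.password and not args.keyfile:
        # parser.error prints the message to stderr and exits with status 2.
        parser.error("no authentication method: pass --password or --keyfile")
    return args
```

With a check of this kind, a caller that forgets to forward the password gets an explicit message at startup rather than a generic connection error.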

divideconcept commented 2 years ago

@cwoolfo1 TorchStudio 0.9.4 has just been released, fixing the remote server connection issue: https://www.torchstudio.ai/download/

cwoolfo1 commented 2 years ago

It appears that your fix introduced a new error. Now when I go into Settings and refresh my server, it gives me the same authentication error. I tried going into Edit to re-enter my password and then refreshing, and I am still getting the same error message.

cwoolfo1 commented 2 years ago

Hurrah! I have solved the error on my end!

divideconcept commented 2 years ago

Ah, glad to hear that, because I read and re-read the source code and couldn't find any mistake this time...