Closed by cwoolfo1 2 years ago
I don't think so, since TorchStudio is able to detect it. Please do the following:
Previously it was fine, but now it is stuck on "Listing devices..." when I try to refresh the server.
OK, do you know what hardware this remote server has? Specifically, do you know if there is any NVIDIA GPU on it? Is it a custom remote server, or a server from AWS/Azure/Google Cloud?
Here's what happens on the remote server when it says "Listing devices...":
print("Listing devices...\n", file=sys.stderr)
devices = {}
devices['cpu'] = {'name': 'CPU', 'pin_memory': False}
for i in range(torch.cuda.device_count()):
    devices['cuda:'+str(i)] = {'name': torch.cuda.get_device_name(i), 'pin_memory': True}
As you can see, it tries to scan CUDA devices. Your remote server is likely stuck at this point.
Yes, it has two NVIDIA GPUs. Initially it wasn't giving me this issue.
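To check whether the CUDA scan itself is what hangs, one option is to run the same calls in a separate Python process with a time limit. This is a minimal sketch, assuming torch is installed in the Python environment used on the remote server; the names run_probe and CUDA_PROBE and the 30-second limit are illustrative and not part of TorchStudio:

```python
import subprocess
import sys

# The same CUDA calls that run while "Listing devices..." is displayed.
CUDA_PROBE = (
    "import torch\n"
    "for i in range(torch.cuda.device_count()):\n"
    "    print(i, torch.cuda.get_device_name(i))\n"
)

def run_probe(code, timeout=30.0):
    """Run `code` in a child Python process; return (finished, output)."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True, text=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # The child did not finish within `timeout` seconds.
        return False, "timed out after %.1f s" % timeout
    return True, result.stdout + result.stderr
```

Calling run_probe(CUDA_PROBE) on the remote server should return one output line per GPU. A False first element would suggest the hang is in the CUDA or driver layer rather than in TorchStudio. Note that a process blocked inside the driver may not exit promptly even after being killed.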
I have resolved that problem, but I am still having my original issue.
I tried training the model on my remote server and got the following message: "Error: could not connect to remote server"
Here's what happens when it fails. It tries to connect to the SSH server, and fails if it cannot connect within 5 seconds (or earlier if it cannot connect at all):
print("Connecting to remote server...", file=sys.stderr)
try:
    sshclient.connect(hostname=args.sshaddress, port=args.sshport, username=args.username, password=args.password, pkey=paramiko.RSAKey.from_private_key_file(args.keyfile) if args.keyfile else None, timeout=5)
except:
    print("Error: could not connect to remote server", file=sys.stderr)
    exit()
The thing is, TorchStudio uses the exact same procedure when you click "Refresh", which seems to work.
So maybe the server or network isn't responsive enough, and sometimes it takes longer than 5 seconds to establish the SSH connection?
What you can do is open ~/TorchStudio/torchstudio/sshtunnel.py in a text editor, and change
sshclient.connect(hostname=args.sshaddress, port=args.sshport, username=args.username, password=args.password, pkey=paramiko.RSAKey.from_private_key_file(args.keyfile) if args.keyfile else None, timeout=5)
to:
sshclient.connect(hostname=args.sshaddress, port=args.sshport, username=args.username, password=args.password, pkey=paramiko.RSAKey.from_private_key_file(args.keyfile) if args.keyfile else None, timeout=15)
(I only changed the timeout setting at the end from 5 to 15).
Then save the file, run TorchStudio again and see if it helps with remote trainings.
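To see whether connection time is really the problem, it may help to measure how long the server takes to accept a connection at all. This is a rough sketch using only the standard library. It times the TCP handshake only, not the SSH key exchange or authentication that paramiko performs afterwards, and the function name time_tcp_connect is a placeholder of mine:

```python
import socket
import time

def time_tcp_connect(host, port, timeout=15.0):
    """Return seconds taken to open a TCP connection, or None on failure."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts.
        return None
```

Running it a few times against the server's SSH address and port (with the VPN connected, if one is used) gives a feel for the delay. Values close to or above 5 seconds, or None, would point to the network rather than to TorchStudio.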
PS: if your remote server is accessible from the internet and you're able to generate a temporary key file to access it, I can give it a try from my side. Let me know.
paramiko/client.py", line 767, in _auth
    raise SSHException("No authentication methods available")
paramiko.ssh_exception.SSHException: No authentication methods available
I commented out the try and except blocks from the ssh.connect line of the code and got the following issue.
It seems that the password is not being stored in the arg parser.
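Rather than removing the try/except, the handler can print the underlying exception next to the generic message. This is a sketch, not the TorchStudio code: connect_or_report is a hypothetical helper, and its connect argument stands in for a call such as sshclient.connect.

```python
import sys
import traceback

def connect_or_report(connect, **kwargs):
    """Call connect(**kwargs); on failure print the real cause, return False."""
    try:
        connect(**kwargs)
    except Exception as exc:
        print("Error: could not connect to remote server", file=sys.stderr)
        # Show the underlying exception instead of hiding it.
        print("Cause: %s: %s" % (type(exc).__name__, exc), file=sys.stderr)
        traceback.print_exc(file=sys.stderr)
        return False
    return True
```

With this shape, an authentication failure would be reported as such instead of looking like a network problem.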
@cwoolfo1 You're right, looking at the GUI C++ source I indeed see that the password is not sent when training starts (only the keyfile, if any). Fixing this and uploading a new build today.
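A missing credential like this can be flagged before the connection attempt. The sketch below is an assumption-laden illustration: the flag names --password and --keyfile are guesses (only the attribute names args.password and args.keyfile appear in the excerpt above), and missing_auth_message is a hypothetical helper.

```python
import argparse

def build_parser():
    # Flag names are assumed for illustration; only the attribute names
    # (password, keyfile) appear in the sshtunnel.py excerpt.
    parser = argparse.ArgumentParser()
    parser.add_argument("--password", default=None)
    parser.add_argument("--keyfile", default=None)
    return parser

def missing_auth_message(args):
    """Return an error string if no credential was supplied, else None."""
    if not args.password and not args.keyfile:
        return "Error: neither a password nor a key file was provided"
    return None
```

The check is only a hint: paramiko can also try an SSH agent or default key files, so an empty password and key file is not always fatal.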
@cwoolfo1 TorchStudio 0.9.4 has just been released, fixing the remote server connection issue: https://www.torchstudio.ai/download/
It appears that your fix introduced a new error. Now when I go into settings and refresh my server, it gives me the same authentication error. I tried going into edit to re-enter my password and then refreshing, and I am still getting the same error message.
Hurrah! I have solved the error on my end!
Ah, glad to hear that, because I read and re-read the source code and couldn't find any mistake this time...
I have added a server to my remote servers list and it was successful. However, when I go to train a model I get an error telling me it could not connect to the server.
I am connecting to this remote server through a VPN. Could that be the thing causing this issue?