coiled / feedback

A place to provide Coiled feedback

Error using notebook --sync #267

Closed clement-chaneching closed 5 months ago

clement-chaneching commented 5 months ago

Hello and happy new year 2024,

I just started testing Coiled and I have some issues when trying to run the file sync with coiled notebook start --sync:

Error: unable to connect to beta: unable to connect to endpoint: unable to dial 
agent endpoint: unable to handshake with agent process: unable to receive server
magic number: EOF (error output: Warning: Permanently added the ED25519 host key
for IP address '35.201.24.109' to the list of known hosts.\r
ubuntu@cluster-wqjgy.dask.host: Permission denied (publickey).)

I am using WSL with mutagen and openssh installed:

OpenSSH_8.4p1 Debian-5+deb11u3, OpenSSL 1.1.1w 11 Sep 2023
Mutagen version 0.17.4

I have created keys using ssh-keygen and used ssh-add, and I can see the cluster key in my known_hosts.

Is there something I am missing? Do I need to add my public key somewhere in coiled? I can successfully use SSH to connect to other machines or Github.

Thanks for your help!

ntabris commented 5 months ago

Hi, @clement-chaneching. Sorry that you've run into a problem with this!

Coiled generates (and uses) a unique SSH keypair for each cluster, so you shouldn't need to worry about your own SSH keys. We do sometimes have some issues with mutagen, though, and I'd like to know if you're running into a problem there or a different problem. Would you mind trying a coiled run command like coiled run echo hello to see if that works? It also uses SSH, so that should help narrow down the problem.

clement-chaneching commented 5 months ago

Hello, thanks for the quick reply @ntabris! coiled run does work.

ntabris commented 5 months ago

A few more things to try:

coiled run echo hello --keepalive 10m

and then

coiled cluster ssh

(which is a wrapper around OpenSSH).

If that works, then the issue is something more specific to mutagen.

Since you're on Windows, these issues about the default shell might be relevant:

My apologies that I don't have a definite solution, but those are things to try. Let me know what you see (or if you have any questions).

clement-chaneching commented 5 months ago

Hello,

It worked, so I guess I'll have to check how mutagen works with WSL.

Thanks anyway, I'll let you know if I find something!

clement-chaneching commented 5 months ago

So I'm trying to run a notebook with --sync but I'm getting

Error attempting to connect sync...

Error: unable to connect to beta: unable to connect to endpoint: unable to dial 
agent endpoint: unable to handshake with agent process: unable to receive server
magic number: EOF (error output: Warning: Permanently added the ED25519 host key
for IP address '35.244.113.225' to the list of known hosts.\r
ubuntu@cluster-qhngw.dask.host: Permission denied (publickey).)

So I have set up a notebook without sync using coiled notebook start. When I try to run mutagen or ssh on ubuntu@cluster-bweds.dask.host, I'm getting "Permission denied (publickey)".

But I can ssh into this machine using coiled cluster ssh.

 coiled cluster ssh
===Starting SSH session to scheduler at cluster-bweds.dask.host===

And if I add my public key to the authorized keys, then I can run ssh ubuntu@cluster-bweds.dask.host and mutagen sync create, and it works fine. Logged in:

ubuntu@coiled-dask-clement6e-347796-scheduler-5231266:~$

Do you have any idea why I would be able to run coiled cluster ssh but not ssh into the notebook cluster or run --sync?

ntabris commented 5 months ago

Hm. So it works if you manually put your own key on the VM and then manually run mutagen sync create.

I'm curious if it works if you run coiled cluster ssh --add-key (which adds the key we made for this cluster to your ssh agent) and then manually create the mutagen sync.

I'm also curious if you have many identities already loaded in your ssh agent (i.e., how many things does ssh-add -l show?).
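
One plausible reason to ask: an SSH server only accepts a limited number of authentication attempts per connection, so an agent holding many identities can use them up before the right key is offered. A minimal sketch for counting loaded identities follows. It assumes the usual OpenSSH listing format, where each identity line of ssh-add -l starts with the key's bit length. count_agent_identities is a hypothetical helper name, not part of Coiled or OpenSSH.

```shell
# Hypothetical helper: count identities in `ssh-add -l` output read from stdin.
# Assumes each identity line starts with the key size, for example:
#   256 SHA256:... comment (ED25519)
# The "The agent has no identities." message does not match, so it counts as 0.
count_agent_identities() {
    # grep -c prints the number of matching lines; it exits non-zero when
    # nothing matches, so `|| true` keeps the helper's status at 0.
    grep -c '^[0-9][0-9]* ' || true
}

# Usage (requires a running ssh-agent):
#   ssh-add -l | count_agent_identities
```

If the count is large, trying again with fewer identities loaded would show whether that is related.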

clement-chaneching commented 5 months ago

Yep, I just created a new notebook. I cannot ssh at first, but after running coiled cluster ssh --add-key I can SSH.

But I still cannot use mutagen unless I add my public key to the cluster's ~/.ssh/authorized_keys. Then I can use mutagen on the VM, but because notebooks are running on docker /tmp, I still cannot sync between local and notebooks.

Now I just have the 2 clusters that worked in my ssh agent.

Thanks a lot for your help and for following up! Let me know if you have any solution.

ntabris commented 5 months ago

If you'd like to try some more troubleshooting, there's a new version of the coiled package that includes some extra debug options.

You'd pip install coiled==1.3.2 to get this version.

You'd then start a notebook by running coiled notebook start --name test-name --no-block. This would give us a notebook that doesn't yet have sync running.

(test-name can be anything, but the other commands will reference cluster by name, so it's easier if we specify name and can use that, rather than letting start pick a random name like it does by default.)

Here's the new command that will attempt to start sync on the notebook:

coiled notebook start-sync test-name --debug

If this works, great! But it probably won't.

It will print out the commands it's running, though, so you could:

  1. make sure known_hosts has the new fingerprint like it's supposed to
  2. try running the mutagen command manually
  3. if that doesn't work, try running coiled cluster ssh --add-key and then running the mutagen command manually
  4. if that doesn't work, try copying your personal key to the VM (like you've done before) and then running mutagen command manually

That should help narrow down where the problem is.
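
Since the steps above depend on the commands that --debug prints, it can help to keep a copy of that output. The sketch below wraps the start-sync command from this thread and saves everything it prints to a file. debug_sync and the default log file name are hypothetical, and the exact content of the debug output is not assumed.

```shell
# Hypothetical helper: run the debug sync and save what it prints, so the
# mutagen command it reports can be copied and re-run by hand later.
debug_sync() {
    if [ -z "$1" ]; then
        echo "usage: debug_sync NAME [LOGFILE]" >&2
        return 2
    fi
    name="$1"
    log="${2:-start-sync-debug.log}"   # default file name is an assumption
    # Merge stderr into stdout so errors are captured too; tee shows the
    # output on the terminal and writes the same text to the log file.
    coiled notebook start-sync "$name" --debug 2>&1 | tee "$log"
}

# Usage:
#   debug_sync test-name
```

The saved log can then be used for steps 2 to 4 without re-running start-sync.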

ntabris commented 5 months ago

Oh, and when you're done, it's coiled notebook stop test-name to stop the notebook. (Normally this happens when you control-c the widget, but --no-block disabled that.)

clement-chaneching commented 5 months ago

Hello!

Thanks for all the help. So I did follow your instructions and it worked exactly as you expected. I ran coiled notebook start-sync test-name --debug and it didn't work. The new fingerprint has been successfully added to known_hosts and I can ssh into the VM, but I cannot use mutagen.

It does work when I manually add my id_rsa.pub to the authorized_keys of the Coiled VM.

So I guess I have something to fix in the mutagen config?

But even if I fix that, I don't understand how it is supposed to sync between JupyterLab and my local folder, since it will sync to remote:/scratch/synced, but the notebooks I create from the UI are in a /docker/tmp folder.

Or is there something I am missing?

ntabris commented 5 months ago

Thanks for all your patience trying to figure this out together!

At the moment I'm puzzled why your key would work but our key (which apparently does work for coiled cluster ssh) would not work for mutagen. This isn't something I've seen before.

If you manually add your id_rsa.pub to the Coiled VM, does coiled notebook start-sync test-name --debug then work?

But even if I fix that, I don't understand how it is supposed to sync between JupyterLab and my local folder, since it will sync to remote:/scratch/synced, but the notebooks I create from the UI are in a /docker/tmp folder.

Yeah, there's some other stuff we do (mount inside docker + symlink) to make this all work.

clement-chaneching commented 5 months ago

Yes, it does work when I add my id_rsa.pub to the Coiled VM and run mutagen.

Yeah, there's some other stuff we do (mount inside docker + symlink) to make this all work.

Oh ok, so is it documented anywhere? I don't understand why we would have a notebook --sync flag if it doesn't sync the notebooks. Or how can we have a persistent volume for the notebooks?

ntabris commented 5 months ago

@clement-chaneching and I did some troubleshooting on a call.

He was able to repeatedly get things working with

coiled notebook start --name test-name --no-block
coiled notebook start-sync test-name --debug

but got a mutagen SSH error when running coiled notebook start --sync.

This is very puzzling since start-sync literally runs the same code that's run when you use --sync.

Clement said that he's satisfied using the two separate commands as a workaround.
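
For anyone who wants to use the same workaround, the two commands can be wrapped in a small shell function. This is only a sketch: start_notebook_with_sync is a hypothetical name, and it assumes that coiled exits with a non-zero status when starting the notebook fails.

```shell
# Hypothetical wrapper around the two-command workaround from this thread.
start_notebook_with_sync() {
    if [ -z "$1" ]; then
        echo "usage: start_notebook_with_sync NAME" >&2
        return 2
    fi
    name="$1"
    # Start the notebook without sync and without blocking the terminal.
    # Assumption: a non-zero exit status means the start failed.
    coiled notebook start --name "$name" --no-block || return 1
    # Start sync as a separate step, printing the commands being run.
    coiled notebook start-sync "$name" --debug
}

# Usage:
#   start_notebook_with_sync test-name
# When finished, stop the notebook with:
#   coiled notebook stop test-name
```

Because --no-block is used, the notebook keeps running until it is stopped explicitly.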