Unable to start a remote REPL

danielmatz commented 2 years ago

I was excited to hear about the new remote REPL capability. I was just giving it a try, but it fails to start the REPL. Here's what I see in my *Messages* buffer:

Copying /Users/dmatz/.emacs.d/straight/build/julia-snail/JuliaSnail.jl to /scpx:fsl:/tmp/julia-snail-9b77eb2bfeca59ef3ac3495f278cb1efa7f5e14657cb93858f9b8212620d055a/JuliaSnail.jl...done
Copying /Users/dmatz/.emacs.d/straight/build/julia-snail/Manifest.toml to /scpx:fsl:/tmp/julia-snail-9b77eb2bfeca59ef3ac3495f278cb1efa7f5e14657cb93858f9b8212620d055a/Manifest.toml...done
Copying /Users/dmatz/.emacs.d/straight/build/julia-snail/Project.toml to /scpx:fsl:/tmp/julia-snail-9b77eb2bfeca59ef3ac3495f278cb1efa7f5e14657cb93858f9b8212620d055a/Project.toml...done
Starting Julia process and loading Snail...
user-error: The vterm buffer is inactive; double-check julia-snail-executable path

My julia-snail-executable variable is still set to the default of julia. If I launch vterm manually, I can do which julia successfully. I can also use shell-command to run which julia, and that works.

I saw you had a julia-snail-debug variable, but when I set it to t, I didn't get any additional output.

Any thoughts on what is going on?

Thanks in advance for the help! And thanks for Snail!

gcv commented 2 years ago

That error means that the Julia executable binary has not been found on the remote host. To use the remote REPL, you need to have Julia installed remotely, and whatever shell you run when you ssh to the remote host has to be able to find julia-snail-executable. The default should work if julia is on the remote host’s login shell’s PATH. So when you say that which julia works, do you mean it works on the local or the remote host?

Note that you can set julia-snail-executable per-project using .dir-locals.el, so it’s not necessarily a global setting.

danielmatz commented 2 years ago

Setting julia-snail-executable to the full path to my remote julia worked! I'm not sure why it wasn't finding it on the PATH. When I was talking about which julia, that was indeed on the remote host.

gcv commented 2 years ago

Maybe it's because of the distinction between interactive, non-interactive, login, and non-login shells? See this discussion: https://unix.stackexchange.com/questions/38175/difference-between-login-shell-and-non-login-shell. I was being sloppy upthread when I said it has to be on the login shell's PATH.

Snail just opens an ssh tunnel to the remote host. I'm not 100% sure what kind of shell that spins up (login? interactive?), but it probably does not run the same startup file that your normal interactive login shell does. As a quick test, you can try doing something like ssh myhost 'echo $PATH' or ssh myhost 'which julia' and see if that gives a reasonable output.
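
The quick test suggested above can be turned into a small check. The following is only a sketch: the helper name on_path is made up for illustration, and myhost stands in for the real remote host. It reports whether an executable can be resolved on a given PATH string, so it can be fed the PATH that the ssh command shell actually sees.

```shell
#!/usr/bin/env bash
# on_path NAME PATHSTRING (hypothetical helper): succeed if an executable
# file called NAME exists in one of the colon-separated directories of
# PATHSTRING.
on_path() {
  local name=$1 dir
  local IFS=:
  for dir in $2; do
    # An empty PATH component means the current directory.
    if [ -x "${dir:-.}/$name" ] && [ ! -d "${dir:-.}/$name" ]; then
      return 0
    fi
  done
  return 1
}

# Local check against the current PATH.
if on_path julia "$PATH"; then
  echo "julia found on the local PATH"
else
  echo "julia not found on the local PATH"
fi

# Against the PATH seen by a non-interactive ssh command shell
# (myhost is a placeholder):
#   on_path julia "$(ssh myhost 'echo $PATH')" || echo "not on the remote PATH"
```

Note that the remote variant only tests for the file from the local machine's point of view if the directories are shared; running the function on the remote host itself is the more reliable check.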

danielmatz commented 2 years ago

Yes, I think you are right. Doing ssh myhost 'which julia' can't find julia.

Should Snail honor tramp-remote-path?

gcv commented 2 years ago

I looked at the manual entry for tramp-remote-path (https://www.gnu.org/software/emacs/manual/html_node/tramp/Remote-programs.html) and it gave me the impression that it's just there to help Tramp itself find some basic utilities it needs to operate (like ls). Since Snail works at a higher level, it doesn't seem to me like the right thing. But I'm willing to change my mind. Do you use tramp-remote-path for something like this?

danielmatz commented 2 years ago

Well, I'll be... I guess I've always misunderstood that variable. I was thinking it was the reason that my shell-command example where I ran which julia on the remote host worked. But I think you were on the right track to begin with. TRAMP is starting up a shell (but with a different combination of interactive and login), and the shell's own configuration is what is allowing it to find the julia executable... Sorry for pointing you in the wrong direction.

I was just trying to think of a way to allow julia-snail-executable to be set to "julia" and have it honor the remote host's PATH.

You've been very patient and kind. I hope you don't mind one last attempt on my part.

Right now, you use the same ssh call to establish the tunnel and to launch Julia. Could you separate those? If you first establish the tunnel, and then use start-file-process (or similar) to launch Julia, then you'll essentially be deferring to how TRAMP handles remote processes. So, if someone has already configured their remote system and TRAMP to be able to find julia, it should "just work."

gcv commented 2 years ago

Setting Snail aside for a moment, let me clarify something: you configured Tramp to launch a Julia REPL? If so, could you help me understand your workflow (which I assume from your previous comments you used before Snail added remote REPL support)?

How is PATH set on your remote system, and which shell do you use? Is it possible that the PATH entry for your Julia binary is set in a file that doesn't get loaded by a non-interactive shell (which seems to be what ssh host 'which julia' runs)? I just did a bit of research and testing on this, and the tl;dr is:

The distinction between login and non-login shells is historical silliness and just adds complexity to shell startup scripts. In any case, Snail executes ssh -t which supposedly allocates a tty and makes it a login shell. 🤷🏻‍♂️
Zsh: .zshenv is always executed; .zshrc is only executed by interactive shells.
Bash: .bashrc is only executed by interactive non-login shells; .bash_profile is only executed by login shells. To run your setup regardless of shell type, you're supposed to put everything in .bashrc and source .bashrc from .bash_profile.
No idea about Fish, csh/tcsh, or any exotic shells.

There is more complexity if you have global configurations in /etc, which I can imagine being relevant if you run Julia on a cluster which does things like put binaries in places like /opt/julia/1/6/2/bin and configures PATH in /etc/profile.

I'm stressing this because it seems to me that you expect both interactive and non-interactive shells to have julia set on the PATH, but have not actually set up your remote environment to do so.

PS: Once we get to the bottom of this, I will update the documentation to clarify all this complexity.

danielmatz commented 2 years ago

My old workflow for remote machines was to use M-x compile to run julia in more of a scripting mode. TRAMP seems to use /bin/sh by default, so I have my remote .profile configured to set up my PATH, and things just work. If I needed a full REPL on the remote machine, I would just open a shell with vterm. And in that case I'd get a Bash shell, which I also configured to set up PATH properly.

So, Snail's ssh command should be getting a Bash shell when it connects. My .bash_profile has the PATH configured. Based on your comments, I moved that PATH manipulation into my .bashrc and made sure my .bash_profile was sourcing my .bashrc. I still got the same error.

Our cluster does indeed manipulate PATH in /etc. We have an environment module system. I think you are right that that is the root problem here. That is, ssh myhost "which julia" doesn't work. I have to wrap the command in another invocation to bash and force it to load the settings, something like ssh myhost 'bash -l -c "which julia"'. I'm definitely not asking Snail to do something like that.

I'm sorry that this devolved into you debugging my shell configuration... In the end, I think my best option is where this all started, with me setting julia-snail-executable. I'll probably use the new connection-local variable feature.

gcv commented 2 years ago

This is pretty interesting. I expect setups like yours to come up somewhat frequently, and want to add guidance to the documentation.

I just triple-checked, and bash definitely reads .bashrc, and the Snail ssh connection works in my test environment when there's something odd about the location of the Julia binary but where a special PATH entry exists in .bashrc.

I did find a bug which occurs if you have a different default username configured for your remote host in .ssh/config from the one used in the Tramp connection string (e.g., if .ssh/config says myhost should use default username myname1 but your Tramp connection string was /ssh:myname2@myhost:). If that's the case in your setup, then it would explain why the .bashrc change didn't work for you. Fixed in 5b9d95f. It's not in MELPA yet (CI will grab it in the next couple of hours), but since it looks like you use straight.el, you can pull the change right away.

Assuming that doesn't fix your problem, the next thing you should check is that ssh myhost 'echo $SHELL' is actually bash. That you have to dance with 'bash -l -c "which julia"' suggests that something unusual is going on (maybe your cluster has another shell as your login default shell which then does exec bash at the end of its own configuration file). I would also look at ssh myhost 'echo $PATH' for clues about what your cluster is doing when you log in.

danielmatz commented 2 years ago

Sorry, I don't think I explained myself clearly. I think the core issue is that when you run a command using ssh, the shell it starts up is not interactive and is not a login shell, and so /etc/profile is not sourced. That means the environment module system that we use in our lab never gets set up. That means that my PATH doesn't get configured properly by my .bashrc, which is indeed being run. In fact, it prints out an error, because the environment module commands fail.

The point of wrapping the command in bash -l was to force bash to start up in such a way that it sources /etc/profile. See this excerpt from the bash man page:

When bash is invoked as an interactive login shell, or as a non-interactive shell with the --login option, it first reads and executes commands from the file /etc/profile, if that file exists.
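
To see which kind of shell a given invocation produces, a small diagnostic can help. This is a minimal sketch, assuming bash on both ends; the function name shell_kind is made up for illustration.

```shell
#!/usr/bin/env bash
# shell_kind (hypothetical helper): print whether the current bash is a
# login shell and whether it is interactive.
shell_kind() {
  local login=non-login interactive=non-interactive
  # shopt -q prints nothing; only its exit status carries the answer.
  if shopt -q login_shell; then login=login; fi
  # $- holds the current option flags; "i" appears in interactive shells.
  case $- in *i*) interactive=interactive ;; esac
  printf '%s %s\n' "$login" "$interactive"
}

# Make the function visible to the child bash processes started below.
export -f shell_kind

bash -c shell_kind                  # prints: non-login non-interactive
bash --noprofile -l -c shell_kind   # prints: login non-interactive
```

If the remote account's shell is bash, something like ssh myhost "$(declare -f shell_kind); shell_kind" (with myhost as a placeholder) should report what the ssh command shell looks like on the remote side.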

gcv commented 2 years ago

Well, as long as setting julia-snail-executable in a .dir-locals.el or .dir-locals-2.el works for you, then I'm happy. But I had another idea, which, if it works, is worth documenting as a workaround.

What if, in your .bashrc file, you put something like this:

if ! shopt -q login_shell && [[ $- != *i* && -f /etc/profile ]]; then
    . /etc/profile
fi

This sources /etc/profile if you're in a non-interactive non-login shell and if /etc/profile exists. Or, as a one-liner:

! shopt -q login_shell && [[ $- != *i* && -f /etc/profile ]] && . /etc/profile

I couldn't quite test it, because on my test environment /etc/profile gets sourced by non-login non-interactive shells (if I run ssh bashuser@myhost 'echo $__ETC_PROFILE_SOURCED' it prints 1, which is set in /etc/profile — no clue what's going on, since this behavior contradicts the bash man page; maybe a version difference).

PS: I can only shake my head at POSIX shells in general and bash in particular. The check for an interactive shell is especially a thing of beauty.

danielmatz commented 2 years ago

Wow! That snippet is wild, but it works! Snail can now find my remote Julia installation.

Unfortunately, I now encounter a new error:

julia> JuliaSnail.start(10011); # please wait, time-to-first-plot...
ERROR: IOError: listen: address already in use (EADDRINUSE)

I've tried changing julia-snail-port to several different values, and I always get the same error. I can use lsof -i :10011 to see the SSH process is indeed listening to that port.

gcv commented 2 years ago

That error means port 10011 is in use on the remote host. Kill Snail and all tramp sessions. ssh into the remote host, and run ps auwwwx | grep -i julia and see if there's a stray Julia process hanging out?

When you say that you tried changing julia-snail-port to different values but get the same error, does that mean the port in the JuliaSnail.start call is always 10011, or does it change to match julia-snail-port?

Hmm, I've actually been assuming that your cluster runs a recent Linux and OpenSSH combination. Can you please confirm that? uname -a and ssh -V are a good start. If Linux, what distribution?
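
A quick way to check whether anything is already listening on a port, without extra tools, is bash's /dev/tcp redirection. The sketch below is an assumption-laden illustration: port_open is a made-up name, it relies on a bash built with /dev/tcp support, and it only detects listeners reachable on the given address (127.0.0.1 by default). It would need to be run on the host where the EADDRINUSE error occurs.

```shell
#!/usr/bin/env bash
# port_open PORT [HOST] (hypothetical helper): succeed if something accepts
# TCP connections on HOST:PORT. HOST defaults to 127.0.0.1.
port_open() {
  local port=$1 host=${2:-127.0.0.1}
  # The subshell opens the socket on fd 3; the descriptor is closed again
  # when the subshell exits. A refused connection makes the subshell fail.
  (exec 3<>"/dev/tcp/$host/$port") 2>/dev/null
}

for port in 10011 10099; do
  if port_open "$port"; then
    echo "port $port: something is listening"
  else
    echo "port $port: nothing accepted a connection"
  fi
done
```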

danielmatz commented 2 years ago

> That error means port 10011 is in use on the remote host. Kill Snail and all tramp sessions. ssh into the remote host, and run ps auwwwx | grep -i julia and see if there's a stray Julia process hanging out?

I restarted Emacs entirely, checked for stray Julia processes, and still got the same error.

> When you say that you tried changing julia-snail-port to different values but get the same error, does that mean the port in the JuliaSnail.start call is always 10011, or does it change to match julia-snail-port?

Yes, sorry, the JuliaSnail.start command does always reflect the port I set. I also tried playing around with julia-snail-remote-port, which gets me past that Julia error, but then I get an Emacs message that it failed to connect to the Snail server.

> Hmm, I've actually been assuming that your cluster runs a recent Linux and OpenSSH combination. Can you please confirm that? uname -a and ssh -V are a good start. If Linux, what distribution?

I believe we use CentOS.

uname -a:

Linux myhost 3.10.0-1160.25.1.el7.x86_64 #1 SMP Wed Apr 28 21:49:45 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

ssh -V:

OpenSSH_7.4p1, OpenSSL 1.0.2k-fips  26 Jan 2017

gcv commented 2 years ago

Did a remote REPL ever work for you? It sounded from https://github.com/gcv/julia-snail/issues/54#issuecomment-894874309 like it did, but now it’s sounding like it never works.

gcv commented 2 years ago

What if you set both julia-snail-port and julia-snail-remote-port to the same value, but not 10011 (like 21138)?

danielmatz commented 2 years ago

> Did a remote REPL ever work for you? It sounded from #54 (comment) like it did, but now it’s sounding like it never works.

No, it never did work for me. I was just doing M-x compile and running Julia like a scripting language on remote hosts before.

> What if you set both julia-snail-port and julia-snail-remote-port to the same value, but not 10011 (like 21138)?

I tried that, but no luck.

gcv commented 2 years ago

While digging for reasons this problem occurs, I found another potential bug which affects Snail working with some (newer?) versions of emacs-libvterm and which may cause ssh invocations to misfire. It's now fixed in the latest MELPA build, so please update the julia-snail package, restart Emacs, and see if you still have the problem.

If that doesn't work, let's get to some real debugging. We need to get Emacs out of the picture and understand what happens with your ssh tunnel. You will use netcat (nc) to send commands from the local machine (client) to the Snail server on the cluster. (You may have to install netcat from your package manager.)

1. Clean up all stray julia instances on the remote host.
2. Shut down all ssh connections from your local machine to the remote host.
3. Copy JuliaSnail.jl, Project.toml, and Manifest.toml from the Snail package (or directly from GitHub) to some directory on your remote host.
4. Start the ssh tunnel: ssh -t -L 10069:localhost:10099 remotehost /path/to/julia/binary -L /path/to/JuliaSnail.jl/on/remote/host
5. When the Julia REPL pops up, run JuliaSnail.start(10099) and wait for the prompt.
6. From your local machine, run this:
```
echo '(reqid = "abcd1234", ns = [:Main], code = "println(\"hello world\")")' | nc localhost 10069
```
This should print hello world to your Julia REPL and error out with a message like IOError: stream is closed or unusable.

If the Snail server fails to start on the remote host with the EADDRINUSE error, then replace 10099 with a different port number in both the ssh tunnel call and the JuliaSnail.start call. If that still does not work, something prevents you from opening up server sockets on the remote host, and you should ask your system administration staff for an explanation.

If the Snail server starts but there is no output in the Julia REPL from the netcat call, your tunnel is not being set up correctly. Maybe it's your local machine, maybe it's a firewall, maybe it's configuration, and maybe it's something else. Since I have no way to reproduce this situation, I cannot help any further.

If everything works without Emacs, but does not work inside Emacs, then maybe your Snail installation is broken. Blow it away completely (delete from disk), reinstall from MELPA or GitHub, and restart Emacs.
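
For repeated tests, the request from the netcat step can be generated instead of typed by hand. This is only a sketch: the message layout is copied from the example above, the helper name snail_request is made up, and it assumes the code snippet contains no $ characters (which would need extra escaping inside a Julia string literal).

```shell
#!/usr/bin/env bash
# snail_request REQID CODE (hypothetical helper): print a request in the
# same shape as the netcat example above, escaping backslashes and double
# quotes in CODE so that it fits inside the quoted code field.
snail_request() {
  local reqid=$1 code=$2
  code=${code//\\/\\\\}   # escape backslashes first
  code=${code//\"/\\\"}   # then escape double quotes
  printf '(reqid = "%s", ns = [:Main], code = "%s")\n' "$reqid" "$code"
}

snail_request abcd1234 'println("hello world")'

# To send it through the tunnel from the steps above:
#   snail_request abcd1234 'println("hello world")' | nc localhost 10069
```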

danielmatz commented 2 years ago

Woohoo! I pulled your latest bug fix with straight and it works! Thank you again for your help tracking down my issues. And thank you again for Snail!

danielmatz commented 1 year ago

This issue has returned for me. I get the following output when I try to start snail remotely:

Starting Julia process and loading Snail...
if: The vterm buffer is inactive; double-check julia-snail-executable path

Creating a .dir-locals.el file to set julia-snail-executable doesn't help.

I tried to follow your debug steps above, and I can get through to step 6. When I run the echo command, I get this back locally:

(julia-snail--response-success "abcd1234" nil)

But the remote process prints the following out in an infinite loop:

JuliaSnail: something broke: type Nothing has no field redid

Are there any changes to the debug steps I can try? Thanks!

gcv commented 1 year ago

That looks like a broken Snail installation on the remote host. You're absolutely certain the error says no field redid? Not reqid?

danielmatz commented 1 year ago

You are right; it says reqid. I think autocorrect got me... sorry about that.

gcv commented 1 year ago

I reproduced the problem. 👀

gcv commented 1 year ago

Something changed in the network IO code of Julia 1.8. While I figure out WTF broke between Julia versions and adapt Snail to deal with it, could you please try 1.7.x and see if that works for you?

danielmatz commented 1 year ago

I was able to test with 1.7.2 on my remote system, and it does indeed work.

gcv commented 1 year ago

Opening a separate ticket to track the new problem: https://github.com/gcv/julia-snail/issues/120
