abailly / jepsen-vagrant

Vagrant box for running jepsen tests

SSH'ing into nodes fails out of the box for all tests. #6

Open hausdorff opened 8 years ago

hausdorff commented 8 years ago

I don't know clojure, and I've never used Vagrant before, so apologies if there is something simple I'm missing.

All of the tests seem to fail out of the box. When you run them you end up with something approximating the following:

vagrant@jepsen:/jepsen/jepsen/mongodb$ lein test
WARNING: run! already refers to: #'clojure.core/run! in namespace: jepsen.core, being replaced by: #'jepsen.core/run!
WARNING: run! already refers to: #'clojure.core/run! in namespace: jepsen.tests, being replaced by: #'jepsen.core/run!
WARNING: run! already refers to: #'clojure.core/run! in namespace: jepsen.mongodb.core-test, being replaced by: #'jepsen.mongodb.core-test/run!

lein test jepsen.mongodb.core-test

lein test :only jepsen.mongodb.core-test/document-cas-majority-test

ERROR in (document-cas-majority-test) (Session.java:512)
Uncaught exception, not in assertion.
expected: nil
  actual: com.jcraft.jsch.JSchException: Auth fail
 at com.jcraft.jsch.Session.connect (Session.java:512)
    com.jcraft.jsch.Session.connect (Session.java:183)
    clj_ssh.ssh$eval4504$fn__4505.invoke (ssh.clj:118)
    clj_ssh.ssh.protocols$eval4430$fn__4453$G__4415__4462.invoke (protocols.clj:4)
    clj_ssh.ssh$connect.invoke (ssh.clj:401)
    jepsen.control$session.invoke (control.clj:197)
    clojure.lang.AFn.applyToHelper (AFn.java:154)
    clojure.lang.AFn.applyTo (AFn.java:144)
    clojure.core$apply.invoke (core.clj:630)
    clojure.core$with_bindings_STAR_.doInvoke (core.clj:1868)
    clojure.lang.RestFn.applyTo (RestFn.java:142)
    clojure.core$apply.invoke (core.clj:634)
    clojure.core$bound_fn_STAR_$fn__4439.doInvoke (core.clj:1890)
    clojure.lang.RestFn.applyTo (RestFn.java:137)
    clojure.core$apply.invoke (core.clj:630)
    jepsen.core$fcatch$wrapper__6702.doInvoke (core.clj:54)
    clojure.lang.RestFn.invoke (RestFn.java:408)
    clojure.core$pmap$fn__6744$fn__6745.invoke (core.clj:6729)
    clojure.core$binding_conveyor_fn$fn__4444.invoke (core.clj:1916)
    clojure.lang.AFn.call (AFn.java:18)
    java.util.concurrent.FutureTask.run (FutureTask.java:266)
    java.util.concurrent.ThreadPoolExecutor.runWorker (ThreadPoolExecutor.java:1142)
    java.util.concurrent.ThreadPoolExecutor$Worker.run (ThreadPoolExecutor.java:617)
    java.lang.Thread.run (Thread.java:745)

Ran 1 tests containing 1 assertions.
0 failures, 1 errors.
Tests failed.

Interestingly, I am able to ssh root@n1 (for example), so it seems like perhaps some ssh daemon somewhere is not talking to our process correctly.

When I crack open lein repl and do something like

(doto (ssh/session (ssh/ssh-agent {}) "n1" {:username "root" :password "root" :port 22 :strict-host-key-checking :yes}) (ssh/connect))

I (perhaps obviously) get the same error:

JSchException Auth fail com.jcraft.jsch.Session.connect (Session.java:512)

But when I replace that name with the IP of the underlying container, I get a different error:

(doto (ssh/session (ssh/ssh-agent {}) "192.168.122.11" {:username "root" :password "root" :port 22 :strict-host-key-checking :yes}) (ssh/connect))

results in

JSchException reject HostKey: 192.168.122.11 com.jcraft.jsch.Session.checkHost (Session.java:771)

I had a look in /var/log/auth.log but these incidents don't seem to be logged.

Do you have any ideas? I am unfortunately a complete networking noob so I'm not sure where else to look.

abailly commented 8 years ago

Hello, thanks for your interest in this project. I have not touched it in a while, so it may have suffered from bitrot. IIRC you need to modify Jepsen's source to get authentication right. The README has a number of troubleshooting hints about authentication issues. Have you tried them?

hausdorff commented 8 years ago

Yep, I sure have. I didn't see anything about modifying source, but I did a bunch of other stuff, like clearing out the known hosts file and repopulating it with the correct, un-hashed keys.

The second of the errors above is happening, btw, because the IP address is not in the known hosts file. But the fact that I am having trouble even resolving the name n1 is hard to debug, because I just don't know anything about networking. If you could just point me in the right general direction, I could do the rest of the work myself.
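One way to confirm whether a given name or IP has a plain (un-hashed) entry in the known hosts file is sketched below. `has_plain_entry` is a hypothetical helper, not part of Jepsen or OpenSSH. It only compares the first field of each line and skips hashed entries, which start with `|1|`.

```shell
#!/bin/sh
# has_plain_entry FILE HOST  (hypothetical helper, for illustration only)
# Succeeds if FILE contains an un-hashed known_hosts line naming HOST.
has_plain_entry() {
    awk -v host="$2" '
        /^[ \t]*#/ || NF < 3 { next }   # skip comments and malformed lines
        $1 ~ /^\|1\|/ { next }          # skip hashed host names
        {
            n = split($1, names, ",")   # first field: comma-separated names
            for (i = 1; i <= n; i++)
                if (names[i] == host) found = 1
        }
        END { exit(found ? 0 : 1) }
    ' "$1"
}
```

For example, `has_plain_entry ~/.ssh/known_hosts 192.168.122.11 || echo "no plain entry"` would show whether the IP used in the second REPL call is listed at all.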

abailly commented 8 years ago

Hmm, I think this SO question could be a good start: http://stackoverflow.com/questions/28621167/unable-to-run-jepsen-test-for-either-elasticsearch-or-rabbitmq

I managed to get it working by tweaking the jepsen/control.clj file directly, but this is definitely not a smooth process. You can authorize all host keys this way:

for i in 1 2 3 4 5; do ssh-keyscan -t rsa n${i}; done >> ~/.ssh/known_hosts

You can look at each host's /var/log/auth.log to check what is failing in authentication. There is also a debug flag you can tweak on each sshd.
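A slightly more defensive variant of the ssh-keyscan loop is sketched here, assuming it is useful to see which nodes returned no key at all. `scan_nodes` and the `KEYSCAN` override are hypothetical names introduced for illustration. The override only exists so the loop can be dry-run with a stub instead of the real `ssh-keyscan`.

```shell
#!/bin/sh
# KEYSCAN is a hypothetical override; by default the real ssh-keyscan is used.
KEYSCAN=${KEYSCAN:-ssh-keyscan}

# scan_nodes FILE HOST...  (hypothetical helper, for illustration only)
# Appends the RSA host key of each HOST to FILE and warns on stderr
# about any host for which the scan printed nothing.
scan_nodes() {
    file=$1; shift
    for host in "$@"; do
        keys=$($KEYSCAN -t rsa "$host" 2>/dev/null)
        if [ -n "$keys" ]; then
            printf '%s\n' "$keys" >> "$file"
        else
            echo "no host key returned for $host" >&2
        fi
    done
}
```

It could be called as `scan_nodes ~/.ssh/known_hosts n1 n2 n3 n4 n5`. A warning for a node suggests a name resolution or connectivity problem rather than an authentication one.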

I will have a look this evening (CET).

hausdorff commented 8 years ago

Yep, that is a good SO answer, and I did all those things.

Instead of trying to change the code itself, though, I just cracked open the lein repl and tried calling the clj-ssh code that control.clj uses to open the SSH connection directly. For a variety of values, this doesn't work, which seems to suggest that the problem is actually the way networking is configured in the vagrant container -- which is why I reported the bug to you and not Kyle. :)

Perhaps the DNS/DHCP is configured incorrectly? Other thoughts?

abailly commented 8 years ago

Might be, but apparently you can log in from lein repl once the host keys are authorized, right?

hausdorff commented 8 years ago

I have actually never successfully logged in from lein repl. Worse, the auth.log is not reporting SSH errors when I try, which would suggest that the values in the known_hosts file are wrong. But I did complete the ssh-keyscan steps above, so I'm not sure how that could be true.

hausdorff commented 8 years ago

It embarrasses me to ask (since it indicates a pretty thorough lack of networking knowledge :) ) but perhaps I have to restart some daemon after I delete the entries out of the known_hosts and replace them with the un-hashed ones?

abailly commented 8 years ago

Can you log in from the console?

hausdorff commented 8 years ago

Yep, I can. ssh root@n1 authenticates with the key rather than the password, though. Not sure if that matters.

hausdorff commented 8 years ago

Does lein test work for you, btw?

abailly commented 8 years ago

It could matter, yes. IIRC the Clojure code does not understand key authentication, so it might be the case that the password is incorrect.

Can you try logging in with the password?

hausdorff commented 8 years ago

Ah. I thought that root@n1's password should be root.

Based on the following, I don't think it is:

vagrant@jepsen:~$ ssh -o PreferredAuthentications=password -o PubkeyAuthentication=no root@n1
root@n1's password:
Permission denied, please try again.
root@n1's password:

I thought that the password should be root.

When I chroot /var/lib/lxc/n1/rootfs and attempt to change the password to root, the above still doesn't work. Hmmmmmm.
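If the password itself is right, the container's sshd configuration may still refuse it: a `PermitRootLogin` value such as `without-password` or `prohibit-password`, or `PasswordAuthentication no`, blocks root password logins regardless of the password. The sketch below reads those settings from a config file. `sshd_setting` is a hypothetical helper, and the path `/var/lib/lxc/n1/rootfs/etc/ssh/sshd_config` is an assumption based on the rootfs location mentioned above. The helper ignores `Match` blocks and `Include` directives.

```shell
#!/bin/sh
# sshd_setting FILE KEYWORD  (hypothetical helper, for illustration only)
# Prints the first value set for KEYWORD in an sshd_config-style file.
# sshd uses the first value obtained for a keyword, and keywords are
# case-insensitive. Prints nothing if the keyword is not set in FILE.
sshd_setting() {
    awk -v key="$2" '
        /^[ \t]*#/ { next }                          # skip comment lines
        tolower($1) == tolower(key) { print $2; exit }
    ' "$1"
}
```

For example, `sshd_setting /var/lib/lxc/n1/rootfs/etc/ssh/sshd_config PermitRootLogin` prints the configured value, and empty output means the keyword is not set in that file and sshd's built-in default applies.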

abailly commented 8 years ago

Yes, I managed to get lein test to succeed :-)

I am sorry, but I cannot debug this right now. There is a Docker version of Jepsen floating around; maybe you would have better luck with it?

hausdorff commented 8 years ago

Ok, I'll report back if I can get this to work. Thanks for your time!

abailly commented 8 years ago

I did not do much! JSch, which Jepsen uses for SSH, is aging...