AAROC / CODE-RADE

Website, documentation and such for the CODE-RADE project
http://www.africa-grid.org/CODE-RADE
Apache License 2.0

Jenkins update killed jobs #207

Closed brucellino closed 5 years ago

brucellino commented 6 years ago

I updated Jenkins itself and some plugins last week, and now we can't launch containers :-1:

Typical errors look like this:

[10/19/17 08:16:44] SSH Launch of local-12675a22d1296 on 0.0.0.0 failed in 4 ms
Oct 19, 2017 8:16:45 AM hudson.plugins.sshslaves.verifiers.TrileadVersionSupportManager getTrileadSupport
WARNING: Could not create Trilead support class. Using legacy Trilead features
java.lang.ClassNotFoundException: hudson.plugins.sshslaves.verifiers.JenkinsTrilead9VersionSupport
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at org.eclipse.jetty.webapp.WebAppClassLoader.loadClass(WebAppClassLoader.java:560)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at hudson.plugins.sshslaves.verifiers.TrileadVersionSupportManager.createVersion9Instance(TrileadVersionSupportManager.java:52)
    at hudson.plugins.sshslaves.verifiers.TrileadVersionSupportManager.getTrileadSupport(TrileadVersionSupportManager.java:32)
    at hudson.plugins.sshslaves.verifiers.SshHostKeyVerificationStrategy.getPreferredKeyAlgorithms(SshHostKeyVerificationStrategy.java:68)
    at hudson.plugins.sshslaves.SSHLauncher$2.call(SSHLauncher.java:796)
    at hudson.plugins.sshslaves.SSHLauncher$2.call(SSHLauncher.java:792)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

This looks like it has to do with the sshslaves plugin, specifically the key verification strategy.

The hosts are of course Docker containers, and I'm not sure whether the SSH keys it's referring to here are the user keys or the host keys.

brucellino commented 6 years ago

A new version (1.22) of the SSH Slaves plugin became available today. Updating that plugin.

brucellino commented 6 years ago

Well... from other [tickets](https://github.com/jenkinsci/docker-plugin/issues/130), it seems that the Trilead thing is just a warning.

The actual issue is instead, perhaps, related to the docker plugin (the trace comes from docker-java, `com.github.dockerjava`, rather than the GitHub plugin):

SEVERE: Error during callback
com.github.dockerjava.api.exception.NotModifiedException: 
    at com.github.dockerjava.netty.handler.HttpResponseHandler.channelRead0(HttpResponseHandler.java:97)
    at com.github.dockerjava.netty.handler.HttpResponseHandler.channelRead0(HttpResponseHandler.java:33)
    at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
    at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:241)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
    at io.netty.channel.CombinedChannelDuplexHandler$DelegatingChannelHandlerContext.fireChannelRead(CombinedChannelDuplexHandler.java:438)
    at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:310)
    at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:284)
    at io.netty.channel.CombinedChannelDuplexHandler.channelRead(CombinedChannelDuplexHandler.java:253)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
    at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1334)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
    at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:926)
    at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:997)
    at io.netty.channel.epoll.EpollDomainSocketChannel$EpollDomainUnsafe.epollInReady(EpollDomainSocketChannel.java:138)
    at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:401)
    at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:306)
    at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
    at java.lang.Thread.run(Thread.java:748)

brucellino commented 6 years ago

OK, I've narrowed this down further to a problem with the CentOS 6 image. Turns out the "host" key has changed:

@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@    WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!     @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that a host key has just been changed.
The fingerprint for the RSA key sent by the remote host is
SHA256:dbD3hUQ7VrGyBH4TrAzrZ884lhTnTCSuWj2LLzG4YdE.
Please contact your system administrator.
Add correct host key in /home/ansible/.ssh/known_hosts to get rid of this message.
Offending ECDSA key in /home/ansible/.ssh/known_hosts:1
RSA host key for [172.17.0.2]:5200 has changed and you have requested strict checking.
Host key verification failed.
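
In practice, `ssh-keygen -R '[172.17.0.2]:5200'` is the usual way to clear an entry like this. To illustrate what that cleanup amounts to, here is a minimal Python sketch, assuming plain-text (unhashed) entries; the function name `drop_host_entries` and the key strings are placeholders, not anything from the real setup.

```python
def drop_host_entries(known_hosts_text, host, port=22):
    """Return known_hosts text without plain-text entries for host:port."""
    # OpenSSH writes non-default ports as "[host]:port" in known_hosts.
    target = host if port == 22 else f"[{host}]:{port}"
    kept = []
    for line in known_hosts_text.splitlines():
        fields = line.split()
        # Keep blank lines and comments untouched.
        if not fields or fields[0].startswith("#"):
            kept.append(line)
            continue
        # A line may start with a marker such as @cert-authority or @revoked.
        has_marker = fields[0].startswith("@") and len(fields) > 1
        hosts_field = fields[1] if has_marker else fields[0]
        # The hosts field can hold several comma-separated names;
        # this sketch drops the whole line if any of them matches.
        if target in hosts_field.split(","):
            continue
        kept.append(line)
    return "\n".join(kept) + ("\n" if kept else "")


# Placeholder key material, for illustration only.
sample = (
    "[172.17.0.2]:5200 ecdsa-sha2-nistp256 AAAAE2VjZHNhPLACEHOLDER\n"
    "192.168.1.5 ssh-rsa AAAAB3NzaPLACEHOLDER\n"
)
cleaned = drop_host_entries(sample, "172.17.0.2", 5200)
print(cleaned, end="")
```

Hashed entries (lines starting with `|1|`) would not match here, which is one reason to prefer `ssh-keygen -R` on a real file.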

brucellino commented 6 years ago

So, two issues have arisen:

  1. The host key verification was failing (see previous comment).
  2. The nodes were being provisioned too slowly for Jenkins to confirm connections to them, so it put them offline.

I've updated the Jenkins known_hosts and set the host key verification strategy to no verification. Hopefully adding some timeouts and delays to the creation of the containers will alleviate issue 2.

https://github.com/docker-java/docker-java/issues/98 and https://github.com/jenkinsci/docker-plugin/issues/57 seem to confirm this.
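
The "wait until the container is actually reachable" idea can be sketched roughly as below. This is not how the Jenkins docker plugin does it internally (its own launch and retry settings are the real knobs); it is only a stand-alone illustration of polling a TCP port with a deadline. The function name `wait_for_port` and the default values are assumptions.

```python
import socket
import time


def wait_for_port(host, port, timeout=60.0, interval=2.0):
    """Poll until a TCP connection to host:port succeeds or timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening (e.g. sshd).
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Refused or timed out: give up after the deadline, else retry.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

For the container in this thread, a call would look like `wait_for_port("172.17.0.2", 5200)`. A successful connect only shows that the port is open, not that sshd is ready to authenticate, so some extra delay may still be needed.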

brucellino commented 6 years ago

OK, quick update. I checked the ciphers allowed by the containers:

ssh jenkins@172.17.0.2 -p 5200 -Q cipher
3des-cbc
blowfish-cbc
cast128-cbc
arcfour
arcfour128
arcfour256
aes128-cbc
aes192-cbc
aes256-cbc
rijndael-cbc@lysator.liu.se
aes128-ctr
aes192-ctr
aes256-ctr
aes128-gcm@openssh.com
aes256-gcm@openssh.com
chacha20-poly1305@openssh.com

and for the key exchanges:

ssh jenkins@172.17.0.2 -p 5200 -Q kex
diffie-hellman-group1-sha1
diffie-hellman-group14-sha1
diffie-hellman-group-exchange-sha1
diffie-hellman-group-exchange-sha256
ecdh-sha2-nistp256
ecdh-sha2-nistp384
ecdh-sha2-nistp521
curve25519-sha256@libssh.org
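
The two lists above can be screened for legacy algorithms with a few lines of Python. The `LEGACY` set below is an assumption for illustration (algorithms commonly regarded as outdated), not an authoritative policy. It is also worth keeping in mind that `ssh -Q` reports what the `ssh` binary running the command supports, so the result describes that client build.

```python
# Assumption: an illustrative set of algorithms commonly regarded as legacy.
LEGACY = {
    "3des-cbc", "blowfish-cbc", "cast128-cbc",
    "arcfour", "arcfour128", "arcfour256",
    "diffie-hellman-group1-sha1",
}


def split_legacy(names):
    """Split algorithm names into (legacy, other), preserving order."""
    legacy = [n for n in names if n in LEGACY]
    other = [n for n in names if n not in LEGACY]
    return legacy, other


# Output of the two `ssh -Q` commands above.
ciphers = """
3des-cbc blowfish-cbc cast128-cbc arcfour arcfour128 arcfour256
aes128-cbc aes192-cbc aes256-cbc rijndael-cbc@lysator.liu.se
aes128-ctr aes192-ctr aes256-ctr
aes128-gcm@openssh.com aes256-gcm@openssh.com chacha20-poly1305@openssh.com
""".split()

kex = """
diffie-hellman-group1-sha1 diffie-hellman-group14-sha1
diffie-hellman-group-exchange-sha1 diffie-hellman-group-exchange-sha256
ecdh-sha2-nistp256 ecdh-sha2-nistp384 ecdh-sha2-nistp521
curve25519-sha256@libssh.org
""".split()

legacy_ciphers, other_ciphers = split_legacy(ciphers)
legacy_kex, other_kex = split_legacy(kex)
print("legacy ciphers:", legacy_ciphers)
print("legacy kex:", legacy_kex)
```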

There are some weak ciphers and key exchanges, but I don't think that this is the problem. Rather, it's the host key generation:

# Generate one host key pair per type; skipped when the key file already exists.
- name: generate host keys
  command: "ssh-keygen -f /etc/ssh/ssh_host_{{ item }}_key -N '' -t {{ item }}"
  args:
    creates: "/etc/ssh/ssh_host_{{ item }}_key"
  with_items:
    - rsa
    - dsa
    - ecdsa

in github.com/AAROC/CODE-RADE-Container (tasks/ssh-config.yml).

Probably the right thing to do is to keep a fixed set of host keys, put them in the container, and then let Jenkins trust them somehow. :man_shrugging:
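
One possible shape for the "let Jenkins trust them" part is to turn each stored public host key into a known_hosts line. The sketch below only does the formatting; the function name `known_hosts_entry` and the key string are placeholders, and how the line would get into the Jenkins known_hosts file is left open.

```python
def known_hosts_entry(host, port, public_key_line):
    """Format one known_hosts line from the contents of a ssh_host_*_key.pub file."""
    # A .pub file holds "<key-type> <base64-key> [comment]".
    fields = public_key_line.split()
    if len(fields) < 2:
        raise ValueError("expected '<key-type> <base64-key> [comment]'")
    key_type, key_data = fields[0], fields[1]
    # Non-default ports are written as "[host]:port" in known_hosts.
    host_field = host if port == 22 else f"[{host}]:{port}"
    return f"{host_field} {key_type} {key_data}"


# Placeholder key material, for illustration only.
entry = known_hosts_entry("172.17.0.2", 5200, "ssh-rsa AAAAB3PLACEHOLDER root@build")
print(entry)
```

With fixed keys stored this way, a container rebuild would no longer change the fingerprint that Jenkins sees.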

brucellino commented 5 years ago

Solved by rebuilding the containers.