Corymbia / eucalyptus

Eucalyptus Cloud-computing Platform
https://eucalyptus.cloud/

Some component is repeatedly causing an exception, and this is preventing instance startup #339

Status: Open. flyn-org opened this issue 2 years ago.

flyn-org commented 2 years ago

I am running Eucalyptus 5.1 on CentOS 7.9.2009. Something is causing Eucalyptus to enter a state in which it no longer starts instances. When I try to start an instance, it stays in the "pending" state for about 20 minutes and then stops. While this happens, the message below is written to the logs repeatedly.

Restarting all of the Eucalyptus services on the host does not seem to help. Rebooting the host does seem to restore Eucalyptus to a usable state, but the symptoms eventually reappear.

Fri Mar 4 21:26:07 2022 ERROR [NioServerHandler:web-services-worker-pool-7] [com.eucalyptus.ws.server.NioServerHandler.exceptionCaught(NioServerHandler.java):174] Internal Error: Invalid request : com.ctc.wstx.exc.WstxEOFException: Unexpected EOF in prolog
 at [row,col {unknown-source}]: [1,0]
Fri Mar 4 21:26:07 2022 ERROR [BroadcastNetworkInfoCallback:eucalyptus-bootstrap-callbacks-basiccallbackprocessor-worker-2329] [com.eucalyptus.cluster.callback.BroadcastNetworkInfoCallback.fireException(BroadcastNetworkInfoCallback.java):74] Error in network information broadcast: Action:ProblemAction Code:soapenv:Client Id:RelatesTo Error: Invalid request : com.ctc.wstx.exc.WstxEOFException: Unexpected EOF in prolog
 at [row,col {unknown-source}]: [1,0]Invalid request : com.ctc.wstx.exc.WstxEOFException: Unexpected EOF in prolog
 at [row,col {unknown-source}]: [1,0]
com.eucalyptus.ws.EucalyptusRemoteFault: Action:ProblemAction Code:soapenv:Client Id:RelatesTo Error: Invalid request : com.ctc.wstx.exc.WstxEOFException: Unexpected EOF in prolog
 at [row,col {unknown-source}]: [1,0]Invalid request : com.ctc.wstx.exc.WstxEOFException: Unexpected EOF in prolog
 at [row,col {unknown-source}]: [1,0]
 at [row,col {unknown-source}]: [1,0]Invalid request : com.ctc.wstx.exc.WstxEOFException: Unexpected EOF in prolog
 at [row,col {unknown-source}]: [1,0]
    at com.eucalyptus.ws.handlers.IoSoapHandler.perhapsFault(IoSoapHandler.java:175)
    at com.eucalyptus.ws.handlers.IoSoapHandler.channelRead(IoSoapHandler.java:81)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:342)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:328)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:321)
    at io.netty.channel.ChannelInboundHandlerAdapter.channelRead(ChannelInboundHandlerAdapter.java:86)
    at com.eucalyptus.ws.handlers.IoSoapMarshallingHandler.channelRead(IoSoapMarshallingHandler.java:102)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:342)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:328)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:321)
    at io.netty.channel.ChannelInboundHandlerAdapter.channelRead(ChannelInboundHandlerAdapter.java:86)
    at com.eucalyptus.ws.handlers.IoMessageWrapperHandler.channelRead(IoMessageWrapperHandler.java:58)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:342)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:328)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:321)
    at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:342)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:328)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:321)
    at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:293)
    at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:267)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:342)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:328)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:321)
    at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1070)
    at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:904)
    at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:411)
    at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:248)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:342)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:328)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:321)
    at io.netty.channel.ChannelInboundHandlerAdapter.channelRead(ChannelInboundHandlerAdapter.java:86)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:342)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:328)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:321)
    at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:342)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:328)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:321)
    at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1280)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:342)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:328)
    at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:890)
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:564)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:505)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:419)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:391)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:112)
    at java.base/java.lang.Thread.run(Thread.java:829)

nc.log contains entries like this, which we think might be related:

version (b3f3c403) and applied version ((null)) do not match (yet), waiting
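The nc.log line above appears to compare an expected version with the version the node controller has actually applied; an applied value of "(null)" presumably means that no version has been applied yet. A minimal sketch for pulling out the most recent such pair, assuming the message wording matches the single line quoted above. The regular expression and the function name `last_version_mismatch` are hypothetical and are not part of Eucalyptus.

```python
import re
from typing import Iterable, Optional, Tuple

# Hypothetical pattern, written against the single nc.log line quoted above:
#   version (b3f3c403) and applied version ((null)) do not match (yet), waiting
# Other Eucalyptus releases may word or format this message differently.
MISMATCH_RE = re.compile(
    r"version \((?P<expected>[^()]*)\) and applied version \((?P<applied>.*)\) do not match"
)


def last_version_mismatch(lines: Iterable[str]) -> Optional[Tuple[str, str]]:
    """Return (expected, applied) from the last mismatch line, or None if there is none."""
    latest = None
    for line in lines:
        match = MISMATCH_RE.search(line)
        if match:
            latest = (match.group("expected"), match.group("applied"))
    return latest
```

Passing the open nc.log file object (its location depends on the installation) would return ("b3f3c403", "(null)") if the quoted line were the last match. If the applied value stays "(null)" across repeated checks, the node controller has apparently not picked up the new network information.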
obino commented 2 years ago

@flyn-org, could you try the workaround in https://github.com/Corymbia/eucalyptus/issues/307 and report back? Thanks!

flyn-org commented 2 years ago

@obino, thank you, this looks promising. Here is what I did:

  1. Ran euctl bootstrap.webservices.http_max_chunk_bytes=153600. I was not sure if I needed to restart the Eucalyptus services, so I did not restart them.
  2. Began starting instances using euca-start-instances. I started one at a time, waiting for each to enter a running state.
  3. I made it to the last instance (number 51) before seeing the EOF-related exception above.
  4. I then ran euctl bootstrap.webservices.http_max_chunk_bytes=307200.
  5. The EOF-related exception stopped. In total, it fired seven times between steps 3 and 4. I am not sure whether step 4 caused the EOF-related exceptions to stop or whether it was a coincidence.
  6. I was then able to start the final instance.
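A count like the one in step 5 (seven occurrences between steps 3 and 4) can be reproduced by filtering the log by timestamp. A minimal sketch, assuming the timestamp layout of the excerpt above (for example "Fri Mar 4 21:26:07 2022"). Because the excerpt shows both NioServerHandler and BroadcastNetworkInfoCallback reporting the same failure, the hypothetical `source` parameter restricts the count to one component so that each failure is not counted twice. The function name `count_eof_errors` is illustrative only.

```python
from datetime import datetime
from typing import Iterable

# Assumed from the excerpt above, e.g. "Fri Mar 4 21:26:07 2022".
TIMESTAMP_FORMAT = "%a %b %d %H:%M:%S %Y"
MARKER = "Unexpected EOF in prolog"


def count_eof_errors(
    lines: Iterable[str],
    start: datetime,
    end: datetime,
    source: str = "NioServerHandler",
) -> int:
    """Count ERROR entries from `source` that mention MARKER, timestamped in [start, end)."""
    count = 0
    for line in lines:
        # Top-level entries look like "<timestamp> ERROR [<source>:<thread>] ...";
        # lines without " ERROR ", or from another source, are skipped.
        head, sep, rest = line.partition(" ERROR ")
        if not sep or not rest.startswith("[" + source + ":") or MARKER not in rest:
            continue
        try:
            stamp = datetime.strptime(head.strip(), TIMESTAMP_FORMAT)
        except ValueError:
            continue  # the text before " ERROR " was not a timestamp
        if start <= stamp < end:
            count += 1
    return count
```

To use it, open the log in which the exception appeared and pass the file object together with the times at which steps 3 and 4 were run.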

There has been what seems to be some non-determinism mucking up our experiments, so I am hesitant to state conclusively whether this solved our problem. That said, it seems to help, and I am very thankful for the pointer. I will report back in a few days on whether things remain stable.