dain / galaxy-server

Galaxy has been renamed and moved to Airship
https://github.com/airlift/airship

Timeout during upgrade #28

Open electrum opened 12 years ago

electrum commented 12 years ago

The first upgrade command timed out:

[23:00 ubuntu@i-64580004:~ prod] galaxy upgrade -b discovery-elb discovery-elb:1.2 @discovery-elb:1
uuid  host         machine     status   binary             config                      
0561  10.94.13.48  i-4c451d2c  STOPPED  discovery-elb:1.1  @discovery-elb:general:1.0

Are you sure you would like to UPGRADE these servers? [y/N] y

java.net.SocketTimeoutException: Read timed out

The second upgrade returned a weird error:

[23:00 ubuntu@i-64580004:~ prod] galaxy upgrade -b discovery-elb discovery-elb:1.2 @discovery-elb:1
uuid  host         machine     status   binary             config                      
0561  10.94.13.48  i-4c451d2c  STOPPED  discovery-elb:1.1  @discovery-elb:general:1.0

Are you sure you would like to UPGRADE these servers? [y/N] y

uuid  host         machine     status   binary             config                      
0561  10.94.13.48  i-4c451d2c  UNKNOWN  discovery-elb:1.1  @discovery-elb:general:1.0  UnexpectedResponseException{request=Request{uri=http://10.94.13.48:65000/v1/agent/slot/0561a95c-8c22-417e-963a-981b2ff9b3fb/assignment, method='PUT', headers={x-galaxy-agent-version=[b9bcdfa080fe634c57f41dd88c09542e], x-galaxy-slot-version=[21e7371f3c7e9c64628d44b964c456e2], Content-Type=[application/json]}, bodyGenerator=com.proofpoint.http.client.JsonBodyGenerator@15fd3c35}, statusCode=500, statusMessage='Could not obtain slot lock within 1000.00ms held by null thread is at  com.proofpoint.galaxy.agent.DeploymentSlot.lock(DeploymentSlot.java:346)   at com.proofpoint.galaxy.agent.DeploymentSlot.assign(DeploymentSlot.java:163)   at com.proofpoint.galaxy.agent.AssignmentResource.assign(AssignmentResource.java:70)   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)   at java.lang.reflect.Method.invoke(Method.java:597)   at com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60)   at com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$ResponseOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:205)   at com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75)   at com.sun.jers', headers={Content-Length=[10834], Content-Type=[text/html;charset=ISO-8859-1], Cache-Control=[must-revalidate,no-cache,no-store]}}

The third succeeded:

[23:00 ubuntu@i-64580004:~ prod] galaxy upgrade -b discovery-elb discovery-elb:1.2 @discovery-elb:1
uuid  host         machine     status   binary             config            
0561  10.94.13.48  i-4c451d2c  STOPPED  discovery-elb:1.2  @discovery-elb:1

Are you sure you would like to UPGRADE these servers? [y/N] y

uuid  host         machine     status   binary             config            
0561  10.94.13.48  i-4c451d2c  STOPPED  discovery-elb:1.2  @discovery-elb:1

The timeout might be caused by the Nexus proxy being slow. This was the first access for that artifact.

dain commented 12 years ago

For the first one, the request timed out in the client. For the second one, the agent timed out waiting for the slot lock, because it was still running the first upgrade request. If you look closely at the third request, the server was already at version 1.2 and you simply upgraded it to 1.2 again.
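To illustrate the sequence described above, here is a minimal sketch of a slot guarded by a lock with a bounded wait. It is not the actual `DeploymentSlot` code: the class name `SlotLockSketch`, the `assign` signature, and the returned strings are assumptions made for illustration. Only the idea of giving up after a fixed wait for the slot lock comes from the error message in the report.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical stand-in for an agent slot; not the real DeploymentSlot.
class SlotLockSketch {
    private final ReentrantLock lock = new ReentrantLock();

    // Runs the action while holding the slot lock. If the lock cannot be
    // obtained within timeoutMillis, returns an error message instead.
    public String assign(Runnable action, long timeoutMillis) throws InterruptedException {
        if (!lock.tryLock(timeoutMillis, TimeUnit.MILLISECONDS)) {
            return "Could not obtain slot lock within " + timeoutMillis + "ms";
        }
        try {
            action.run();
            return "OK";
        } finally {
            lock.unlock();
        }
    }

    // Replays the three requests: a slow first one, a second one that
    // arrives while the first still holds the lock, and a third one that
    // arrives after the first has finished.
    public static String demo() throws InterruptedException {
        SlotLockSketch slot = new SlotLockSketch();
        CountDownLatch started = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);

        Thread first = new Thread(() -> {
            try {
                slot.assign(() -> {
                    started.countDown();          // lock is held from here on
                    try {
                        release.await();          // simulates the slow download
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }, 1000);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        first.start();
        started.await();

        String second = slot.assign(() -> {}, 50); // lock still held: times out
        release.countDown();
        first.join();
        String third = slot.assign(() -> {}, 50);  // lock free again: succeeds
        return second + " / " + third;
    }
}
```

In this sketch the second call fails only because the first is still in progress, and the third succeeds once the first has completed, which matches the order of events in the report.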

So all of the problems were caused by the first request taking a long time, most likely because the binary was being downloaded into your Nexus repo. The third command was fast because the binary was already in your Nexus repo.

The real problem here is that we time out too aggressively for long-running commands like install and stop, and that we need transient states like "installing", "restarting" and "stopping", so the user knows what is going on.
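One possible shape for such transient states is sketched below. This is an assumption about how it could look, not the project's design: the class `TransientStateSketch`, its `install` method, and the enum are hypothetical. `STOPPED` and `UNKNOWN` are taken from the CLI output above, and the three transient values are the ones named in this comment.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical slot that publishes a transient state while it is busy.
class TransientStateSketch {
    public enum Status { STOPPED, UNKNOWN, INSTALLING, RESTARTING, STOPPING }

    private final AtomicReference<Status> status =
            new AtomicReference<>(Status.STOPPED);

    // Reading the status never blocks, even while an install is running.
    public Status status() {
        return status.get();
    }

    // Marks the slot INSTALLING for the duration of the download, then
    // returns it to STOPPED. Returns false if the slot is not STOPPED,
    // for example because another install is already in progress.
    public boolean install(Runnable download) {
        if (!status.compareAndSet(Status.STOPPED, Status.INSTALLING)) {
            return false;
        }
        try {
            download.run();
        } finally {
            status.set(Status.STOPPED);
        }
        return true;
    }
}
```

With something along these lines, a second request that arrives during a slow install could be answered immediately with the slot shown as INSTALLING, instead of waiting on the lock and failing with a 500.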