codewars / codewars.com

Issue tracker for Codewars
https://www.codewars.com
BSD 2-Clause "Simplified" License

Server Execution Error with STDERR "Error response from daemon: Container {id} is not running" #993

Closed Voileexperiments closed 7 years ago

Voileexperiments commented 7 years ago

I'm doing this kata and the runner frequently gives me this error when I try to submit my code (both example and actual tests):

[screenshot of the Server Execution Error]

kazk commented 7 years ago

I also got this while doing Walk-up Stairs in Python. I only got this once for sample tests.

[screenshot of the error]


@jhoffner is this a feature of the recently updated #runner_frame to show the details of a Server Execution Error? I don't remember seeing stderr with a Server Execution Error before.

#978 is a similar issue, showing a message about /var/run/docker.sock.

jhoffner commented 7 years ago

How often is this happening? We run on AWS ECS, and Amazon in their infinite wisdom decided to auto-update the ECS instance's Docker version on machine start.

Docker was updated from 1.12.6 to 17.03.1-ce recently. It looks like at this moment only one of our server instances is still running the older version, so the rest probably cycled out recently, which is why this started happening. Since we use our own scheduling engine with Docker at a low level, the new version is likely triggering a new edge case that the scheduler needs to account for.

If this is only happening sporadically I'll probably wait until I'm back in the office on Tuesday to dive into this.

jhoffner commented 7 years ago

@kazk I deployed some updates to the error reporting before I left for vacation. Though in this case it looks like it's reporting STDOUT from the CLI, not the spawned process.

kazk commented 7 years ago

> How often is this happening?

Not often for me. I had this once and couldn't reproduce #978. @Voileexperiments experienced #978 frequently, but it's not happening anymore.

> auto-update the ECS instance's Docker version on machine start

Is there any way to stop this? Automatically updating Docker sounds painful :(

> Docker was updated from 1.12.6 to 17.03.1-ce recently.

They changed the version scheme to YY.MM and now it's released monthly for edge and quarterly for stable. I added some links to relevant releases at the end of this comment for convenience.

> Since we use our own scheduling engine with Docker at a low level, the new version is likely causing a new edge case to pop up that the scheduler needs to account for.

How is this implemented? Are you using the Docker Engine API? The APIs are versioned and supposed to be backward compatible.


> Though in this case it looks like it's reporting STDOUT from the CLI, not the spawned process.

Yeah, looks like this one failed because the container with that id existed, but it wasn't running.


kazk commented 7 years ago

@jhoffner I just got this again. I was able to inspect 2 responses after the first one. I'm not sure what workerId is (instance id?), but it was w-6948 for both.

{
  "type": "execution error",
  "message": "The server was unable to process the request. Please try again.",
  "stderr": "Error response from daemon: Container ddec9d7c556b5b51dea287ff53f94b1c7ddd9cbb9ad5364b8d9d60c5aaae7605 is not running\n",
  "exitCode": 1,
  "exitSignal": "",
  "setupTime": null,
  "totalTime": 476,
  "receivedAt": 1497034541143,
  "responseAt": 1497034541619,
  "languageVersion": "6.x",
  "workerId": "w-6948",
  "result": {
    "serverError": true,
    "completed": false,
    "output": [],
    "successMode": "assertions",
    "passed": 0,
    "failed": 0,
    "errors": 0,
    "assertions": {
      "passed": 0,
      "failed": 0,
      "hidden": {
        "passed": 0,
        "failed": 0
      }
    },
    "specs": {
      "passed": 0,
      "failed": 0,
      "hidden": {
        "passed": 0,
        "failed": 0
      }
    },
    "unweighted": {
      "passed": 0,
      "failed": 0
    },
    "weighted": {
      "passed": 0,
      "failed": 0
    },
    "timedOut": false
  },
  "token": "[TOKEN]"
}

I don't know about the frequency; it's not too often. But I was able to see this a few times within 15 minutes by submitting many times.
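Until the server-side issue is fixed, a client could work around this by retrying on a server error response. A minimal sketch; the `run` callback, retry policy, and function name here are assumptions for illustration, not Codewars' actual client code, but the response shape (`type: "execution error"`) matches the JSON above:

```javascript
// Retry a code-runner request when the response reports a server-side
// execution error, pausing briefly between attempts.
// `run` is any async function resolving to a response object shaped
// like the one above ({ type, message, stderr, ... }).
async function runWithRetry(run, { attempts = 3, delayMs = 500 } = {}) {
  let last;
  for (let i = 0; i < attempts; i++) {
    last = await run();
    if (last.type !== 'execution error') return last;
    await new Promise(resolve => setTimeout(resolve, delayMs));
  }
  return last; // still an error after all attempts
}
```

This only papers over the failure, of course; the retried request may land on the same misbehaving worker.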

twypsy commented 7 years ago

I've been getting this error ALL day, seriously. Every time I submit either a test or a final solution.

"Error response from daemon: Container 8dc259cec3295d64a2a43385b34e4756e48fe711a8698d6def8032f9a852450d is not running"

kazk commented 7 years ago

I tried again logged out (no RED) and saved the network logs as a HAR using Chrome DevTools, then examined them with jq; 13 of 61 requests were errors.

# def runs: .log.entries | .[] | select(.request.method == "POST" and .request.url == "https://runner.codewars.com/run");
# def rjson: .response.content.text | fromjson;

# Total Requests
# [runs] | length
61
# Errors
# [runs | rjson | select(.type == "execution error")] | length
13

# workerId of 'execution error'
# [runs | rjson | select(.type == "execution error") | .workerId] | sort | unique
[
  "w-55226",
  "w-57729",
  "w-66504",
  "w-75614",
  "w-96997"
]

# workerId of 'execution success'
# [runs | rjson | select(.type == "execution success") | .workerId] | sort | unique
[
  null, # some responses were missing `workerId`
  "w-20771",
  "w-20881",
  "w-23913",
  "w-28982",
  "w-34863",
  "w-42848",
  "w-56175",
  "w-57966",
  "w-63153",
  "w-70013",
  "w-76801",
  "w-87379"
]
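The same analysis the jq filters above perform can be sketched in plain Node (the `analyzeHar` helper name is mine; the HAR field paths and runner URL follow the jq filters, and the response shape follows the JSON earlier in this thread):

```javascript
// Summarize runner responses recorded in a HAR capture: count the
// POSTs to the runner, count execution errors, and collect the unique
// workerIds seen for each outcome.
function analyzeHar(har) {
  const runs = har.log.entries.filter(
    e => e.request.method === 'POST' &&
         e.request.url === 'https://runner.codewars.com/run'
  );
  const responses = runs.map(e => JSON.parse(e.response.content.text));
  const workerIds = type => [...new Set(
    responses.filter(r => r.type === type).map(r => r.workerId)
  )].sort();
  return {
    total_count: runs.length,
    error_count: responses.filter(r => r.type === 'execution error').length,
    error_workers: workerIds('execution error'),
    success_workers: workerIds('execution success'),
  };
}
```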

jhoffner commented 7 years ago

@luisgc91 is there a specific kata/language you are training on? I still haven't reproduced it, but I can clearly see from the error logs that this is affecting about a sixth of all requests, which matches what @kazk saw from his test run. All instances have now transitioned over to the latest Docker version, so it's not a mixed-version issue.

> Are you using the Docker Engine API?

We are mostly using the Dockerode library. I still need to investigate which API version it targets. In some cases the Docker CLI is called directly.

> Is there any way to stop this? Automatically updating Docker sounds painful :(

Yes, this needs to be stopped. I haven't figured out a way yet, but I haven't investigated fully.

kazk commented 7 years ago

@jhoffner

> We are mostly using the Dockerode library.

Are you using the version option? Dockerode's README has this:

var Docker = require('dockerode');
var docker5 = new Docker({
  // omitted
  version: 'v1.25' // required when Docker >= v1.13, https://docs.docker.com/engine/api/version-history/
});

Looks like 1.12.6 works fine without this, but 17.03.1-ce needs it?

kazk commented 7 years ago

Docker v1.13 has Engine API v1.25 which introduced the following change:

  • The API version is now required in all API calls. Instead of just requesting, for example, the URL /containers/json, you must now request /v1.25/containers/json.

Engine API version history (v1.25)
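Concretely, a client talking to the Engine API has to prefix every request path with the API version it targets. A tiny illustrative helper (the `apiPath` name is mine; the version string and path are the ones discussed in this thread):

```javascript
// Build an Engine API request path, prefixing the API version as
// required since Engine API v1.25 (shipped with Docker v1.13).
function apiPath(path, version = 'v1.25') {
  return `/${version}${path.startsWith('/') ? path : `/${path}`}`;
}
```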

kazk commented 7 years ago

@jhoffner Are you using Docker for AWS?

> Stable is released quarterly and is for users that want an easier-to-maintain release pace.

If Dockerode's version option fixes this issue, auto-updates might not be as painful, because the versioned Engine APIs are advertised as backward compatible. And automatically updating to the latest stable version is good for security reasons.

jhoffner commented 7 years ago

Looks like this issue is only happening for Python and JavaScript, oddly enough. I can't find logs for any other language, and when I try using Ruby I can't reproduce it.

> Are you using Docker for AWS?

We are using http://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-optimized_AMI.html. Though I haven't updated the underlying image in a while. I think the ecs-init package has Docker updates baked in, which is what held me up from finding a quick remedy for preventing auto-updates, if I recall correctly. We create and destroy so many one-time-use containers in the course of a day that I think auto-updates will always be a risk. I doubt Docker is stress-testing our use case, and there are already so many edge-case recovery conditions coded into the custom scheduler engine to deal with odd container start/stop/exec/remove conditions and status codes.

> If Dockerode's version option fixes this issue

I've added this and deployed, but the issues persist.

twypsy commented 7 years ago

@jhoffner I am training in JavaScript. The problem persists in almost all katas.

kazk commented 7 years ago

@jhoffner

According to the docs, the ECS-optimized AMI is based on the Amazon Linux AMI and applies security updates automatically on boot. I think the repo_upgrade field in /etc/cloud/cloud.cfg controls this. To turn it off:

# /etc/cloud/cloud.cfg
# repo_upgrade: security
repo_upgrade: none

See Q: How do I disable the automatic installation of critical and important security updates on initial launch?
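For instances launched fresh, the same setting could be supplied via cloud-init user data at launch time. A sketch; whether repo_upgrade is honored from user data depends on the cloud-init version baked into the AMI:

```yaml
#cloud-config
# Disable the automatic security-update run on first boot,
# so Docker only gets updated deliberately.
repo_upgrade: none
```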

Maybe let them email you, so you can apply security updates manually?

> The Amazon ECS-optimized AMI receives regular updates for agent changes, Docker version updates, and Linux kernel security updates. You can subscribe to the AMI update Amazon SNS topic to receive notifications when a new Amazon ECS–optimized AMI is available. Subscribing to Amazon ECS–Optimized AMI Update Notifications


> Looks like this issue is only happening for Python and JavaScript, oddly enough. I can't find logs for any other language, and when I try using Ruby I can't reproduce it.

Yeah, I don't know why it's only affecting those two. I tried a few other languages and couldn't reproduce it either. JavaScript is still failing. Python is failing much less than earlier.

// jq -f analyze.jq < multiply-js.har | jq '{total_count, error_count}'
{"total_count": 60, "error_count": 8}
// jq -f analyze.jq < multiply-py.har | jq '{total_count, error_count}'
{"total_count": 56, "error_count": 0}
// jq -f analyze.jq < multiply-rb.har | jq '{total_count, error_count}'
{"total_count": 62, "error_count": 0}

twypsy commented 7 years ago

The error is getting worse... Every time I submit either a test or an attempt, I have to try at least 5 times. At the moment I am on the kata "Find the unique number", but it doesn't happen in only one kata.

Error response from daemon: Container b4d5dfd844e5b65f2a9e085fa0199856ba373e7f1cf3097bfb813ef414388427 is not running

If there's any way I might help, please tell me.

jhoffner commented 7 years ago

@luisgc91 The issue flared up around the time that you logged this issue, and stabilized back down to about 20% of the time (for JS) shortly after, according to my logs. I deployed a patch last night that seems to have reduced this issue down to less than 1% of the time. I would prefer it to be 0% of the time so I'm going to keep this open for now and continue to make improvements on this, but as of now it looks like the issue is much more manageable.

Voileexperiments commented 7 years ago

For the record, this is also happening for Haskell, not just JS or Python.

kazk commented 7 years ago

Also happened once for SQL. Maybe the popularity of the language (the number of users submitting at the time) correlates with the frequency of this issue happening?

jhoffner commented 7 years ago

Yeah, I've noticed that this is not specific to a language, which makes me feel better, because otherwise it would make absolutely no sense.

Across all requests, on average 2.5% are timing out (for reasons likely not due to bad user code) and 2.5% are failing due to this error. A 5% failure rate is not an acceptable range; it's something I'm trying to carve out some time for so that I can take a deep dive.

Voileexperiments commented 7 years ago

CW is giving me a new form of error:

[screenshot of the new error]

(It seems to only happen on this kata)

Voileexperiments commented 7 years ago

This hasn't happened again in a few weeks.

Closing.