Closed Voileexperiments closed 7 years ago
I also got this while doing Walk-up Stairs in Python. I only got this once for sample tests.
@jhoffner is this a feature of recently updated #runner_frame
to show the details of Server Execution Error
? I don't remember seeing stderr
with Server Execution Error
before.
/var/run/docker.sock
.How often is this happening? We run on AWS ECS and Amazon in their infinite wisdom decided to auto-update the ECS instance's Docker version on machine start automatically.
Docker was updated from 1.12.6 to 17.03.1-ce recently. It looks like at this moment only one of our server instances are still running the older version, so they probably cycle out recently and this is why this started happening. Since we use our own scheduling engine with Docker at a low level, the new version is likely causing a new edge case to pop up that the scheduler needs to account for.
If this is only happening sporadically I'll probably wait until I'm back in the office on Tuesday to dive into this.
@kazk I deployed some updates to the error reporting before I left for vacation.Though in this case it looks like its reporting STDOUT from the CLI, not the spawned process.
How often is this happening?
Not often for me. I had this once and couldn't reproduce #978. @Voileexperiments experienced #978 frequently, but it's not happening anymore.
auto-update the ECS instance's Docker version on machine start automatically.
Is there any way to stop this? Automatically updating Docker sounds painful :(
Docker was updated from 1.12.6 to 17.03.1-ce recently.
They changed the version scheme to YY.MM
and now it's released monthly for edge and quarterly for stable. I added some links to relevant releases at the end of this comment for convenience.
Since we use our own scheduling engine with Docker at a low level, the new version is likely causing a new edge case to pop up that the scheduler needs to account for.
How is this implemented? Are you using Docker Engine API? There're versioned APIs and supposed to be backward compatible.
Though in this case it looks like its reporting STDOUT from the CLI, not the spawned process.
Yeah, looks like this one failed because the container with that id
existed, but it wasn't running.
@jhoffner I just got this again.
I was able to inspect 2 responses after the first one.
I'm not sure what workerId
is (instance id?), but it was both w-6948
.
{
"type": "execution error",
"message": "The server was unable to process the request. Please try again.",
"stderr": "Error response from daemon: Container ddec9d7c556b5b51dea287ff53f94b1c7ddd9cbb9ad5364b8d9d60c5aaae7605 is not running\n",
"exitCode": 1,
"exitSignal": "",
"setupTime": null,
"totalTime": 476,
"receivedAt": 1497034541143,
"responseAt": 1497034541619,
"languageVersion": "6.x",
"workerId": "w-6948",
"result": {
"serverError": true,
"completed": false,
"output": [],
"successMode": "assertions",
"passed": 0,
"failed": 0,
"errors": 0,
"assertions": {
"passed": 0,
"failed": 0,
"hidden": {
"passed": 0,
"failed": 0
}
},
"specs": {
"passed": 0,
"failed": 0,
"hidden": {
"passed": 0,
"failed": 0
}
},
"unweighted": {
"passed": 0,
"failed": 0
},
"weighted": {
"passed": 0,
"failed": 0
},
"timedOut": false
},
"token": "[TOKEN]"
}
I don't know about frequency, it's not too often. But I was able to see this few times within 15 minutes by submitting many times.
I've been getting this error ALL day, seriously. Every time I submit either a test or a final solution.
"Error response from daemon: Container 8dc259cec3295d64a2a43385b34e4756e48fe711a8698d6def8032f9a852450d is not running"
I tried again logged out (no RED) and saved network logs as HAR
using Chrome DevTools, then examined this using jq
, 13/61 were errors.
# def runs: .log.entries | .[] | select(.request.method == "POST" and .request.url == "https://runner.codewars.com/run");
# def rjson: .response.content.text | fromjson;
# Total Requests
# [runs] | length
61
# Errors
# [runs | rjson | select(.type == "execution error")] | length
13
# workerId of 'execution error'
# [runs | rjson | select(.type == "execution error") | .workerId] | sort | unique
[
"w-55226",
"w-57729",
"w-66504",
"w-75614",
"w-96997"
]
# workerId of 'execution success'
# [runs | rjson | select(.type == "execution success") | .workerId] | sort | unique
[
null, # some responses were missing `workerId`
"w-20771",
"w-20881",
"w-23913",
"w-28982",
"w-34863",
"w-42848",
"w-56175",
"w-57966",
"w-63153",
"w-70013",
"w-76801",
"w-87379"
]
@luisgc91 is there a specific kata/language you are training on? I still haven't reproduced but I can clearly see from the error logs that this is affecting about a 6th of all requests, which matches what @kazk saw from his test run. All instances are now transitioned over to the latest Docker version so its not a mixed version issue.
Are you using Docker Engine API
We are mostly using the Dockerode library. I still need to investigate which API it is targeting. In some cases the docker cli is called directly.
Is there any way to stop this? Automatically updating Docker sounds painful :(
Yes, this needs to be stopped. I haven't figured out a way yet but I haven't investigated completely.
@jhoffner
We are mostly using the Dockerode library.
Are you using version
option? Dockerode's README has this:
var Docker = require('dockerode');
var docker5 = new Docker({
// omitted
version: 'v1.25' // required when Docker >= v1.13, https://docs.docker.com/engine/api/version-history/
});
Looks like 1.12.6
works fine without this, but 17.03.1-ce
needs it?
Docker v1.13 has Engine API v1.25 which introduced the following change:
- The API version is now required in all API calls. Instead of just requesting, for example, the URL
/containers/json
, you must now request/v1.25/containers/json
.
@jhoffner Are you using Docker for AWS?
Stable is released quarterly and is for users that want an easier-to-maintain release pace.
If Dockerode's version
option fixes this issue, auto-updates might not be as painful because versioned engine APIs are advertised as backward compatible. And automatically updating to latest stable version is good for security reasons.
Looks like this issu is only happening for Python and Javascript oddly enough. I can't find logs of any other language and when I try using Ruby I can't reproduce.
Are you using Docker for AWS?
We are using http://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-optimized_AMI.html. Though I haven't updated the underlying image in a while. I think the ecs-init package has docker updates baked in, which is what held me up from finding a quick remedy for preventing auto-updates - if I recall correctly. We create and destroy so many one-time use containers in the coarse of a day, I think auto-updates will always be a risk. I doubt Docker is stress testing our use-case, and there are already so many edge case recovery conditions coded into the custom scheduler engine to deal with odd container start/stop/exec/remove conditions and status codes.
If Dockerode's version option fixes this issue
I've added this and deployed but the issues persist.
@jhoffner I am training on Javascript. The problem persisted in almost all katas.
@jhoffner
According to the docs, ECS Optimized AMI is based on Amazon Linux AMI and applies security updates automatically on boot.
I think repo_upgrade
field in /etc/cloud/cloud.cfg
controls it.
To turn it off:
# /etc/cloud/cloud.cfg
# repo_upgrade: security
repo_upgrade: none
Maybe let them email you so you can update manually for security updates?
The Amazon ECS-optimized AMI receives regular updates for agent changes, Docker version updates, and Linux kernel security updates. You can subscribe to the AMI update Amazon SNS topic to receive notifications when a new Amazon ECS–optimized AMI is available. Subscribing to Amazon ECS–Optimized AMI Update Notifications
Looks like this issu is only happening for Python and Javascript oddly enough. I can't find logs of any other language and when I try using Ruby I can't reproduce.
Yeah, I don't know why it's only affecting those two. I tried few other languages and couldn't reproduce either. JavaScript is still failing. Python is failing much less than earlier.
// jq -f analyze.jq < multiply-js.har | jq '{total_count, error_count}'
{"total_count": 60, "error_count": 8}
// jq -f analyze.jq < multiply-py.har | jq '{total_count, error_count}'
{"total_count": 56, "error_count": 0}
// jq -f analyze.jq < multiply-rb.har | jq '{total_count, error_count}'
{"total_count": 62, "error_count": 0}
The error is getting worse ... Everytime I submit either a test or an attempt, I have to try at least 5 times. At the moment I am with the kata "Find the unique number", but it doesn't happen only in just one of them.
Error response from daemon: Container b4d5dfd844e5b65f2a9e085fa0199856ba373e7f1cf3097bfb813ef414388427 is not running
If there's any way I might help, please tell me.
@luisgc91 The issue flared up around the time that you logged this issue, and stabilized back down to about 20% of the time (for JS) shortly after, according to my logs. I deployed a patch last night that seems to have reduced this issue down to less than 1% of the time. I would prefer it to be 0% of the time so I'm going to keep this open for now and continue to make improvements on this, but as of now it looks like the issue is much more manageable.
For the record, this is also happening for Haskell, not just JS or Python.
Also happened once for SQL. Maybe the popularity of the language (the number of users submitting at the time) correlates with the frequency of this issue happening?
Yeah I've noticed that this is not specific to a language, which makes me feel better because otherwise that would make absolutely no sense.
Across all requests on avg 2.5% are timing out (for reasons not likely due to bad user code), and 2.5% are failing due to this error. 5% failure rate is not an acceptable range, it's something I'm trying to carve out some time for so that I can take a deep dive.
CW is giving me a new form of error:
(It seems to only happen on this kata)
This seems to have never happened again for a few weeks.
Closing.
I'm doing this kata and the runner frequently gives me this error when I try to submit my code (both example and actual tests):