cyber-dojo-retired / runner-stateless

repo for the cyberdojo/runner_stateless image
https://cyber-dojo.org
BSD 2-Clause "Simplified" License

ulimit for cpu will make tests timeout prematurely in multi core environments #2

Closed jelmerk closed 7 years ago

jelmerk commented 7 years ago

Hi, I was trying to add support for Scala to cyber-dojo. You can find my effort here: https://github.com/jelmerk/scala-scalatest

And I ran into the following issue

Tests are supposed to time out after 10 seconds. But often the container running the tests would be killed way before the 10 second mark.

The problem, I believe, lies in line 122 of runner.rb (https://github.com/cyber-dojo/runner_stateless/blob/master/server/src/runner.rb#L122), where you find the following line:

'--ulimit cpu=10:10', # max cpu time (seconds)

The cpu ulimit caps the amount of CPU time at the given number of seconds. It takes a soft limit, at which the process is sent a SIGXCPU signal, followed by a hard limit, at which it is sent a SIGKILL.

However, by default each container’s access to the host machine’s CPU cycles is unlimited. So the container will often be killed before the 10 second mark, which will lead to any number of cryptic error messages (in my case it killed the compiler).

One way to fix it would be to add the --cpus=1 flag, which you can read about in the Docker documentation. This would make the behaviour predictable, but would make everything run slower; so much slower, in fact, that I can't get Scala to execute the tests in time when this setting is enabled. So I am open to other suggestions, but can't think of any really good ones off the top of my head.

JonJagger commented 7 years ago

Hi Jelmer, awesome stuff - thanks.

There are actually two bits of code in the runner that limit the test run to 10 seconds. The first is at line 122, as you mention, and is a limit that applies from "inside" the container. The second is at line 204, and is a limit that applies from "outside" the container. How about we simply increase the ulimit on line 122 but leave line 204 as 10 seconds?

This is what you have already tried in your patch.
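To make the "outside" limit concrete, here is a minimal Ruby sketch of a wall-clock timeout enforced by the parent process. It is not the code at line 204 of runner.rb; the helper name `run_with_wall_clock_limit` and its return values are invented for illustration, and it assumes a Unix-like host with `sleep` on the PATH.

```ruby
require 'timeout'

# Hypothetical helper (not taken from runner.rb): runs a command and enforces
# a wall-clock limit from the outside. Returns :completed or :timed_out.
def run_with_wall_clock_limit(command, max_seconds)
  pid = Process.spawn(*command)
  begin
    # Wait for the child, but give up after max_seconds of wall-clock time.
    Timeout.timeout(max_seconds) { Process.wait(pid) }
    :completed
  rescue Timeout::Error
    # The kill happens regardless of how much CPU time the child has used.
    Process.kill('KILL', pid)
    Process.wait(pid) # reap the child so it does not linger as a zombie
    :timed_out
  end
end

# A sleeping process uses almost no CPU time, so a CPU ulimit would never
# stop it; the wall-clock limit still does.
puts run_with_wall_clock_limit(['sleep', '5'], 0.5)
```

Unlike a CPU ulimit, a limit of this kind depends only on elapsed time, not on how many cores the child keeps busy.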

Before I do that though I'd like to understand... You write: (1) "...by default each container’s access to the host machine’s CPU cycles is unlimited" and (2) "So the container will often be killed before the 10 second mark." I don't follow why 1 is true. Doesn't the cpu-ulimit mean that 1 is not true? What am I missing?

As a further aside, in the past I have also tried to add support for scala to cyber-dojo. I'm afraid I had to give up since the test runs were very slow. Hopefully it is a lot better now.

Cheers Jon

jelmerk commented 7 years ago

Hi Jon, thanks for responding so quickly!

My understanding is that the cpu ulimit assumes one core. But the host system running the docker container can have multiple cores, or use hyperthreading.

So assume the work of compiling and running my piece of Scala code runs on 2 cores and both are 100% utilized; then the Docker container would be killed (from the inside) after 5 seconds rather than 10.

Here's an example Java class for which the test passes when thumberOfThreads is set to 1 but fails when thumberOfThreads is set to say 8
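The arithmetic behind this can be sketched in Ruby, the runner's language. This is not the Java class referred to above, just a back-of-envelope model: it assumes all the work happens in a single process whose threads keep `busy_cores` cores fully utilized, and the function name is invented for illustration.

```ruby
# Back-of-envelope model (an assumption, not a measurement): the CPU ulimit
# counts CPU seconds summed over all threads of a process, so a process that
# keeps several cores fully busy reaches the limit sooner in wall-clock terms.
def wall_seconds_until_cpu_limit(cpu_limit_seconds, busy_cores)
  raise ArgumentError, 'busy_cores must be positive' unless busy_cores.positive?

  cpu_limit_seconds.to_f / busy_cores
end

[1, 2, 4, 8].each do |cores|
  seconds = wall_seconds_until_cpu_limit(10, cores)
  printf("%d busy core(s): limit reached after about %.2f s\n", cores, seconds)
end
```

Real workloads rarely keep every core at 100% for the whole run, so the actual kill time falls somewhere between this estimate and the full 10 seconds.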

JonJagger commented 7 years ago

Ok. That makes sense. I'm happy to make a change to the ulimit in the runners. What's your opinion on a sensible hard-limit value? 30? 40? Is there any value in increasing the soft-limit value too? Cheers Jon

jelmerk commented 7 years ago

It's hard to say what a sensible hard limit would be, because it depends on so many factors: how many cores the host machine has, how parallel the workload is, etc.

So arguably it could be some sort of configuration parameter with a sensible default; for me 40 worked on a 4-core OpenStack VM. Or one could argue that omitting it altogether might also make sense since, as you pointed out, the process is killed from the outside after 10 seconds anyway.
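As a rough sketch of the configuration-parameter idea, something like the following Ruby could build the docker argument. The environment variable name `CPU_ULIMIT_SECONDS`, the helper name, and the convention that 0 means "omit the limit" are all assumptions made for illustration; none of them come from the runner.

```ruby
# Hypothetical helper: builds the optional '--ulimit cpu=...' docker argument.
# CPU_ULIMIT_SECONDS is an invented variable name, and treating 0 as
# "no CPU ulimit" is an assumption of this sketch.
def cpu_ulimit_args(env = ENV, default_seconds = 40)
  raw = env['CPU_ULIMIT_SECONDS']
  seconds = raw.nil? || raw.strip.empty? ? default_seconds : Integer(raw.strip, 10)
  raise ArgumentError, 'CPU ulimit must not be negative' if seconds.negative?
  return [] if seconds.zero? # rely on the outside wall-clock limit only

  # Same soft and hard limit, matching the 'cpu=10:10' form used above.
  ["--ulimit cpu=#{seconds}:#{seconds}"]
end

puts cpu_ulimit_args({}).inspect
puts cpu_ulimit_args({ 'CPU_ULIMIT_SECONDS' => '0' }).inspect
```

Returning an empty array for the "omit" case lets the caller splat the result into the rest of the docker arguments without a special case.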

JonJagger commented 7 years ago

Agreed. I've removed the cpu-ulimit from all 3 runners.