Defimatt opened this issue 8 years ago
I think the results would be quite different from the real world, since the current bot behaves quite differently from normal players. For example, it is not very aggressive in hunting for the food trails left by dead snakes, it doesn't encircle other snakes, and if one bot meets another they will probably just avoid each other and go their separate ways.
@MattDuffin Great idea! Go for it! :+1:
@dingyangheh Agreed, however we could create an arena without any other bots: just one human player and the bot under test. The human player would then be able to probe the bot's behaviours. I'm 100% into this. Could you do it @MattDuffin?
I have tested the server and it's really not usable yet for testing or analysing any bot script's behaviour.
Are any others in a stable condition?
I'm not sure we need to host our own server in order to test bots. What's wrong with testing on the real servers? I suppose there would be some variance... You could get around that by running more tests and choosing the median score, or something. Anyone here familiar with statistics?
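For what it's worth, the median is trivial to compute. A minimal sketch in plain JavaScript (the same language as the bot itself; this helper is illustrative, not part of any existing harness):

```javascript
// Median of repeated runs is more robust to outlier lives than the mean:
// one lucky 20,000-length life won't drag the summary score around.
function median(scores) {
  const s = [...scores].sort((a, b) => a - b); // copy, then numeric sort
  const mid = Math.floor(s.length / 2);
  // Odd count: middle element; even count: average of the two middle ones.
  return s.length % 2 ? s[mid] : (s[mid - 1] + s[mid]) / 2;
}
```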
@j-c-m has been doing manual performance testing; he may be able to provide suggestions.
@ChadSki We need a consistent environment so we can see the real difference. We can only do that with a custom server...
Well, if others want to do the work to setup a custom server, then okay.
It's possible to measure differences even in a relatively inconsistent environment, as long as you use statistics correctly. Also if we're doing a head-to-head comparison of two bots, they can just be tested at the same time, which would make everything pretty consistent server-wise.
@FliiFe, @ChadSki the reason I suggested having a server with a consistent setup, and with opponents that act deterministically is that the test results would be directly comparable. Imagine you're trying to work out the speed of light, and the universe had different laws of physics each time you ran the experiment... the results you got would not be comparable and you wouldn't be able to draw any inferences from them.
I agree that testing against human opponents would be useful @FliiFe, but as @ChadSki said, you'd have to be very careful with your application of statistical methods, which is beyond my abilities. If anyone else is able to supply the relevant statistical formulae, I would be able to code a test harness that would apply them to two bot instances.
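For illustration, here is a minimal sketch of one candidate formula such a harness could apply: Welch's t-statistic, which compares two samples without assuming equal variances. Choosing Welch's test is my assumption, not something settled in this thread, and the function names are invented:

```javascript
// Arithmetic mean of an array of per-life scores.
function mean(xs) {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

// Sample variance (divides by n - 1).
function variance(xs) {
  const m = mean(xs);
  return xs.reduce((a, x) => a + (x - m) ** 2, 0) / (xs.length - 1);
}

// Welch's t-statistic for two score samples. A large |t| suggests a real
// difference between Bot A and Bot B rather than random noise; the exact
// significance threshold would come from the t-distribution.
function welchT(scoresA, scoresB) {
  const se = Math.sqrt(variance(scoresA) / scoresA.length +
                       variance(scoresB) / scoresB.length);
  return (mean(scoresA) - mean(scoresB)) / se;
}
```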
@Salvationdk thanks for testing out the server, it's a shame it's not ready yet.
Sure. If you test Bot A versus Bot B on Saturday, May 28th, 2016 it's going to be different than results on Sunday, June 5th. But the A-B comparison is viable because you ran them at the same time, on the same server.
I certainly do not recommend comparing the score of Bot A from Saturday with the score of Bot B on Sunday.
It's also important that the bots are tested against human players, since that's how the bot will actually be used on the official Slither.io servers.
Running on a private server would make the scores more comparable across time, but it's liable to be a less accurate measure of how well a bot actually does on the official servers.
I don't think it's important that we get an objective bot score, as long as we can simply test incoming changes to the AI and make sure the bot performs better.
@ChadSki ok, I'm pretty convinced by that argument. I do however feel that we need input from an expert on statistics to make sure we know how many tests to run, and how to interpret results correctly to make sure we're not saying branch A is better than branch B due to an anomaly rather than a true improvement in performance.
So we need to run different versions of the bot at the same time on the same server against the same human players, while recording different factors?
@ErmiyaEskandary that's the gist of it!
Ah - so if we get this up and running, we would always compare the current bot with another instance of the bot running with modifications, seeing if it makes a difference...
Related to #129
@ErmiyaEskandary correct. As long as we have input from someone familiar with statistics, we should then be able to determine which is better. We'd need to do some statistical analysis because the two bots won't be starting at exactly the same part of the arena and so will be subject to different environmental conditions.
(Similar to how two identical twins can have almost exactly the same life experience and environmental conditions, but small differences in their experiences send them off on tangents that result in them becoming very different people.)
Needs #139 so it can be a fair comparison...
I agree it depends on #139 and yes it's similar to #129
I do however feel that we need input from an expert on statistics to make sure we know how many tests to run
One of my friends is a psychologist, and statistics is very important in his field -- it's expensive to run tests involving humans, so they need to be very sure to get a statistically significant result from as few people as possible.
On the other hand, from what I've seen in computer science... most people just run boatloads of trials and hope that the math works out. :laughing:
I would think each bot should get at least 50 lives (in another thread I saw j-c-m ran 70 lives each). More is better... we could shoot for 250+ lives but that might take too long. We could probably speed up the trials by running a few tabs in parallel. Perhaps five tabs each with 50 lives?
It might be fun to compare identical bots, just to see if the test can tell that their performance is supposed to be identical. That'd be a good sanity check for whichever method of testing we choose.
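A rough sketch of such an A/A sanity check, done permutation-style: take one bot's scores, split them randomly in half many times, and confirm the comparison rarely declares a "difference". The threshold is a free parameter here, purely illustrative:

```javascript
function mean(xs) {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

// Repeatedly split one bot's scores into two random halves and count how
// often the mean difference exceeds `threshold`. If the methodology is
// sound, an A/A comparison like this should almost never fire.
function aaFalsePositiveRate(scores, threshold, trials = 1000) {
  let hits = 0;
  for (let t = 0; t < trials; t++) {
    // Crude shuffle; good enough for a sanity check, not for cryptography.
    const pool = [...scores].sort(() => Math.random() - 0.5);
    const half = Math.floor(pool.length / 2);
    const a = pool.slice(0, half);
    const b = pool.slice(half);
    if (Math.abs(mean(a) - mean(b)) > threshold) hits++;
  }
  return hits / trials; // fraction of splits flagged as "different"
}
```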
:+1:
We can just run 30 lives for the time being - we can get a statistics guy to do the proper analysis later... lol
@ChadSki I like the idea of running two versions of the same bot against each other as a check on the test methodology, kind of like how the scientists running the gravitational wave detectors fed fake results into the input stream to see if the operators were paying attention...
What will we monitor? Just the length and time?
@ErmiyaEskandary just length upon death?
I don't think there's any point measuring rank, since the number of participants on a server is arbitrary, so achieving rank 18 on day 1 isn't the same as achieving rank 18 on day 2 because there are a different number of people in the sessions.
Yeah - we will limit the number of deaths and then compare the scores. Seems like the simplest way. We could also get the average score within a specific timeframe.
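A hypothetical recorder for that scheme (the event hook and field names are invented; the real bot would wire this into its own death/respawn handling):

```javascript
// Per-life results for one test run.
const results = [];
const MAX_LIVES = 30; // fixed number of lives per run, as suggested above

// Call once each time the bot dies. Returns null while the run is still in
// progress, and the average final length once all lives have been used.
function onDeath(finalLength, survivalMs) {
  results.push({ length: finalLength, survivalMs });
  if (results.length < MAX_LIVES) return null;
  return results.reduce((a, r) => a + r.length, 0) / results.length;
}
```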
Monitoring rank would be useless - the rank changes constantly based on the server and tells us nothing.
Precisely, but I've seen a couple of bots (specifically the DNA-based bot mentioned in one of the discussions here) that take rank into consideration, so I wanted to make sure we rule it out from the outset.
:+1:
The DNA bot was somewhat okay, but not really as good as this bot, in my opinion. I've tested a few and none so far comes close to this version.
The bot can be run with very minimal graphics by commenting out original_redraw() from the loop and turning off visual debugging. Although nothing will draw in the browser now other than the overlay, the bot plays fine. I have 4 instances running right now on an Azure D2v2 node.
When testing keep in mind the tab must remain in the foreground or chrome will slow it down to 1 update per second.
From left to right: 1.2.6 - mainline bot; 1.9.0 - j-c-m bot; 1.2.1 - @kmatheussen bot; 1.9.0 - j-c-m bot with 1.5 fullheadcircle (red).
Thanks @j-c-m, that's useful to know. Will keep it in mind when developing the feature.
Could everyone look back at #208 please? I think if the project grows this much, we should definitely make an organization, to keep peripheral projects apart from the main bot.
peripheral projects

@FliiFe?
Oh, I forgot this part... If we do this, we should make it a separate repo, which has everything set up to run loads of tests on the bot, and a ready-to-go server+bot implementation you can connect to.
If the explanation wasn't clear, don't worry about it. The main idea is that this whole issue would make a good "peripheral project".
@FliiFe Either that or it could be in a tests folder of the project.
tab must remain in the foreground or chrome will slow it down to 1 update per second.
Good to know.
After reading through #260, #259, #258, #234, #225, #210, #175, #147, #129, #120 and #78, all of which suggest improvements to the bot's behaviour, I realised we need some way of objectively measuring changes to the bot, and how they affect its performance.
What if we created a test arena (possibly using https://github.com/iiegor/slither), populated it with a number of bots of various flavours, and then set our bot running on it for a number of test runs?
That way, we can take an average of the scores from the test runs and determine whether changes from a new version are positive, and by how much before making them public.
Because the initial conditions of the server will be the same each time (other snake behaviour, number and position of snakes, food spawning pattern and behaviour [by controlling the random number generator of the server]) this will be a fair and comparable test.
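Controlling the server's random number generator would mean swapping Math.random() for a seedable PRNG. One common tiny option is mulberry32; this is an assumption about how a test server could be patched, not something iiegor/slither actually provides:

```javascript
// mulberry32: a small, fast seedable PRNG. Two generators created with the
// same seed yield the same sequence, so food and snake spawns would repeat
// exactly between test runs.
function mulberry32(seed) {
  return function () {
    seed |= 0;
    seed = (seed + 0x6D2B79F5) | 0;
    let t = Math.imul(seed ^ (seed >>> 15), 1 | seed);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    // Uniform float in [0, 1), same contract as Math.random().
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}
```

The server would then call its generator wherever it currently calls Math.random(), with the seed logged alongside each test run so any run can be replayed.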
The only downside is that there are no human players present, so we won't be able to test the bot's ability to cope with non-deterministic behaviour.
If we're interested in doing this, I'm happy to give it a go; it will take me a few days due to work commitments, unless anyone else wants to take it on first.