Closed iocchi closed 4 years ago
@airglow, @awesomebytes, @HideakiNagano21, @justinhart, @mnegretev, and rest of TC: your thoughts, comments and proposals, please. We also need a volunteer to script this.
Thank you, Luca, for the idea.
The only issue that I find with it is that, even though there is a lot of objectivity in your explanation, I'm pretty sure some teams and some judges will find a way to "subjectivize" the test. For example, let's say one of the elements to be demonstrated is navigation: one judge may award full points just because the robot moved, while another may award none because the robot "did not impress him" navigation-wise.
To solve this, I think details should be provided of what it means to "successfully achieve" an element during the test. This will take some writing, since it would need to be done for each element, but it is possible.
Unfortunately, this would drastically change the current state of the rule book, and I believe we should let it "rest" this year to let the new teams catch up.
Having said all that, this idea is definitely doable for 2019.
I would love to hear from the rest of the TC: @airglow, @awesomebytes, @HideakiNagano21, @justinhart, @mnegretev, others?
@balkce What about starting those changes now in a 2019 branch? This way we may have the next rulebook ready right after the competition (or even earlier), giving teams extra time to develop and test, and, more importantly, settling what we are aiming for in the mid-term.
@iocchi?
@kyordhel agreed. I'm not sure I'd be able to write everything myself, but I could work on the outlines.
I created a new branch called "sportbasedscoring" to reflect a possible rule book with @iocchi's scoring ideas.
This is the core of the changes/proposals:
I left the Stage II tests and their scoring as they are right now, since I think the TC has a lot of ideas of where to push the top teams. Finals are unchanged as well.
I adapted the current rule book so as to take advantage of its useful parts (Introduction, General Rules, etc.), and changed only the relevant parts to consider the new changes.
Obviously, everything is up for debate, so discuss away.
No major changes in the rulebook for 2018. Rescheduling for 2019.
Yes. This idea is at least for 2019.
The link to the branch: https://github.com/RoboCupAtHome/RuleBook/tree/sportbasedscoring?files=1
We had a major scoring system change in going from incremental scoring to main goal scoring. I don't think we will have another major change in the foreseeable future.
Dear all, another great RoboCup@Home this year, with a lot of great novelties (new leagues and new teams) and perfect organization (thanks Execs/TC/OC and LOC). Still, we can improve, and I think it is now time to discuss the rules and the scoring system to solve some problems that we have noticed in the last years but have not solved yet.
The main problems in my opinion are:
I think these problems can be addressed with a new structure of the tests and a new scoring system. The current scoring system was introduced in 2008 to replace the previous one, which assigned only boolean scores to tests. By introducing scores for intermediate steps we could better evaluate the performance of teams. However, this added complexity to writing the rules and evaluating the tests (scoresheets longer than a page) that has grown considerably over the years.
On the flight back from Nagoya, I was reading how some sport disciplines that combine different elements are evaluated. For example, the evaluation of gymnastics is based on a Code of Points: a Table of Elements describing what is expected in the exercises, and a mechanism to compute the score of a performance. See https://en.wikipedia.org/wiki/Code_of_Points_(artistic_gymnastics)
I think RoboCup@Home has a similar concept: we want to integrate basic functionalities in different ways and reward the solution that better integrates many complex abilities.
Since the flight was long enough, I tried to think about how to adapt a scoring system like this to @Home, and I put down an idea that is summarized below.
RoboCup@Home score system based on sports (e.g., gymnastics)
Key concepts: a set of elements (or skills) will be defined a priori by the TC. Each element is the description of a specific ability that the robot has to demonstrate for a given category/functionality. Each element is associated with a difficulty value. Each element receives a boolean evaluation during a test (either successfully achieved or not).
Tests: unlike the current rules, tests are prepared by the teams and must include some of the elements specified in the rules. These elements are the only way to score. The elements to include, their order in the test, and the way in which they are combined are decided by the teams. The test can include any other activity/task (choreography) that will not provide score but may be useful to connect the different elements. Teams should inform the referees before a test about which elements the robot will perform, to make refereeing easier.
Score: the score of a test will be the sum of two factors: Difficulty Score + Execution Score.
The D-Score is the sum of three difficulty values (chosen by the team) among all the elements that are successfully achieved during the test. Only one element per category/functionality can be chosen for the score, so to obtain a full score a test must successfully combine at least 3 different functionalities. Teams may decide which elements to include in the evaluation of a test to minimize the penalty for repeating an element multiple times (see E-Score). The choice is made at the end of each test and cannot be changed later on. If fewer than 3 elements from different categories are achieved, the D-Score will consider only those values and a penalty is applied in the E-Score.
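To make the D-Score selection concrete, here is a minimal sketch of the rule above: among the achieved elements, at most one per category counts, and the 3 highest-valued categories contribute. The function name and integer difficulty values are illustrative assumptions, not from the rulebook.

```python
# Sketch (illustrative, not from the rulebook): compute the D-Score
# from the elements successfully achieved during a test.

def d_score(achieved):
    """achieved: list of (category, difficulty) pairs for elements
    successfully completed during the test. At most one element per
    category counts, and at most 3 categories contribute."""
    best_per_category = {}
    for category, difficulty in achieved:
        # Keep only the hardest achieved element of each category.
        if difficulty > best_per_category.get(category, 0):
            best_per_category[category] = difficulty
    # Sum the 3 highest-valued categories (fewer if fewer were achieved).
    return sum(sorted(best_per_category.values(), reverse=True)[:3])

# Navigation shown twice (4 and 6), manipulation (5), speech (3),
# vision (2): only the best navigation counts, and only the top 3
# categories are summed -> 6 + 5 + 3 = 14
print(d_score([("nav", 4), ("nav", 6), ("manip", 5),
               ("speech", 3), ("vision", 2)]))  # 14
```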
The E-Score is an objective evaluation of the execution of the test, ranging from 0 to 10. The E-Score is computed as 10 - Penalties (with a minimum value of 0). Penalties include, in particular, missing elements (i.e., fewer than 3 elements from different categories achieved in the test) and guarantee the following maximum scores:

- Test with >= 3 elements -> max score = D1 + D2 + D3 + 10
- Test with 2 elements -> max score = D1 + D2 + 6
- Test with 1 element -> max score = D1 + 3
- Test with 0 elements -> score = 0
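The missing-element penalties can be derived from the maximum scores above (penalty 4 for one missing category, 7 for two). A minimal sketch of the total score under that assumption; the penalty table and function name are mine, not from the proposal:

```python
# Sketch (assumed penalty values, derived from the maximum scores
# above): total score = D-Score + E-Score, where the E-Score starts
# at 10 and loses points for missing elements and other penalties.

MISSING_ELEMENT_PENALTY = {0: 0, 1: 4, 2: 7}  # categories missing out of 3

def total_score(difficulties, other_penalties=0):
    """difficulties: D-values of the (up to 3) counted elements from
    distinct categories; other_penalties: further execution penalties."""
    if not difficulties:
        return 0  # a test with no achieved elements scores zero
    missing = 3 - min(len(difficulties), 3)
    e_score = max(0, 10 - MISSING_ELEMENT_PENALTY[missing] - other_penalties)
    return sum(difficulties) + e_score

print(total_score([6, 5, 3]))  # 3 elements -> 14 + 10 = 24
print(total_score([6, 5]))     # 2 elements -> 11 + 6  = 17
print(total_score([6]))        # 1 element  ->  6 + 3  = 9
print(total_score([]))         # 0 elements -> 0
```

Note that the E-Score floors at 0, so further penalties can never push the total below the D-Score.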
More specifically, penalties may be defined as follows:
Organization of the competition
Each team will have slots of 10 minutes to perform tests within this schema.
To obtain the best score, each test has to integrate 3 different functionalities, and different tests have to cover different abilities. So the best team will perform different tests, combining complex abilities in different ways.
Of course, this can be done only for a few tests (e.g., only in Stage I, only on Day 1, etc.). I think this concept should in any case replace the Open Challenge and also be adopted in the Finals.
Advantages
Teams: they can freely choose what to show, it would always be possible to achieve some score, and there would be no reason to skip a test. All the team members will be happy to see their component running at some point. Teams can freely choose to repeat elements multiple times (with a small penalty). Teams can put many elements in a test, try all of them, and decide after the test which of the successfully completed ones will count toward the score.
TC: the rulebook is simplified; there is no need to specify full tasks or to make choices and setups before the tests. It is the teams' responsibility to ask referees to perform actions (e.g., randomize positions, objects, sentences, etc.) in order to score.
Referees: they do not need to learn the specifications of several tests. During the test, they just check whether an element is achieved or not (boolean); there is no subjective evaluation. The score of a test is easily computed automatically from the referees' boolean evaluations.
Audience: they see something running, with a moderator explaining what is going on and the possibility of building stories around the tests.
Scientific community: tests with no effort in solving @Home tasks (e.g., fully scripted open-loop demos) will score zero. It becomes possible to compare the performance of teams on each element and on an entire category, and easy to determine the "best-in-class" in the different functionalities. The specification of elements will be driven by problems actually studied in the scientific literature, hopefully attracting more attention from researchers who can find the problem they are working on in this list.