Closed iocchi closed 4 years ago
@airglow, @awesomebytes, @HideakiNagano21, @justinhart, @mnegretev, and rest of TC: your thoughts, comments and proposals, please. We also need a volunteer to script this.
Thank you, Luca, for the idea.
The only issue that I find with it is that, even though there is a lot of objectivity in your explanation, I'm pretty sure some teams and some judges will find a way to "subjectivize" the test. For example, let's say one of the elements to be demonstrated is navigation: one judge may award full points just because the robot moved, while another may award none because the robot "did not impress him" navigation-wise.
To solve this, I think details should be provided of what it means to "successfully achieve" an element during the test. This will take some writing, since it would need to be done for each element, but it is possible.
Unfortunately, this would drastically change the current state of the rule book, and I believe we should let it "rest" this year to let the new teams catch up.
Having said all that, this idea is definitely doable for 2019.
I would love to hear from the rest of the TC: @airglow, @awesomebytes, @HideakiNagano21, @justinhart, @mnegretev, others?
@balkce What about starting those changes now in a 2019 branch? This way we may have the next rulebook ready right after the competition (or even earlier), giving teams extra time to develop and test, and, more importantly, settling what we are aiming for in the mid-term.
@iocchi?
@kyordhel agreed. I'm not sure I'd be able to write everything myself, but I could work on the outlines.
I created a new branch called "sportbasedscoring" to reflect a possible rule book with @iocchi's scoring ideas.
This is the core of the changes/proposals:
I left the Stage II tests and their scoring as they are right now, since I think the TC has a lot of ideas of where to push the top teams. Finals are unchanged as well.
I adapted the current rule book so as to take advantage of its useful parts (Introduction, General Rules, etc.), and changed only the relevant parts to consider the new changes.
Obviously, everything is up for debate, so discuss away.
No major changes in the rulebook for 2018. Rescheduling for 2019.
Yes. This idea is at least for 2019.
The link to the branch: https://github.com/RoboCupAtHome/RuleBook/tree/sportbasedscoring?files=1
We had a major scoring system change in going from incremental scoring to main goal scoring. I don't think we will have another major change in the foreseeable future.
Dear all, another great RoboCup@Home this year, with a lot of great novelties (new leagues and new teams) and perfect organization (thanks Execs/TC/OC and LOC). Still, we can improve, and I think it is now time to discuss the rules and the scoring system to solve some problems that we have noticed in the last years but have not solved yet.
The main problems in my opinion are:
I think these problems can be addressed with a new structure of the tests and a new scoring system. The current scoring system was introduced in 2008 to replace the previous one, which assigned only boolean scores to tests. By introducing scores for intermediate steps we could better evaluate the performance of teams. However, this added complexity to writing the rules and evaluating the tests (scoresheets longer than a page) that has grown considerably over the years.
On the flight back from Nagoya, I was reading how some sport disciplines that combine different elements are evaluated. For example, the evaluation of gymnastics is based on a Code of Points: a Table of Elements describing what is expected in the exercises, and a mechanism to compute the score of a performance. See https://en.wikipedia.org/wiki/Code_of_Points_(artistic_gymnastics)
I think RoboCup@Home has a similar concept: we want to integrate basic functionalities in different ways and reward the solution that better integrates many complex abilities.
Since the flight was long enough, I tried to think about how to adapt a scoring system like this to @Home, and I put down an idea that is summarized below.
RoboCup@Home score system based on sports (e.g., gymnastics)
Key concepts: a set of elements (or skills) will be defined a priori by the TC. Each element is the description of a specific ability that the robot has to demonstrate for a given category/functionality. Each element is associated with a difficulty value. Each element receives a boolean evaluation during a test (either successfully achieved or not).
Tests: unlike the current rules, tests are prepared by the teams and must include some of the elements specified in the rules. These elements are the only way to score. The elements to include, their order in the test, and the way in which they are combined are decided by the teams. The test can include any other activity/task (choreography) that will not provide score but may be useful to connect the different elements. Teams should inform the referees before a test about which elements the robot will perform, to make refereeing easier.
Score: the score of a test will be the sum of two factors: Difficulty Score + Execution Score.
The D-Score is the sum of three difficulty values (chosen by the team) among all the elements that are successfully achieved during the test. Only one element per category/functionality can be chosen for the score, so to obtain a full score a test must successfully combine at least 3 different functionalities. Teams may decide which elements to include in the evaluation of a test to minimize the penalty for repeating an element multiple times (see E-Score). The choice is made at the end of each test and cannot be changed later on. If fewer than 3 elements from different categories are achieved, the D-Score will consider only those values and a penalty is applied in the E-Score.
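To make the D-Score selection concrete, here is a minimal sketch of the rule above: among the achieved elements, at most one per category counts, and the 3 highest-valued categories contribute. The function name and integer difficulty values are illustrative assumptions, not from the rulebook.

```python
# Sketch (illustrative, not from the rulebook): compute the D-Score
# from the elements successfully achieved during a test.

def d_score(achieved):
    """achieved: list of (category, difficulty) pairs for elements
    successfully completed during the test. At most one element per
    category counts, and at most 3 categories contribute."""
    best_per_category = {}
    for category, difficulty in achieved:
        # Keep only the hardest achieved element of each category.
        if difficulty > best_per_category.get(category, 0):
            best_per_category[category] = difficulty
    # Sum the 3 highest-valued categories (fewer if fewer were achieved).
    return sum(sorted(best_per_category.values(), reverse=True)[:3])

# Navigation shown twice (4 and 6), manipulation (5), speech (3),
# vision (2): only the best navigation counts, and only the top 3
# categories are summed -> 6 + 5 + 3 = 14
print(d_score([("nav", 4), ("nav", 6), ("manip", 5),
               ("speech", 3), ("vision", 2)]))  # 14
```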
The E-Score is an objective evaluation of the execution of the test, ranging from 0 to 10. The E-Score is computed as 10 - Penalties (with a minimum value of 0). Penalties include, in particular, missing elements (i.e., fewer than 3 elements from different categories achieved in the test) and guarantee the following maximum scores:

- Test with >= 3 elements -> max score = D1 + D2 + D3 + 10
- Test with 2 elements -> max score = D1 + D2 + 6
- Test with 1 element -> max score = D1 + 3
- Test with 0 elements -> score = 0
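The missing-element penalties can be derived from the maximum scores above (penalty 4 for one missing category, 7 for two). A minimal sketch of the total score under that assumption; the penalty table and function name are mine, not from the proposal:

```python
# Sketch (assumed penalty values, derived from the maximum scores
# above): total score = D-Score + E-Score, where the E-Score starts
# at 10 and loses points for missing elements and other penalties.

MISSING_ELEMENT_PENALTY = {0: 0, 1: 4, 2: 7}  # categories missing out of 3

def total_score(difficulties, other_penalties=0):
    """difficulties: D-values of the (up to 3) counted elements from
    distinct categories; other_penalties: further execution penalties."""
    if not difficulties:
        return 0  # a test with no achieved elements scores zero
    missing = 3 - min(len(difficulties), 3)
    e_score = max(0, 10 - MISSING_ELEMENT_PENALTY[missing] - other_penalties)
    return sum(difficulties) + e_score

print(total_score([6, 5, 3]))  # 3 elements -> 14 + 10 = 24
print(total_score([6, 5]))     # 2 elements -> 11 + 6  = 17
print(total_score([6]))        # 1 element  ->  6 + 3  = 9
print(total_score([]))         # 0 elements -> 0
```

Note that the E-Score floors at 0, so further penalties can never push the total below the D-Score.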
More specifically, penalties may be defined as follows:
Organization of the competition
Each team will have slots of 10 minutes to perform tests within this schema.
To obtain the best score, each test has to integrate 3 different functionalities, and different tests have to cover different abilities. So the best team will perform different tests, combining complex abilities in different ways.
Of course, this can be done only for a few tests (e.g., only in Stage I, only on Day 1, etc.). I think this concept should in any case replace the Open Challenge and also be adopted in the Finals.
Advantages
Teams: they can freely choose what to show, it would always be possible to achieve some score, and there would be no reason to skip a test. All the team members will be happy to see their component running at some point. Teams can freely choose to repeat elements multiple times (with a small penalty). Teams can put many elements in a test, try all of them, and decide after the test which of the successfully completed ones will count toward the score.
TC: the rulebook is simplified; there is no need to specify full tasks or to make choices and setups before the tests. It is the teams' responsibility to ask referees to perform actions (e.g., randomize positions, objects, sentences, etc.) in order to score.
Referees: they do not need to learn the specifications of several tests. During the test, they just check whether an element is achieved or not (boolean); there is no subjective evaluation. The score of a test is easily computed automatically from the referees' boolean evaluations.
Audience: they see something running, with a moderator explaining what is going on and the possibility of building stories around the tests.
Scientific community: tests with no effort in solving @Home tasks (e.g., fully scripted open-loop demos) will score zero. It becomes possible to compare the performance of teams on each element and on an entire category, and easy to determine the "best-in-class" in the different functionalities. The specification of elements will be driven by problems actually studied in the scientific literature, hopefully attracting more attention from researchers who can find the problem they are working on in this list.