RoboCupAtHome / RuleBook

Rulebook for RoboCup @Home 2024
https://robocupathome.github.io/RuleBook/

Deadline, Milestones, and Features #361

Closed kyordhel closed 6 years ago

kyordhel commented 7 years ago

Question in a nutshell

How much will the rulebook change this year?

Issue in a nutshell

First, define the deadline for the first draft (e.g. October 15, deadline for SSPL CFP).

If the rulebook is going to be changed and not just buffed, define

  1. Which features are important to test:
     1. Where will @Home direct research?
     2. Which tasks are relevant, and when do they need to be solved?
     3. What are the milestones (sub-goals) needed to have these tasks solved?
  2. Which features or abilities are required to achieve the aforementioned goals
  3. Which tests will change
  4. To what extent these tests will change

Introduction

Right after the RoboCup fever, we got a lot of ideas, many of them very nice, but they also involve MAJOR changes to the rulebook. The transition from 2016 to 2017 introduced several major changes, of which the least relevant was the inclusion of DSPL and SSPL. Other major changes were the fusion of the Speech Recognition and Audio test with the Person Recognition test, the merging of the Following & Guiding test and the Navigation test into the Help-me-carry test, the removal of the training and guiding phase in the Restaurant test, and the inclusion of the Set a table and clean it up test.

In the past, the rule was to introduce major changes to the rulebook every two years, but major changes have been introduced every year since 2015, so the Executive Committee and the Trustees advised keeping the rulebook as is and introducing only minor changes (buffs), to give teams time to learn, prepare, and outmatch the tests in 2018.

However, it seems to me that keeping the same rulebook for 2018 does not necessarily match the spirit of all the changes suggested by the new TC members, some Execs, and even Trustees. Therefore, this decision must be taken once and for all, with a strong commitment from the TC to have all changes ready before the deadline.

Deadlines

At the time this issue was created, the Call for Participation for the Social Standard Platform League had been issued, with the deadline for submission of qualification materials scheduled for October 15, 2017. The idea is that the other two leagues follow suit. The intention, in a nutshell, is that qualified teams are announced before the end of the year, making it easier for teams to get funds and confirm their participation several months before the competition, while at the same time opening the opportunity for a second CFP if some qualified teams can't attend the competition. In addition, and considering the large number of applicants in 2017 (more than twice the capacity for OPL), a video demonstrating a Stage 2 test might be required for qualification, to ensure only well-prepared teams qualify. This shortens the time for releasing the rulebook.

As a consequence, an early deadline for the first draft must be set, as early as mid-October 2017. This requires strong commitment from the TC due to the necessary extra effort.

The following deadlines have been pre-set, and must be discussed and agreed upon:

It's a very tight schedule, yet doable.

Milestones and Features

During a meeting in Nagoya, the EC and TC agreed that several milestones must be set, aiming to push robots and research toward actually solving very specific tasks such as: taking clothes out of the washing machine and folding them (laundry), reminding an elder of their daily dose of insulin (nursing), making a sandwich (acting as butler), giving a tour in a museum (tour guide), etc.

Such tasks involve specific abilities such as folding deformable objects, dealing with wet objects, manipulating tools, reacting to natural language queries, finding people, keeping track of the events and activities of a person, etc. None of them are solvable yet, and robots are not even close to achieving such feats, so milestones must be set and the tests must be designed accordingly, with plans for what to change and when, and for what to do if the tests are not being solved.

Suggested work plan

The suggested work plan is as follows.

Related issues:

#362, #356, #353, #352, #351, #347, #308, #261, #179

LoyVanBeek commented 7 years ago

Before RoboCup, I had already been thinking and writing about this. My personal views are written at https://github.com/LoyVanBeek/RoboCupAtHome-roadmap.

LoyVanBeek commented 7 years ago

As for success, I don't have the full score sheets, but I do have the final scores. With those, I did some very basic statistics.

OPL stage 1

|                | Poster | Speech & Person | Storing Groceries | Help Me Carry | GPSR   | Stage 1 |
|----------------|--------|-----------------|-------------------|---------------|--------|---------|
| max attainable | 50.00  | 200.00          | 250.00            | 200.00        | 250.00 | 950.00  |
| MODE           | 36.67  | 0.00            | 0.00              | 0.00          | 0.00   | #N/A    |
| MAX            | 40.00  | 145.00          | 22.50             | 70.00         | 70.00  | 249.00  |
| MIN            | 27.78  | 0.00            | 0.00              | 0.00          | 0.00   | 29.44   |
| MEDIAN         | 32.78  | 35.00           | 5.00              | 25.00         | 10.00  | 104.78  |
| AVERAGE        | 33.54  | 53.33           | 4.83              | 25.60         | 13.73  | 131.04  |
| STDEV          | 3.52   | 53.54           | 6.51              | 24.69         | 20.29  | 81.18   |

OPL stage 2

|                | Open Challenge | Set a table / tidy up | Restaurant | EE-GPSR | Stage 2 |
|----------------|----------------|-----------------------|------------|---------|---------|
| max attainable | 250            | 390                   | 285        | 250     | 2125    |
| MODE           | #N/A           | 10.00                 | #N/A       | 0.00    | #N/A    |
| MAX            | 194.54         | 20.00                 | 115.00     | 65.00   | 621.04  |
| MIN            | 62.96          | 0.00                  | -50.00     | 0.00    | 191.63  |
| MEDIAN         | 146.76         | 10.00                 | 27.50      | 32.50   | 392.01  |
| AVERAGE        | 132.41         | 11.25                 | 42.81      | 28.75   | 407.80  |
| STDEV          | 54.56          | 6.41                  | 58.85      | 24.75   | 176.20  |

SSPL Stage 1

|                | Speech & Person | Cocktail Party | Help Me Carry | GPSR  | Stage 1 |
|----------------|-----------------|----------------|---------------|-------|---------|
| max attainable | 200             |                | 200           | 250   | 700     |
| MODE           | #N/A            | 0.00           | 0.00          | 0.00  | #N/A    |
| MAX            | 117.50          | 30.00          | 10.00         | 42.50 | 245.00  |
| MIN            | 17.50           | 0.00           | 0.00          | 0.00  | 66.25   |
| MEDIAN         | 50.00           | 7.50           | 0.00          | 0.00  | 91.67   |
| AVERAGE        | 58.64           | 10.71          | 2.14          | 9.64  | 116.08  |
| STDEV          | 33.57           | 12.97          | 3.93          | 15.91 | 65.70   |

SSPL stage 2

|                | Open Challenge | Tour guide | Restaurant | EE-GPSR | Stage 2 |
|----------------|----------------|------------|------------|---------|---------|
| max attainable | 250.00         |            | 285.00     | 250.00  | 1485.00 |
| MODE           | #N/A           | 0.00       | #N/A       | 0.00    | #N/A    |
| MAX            | 178.47         | 95.00      | 40.00      | 70.00   | 628.47  |
| MIN            | 121.53         | 0.00       | 0.00       | 0.00    | 243.47  |
| MEDIAN         | 133.68         | 0.00       | 12.50      | 10.00   | 271.29  |
| AVERAGE        | 141.84         | 23.75      | 16.25      | 22.50   | 353.63  |
| STDEV          | 25.21          | 47.50      | 17.02      | 33.04   | 184.05  |

DSPL Stage 1

|                | Poster | Speech & Person | Storing Groceries | Help Me Carry | GPSR   | Stage 1 |
|----------------|--------|-----------------|-------------------|---------------|--------|---------|
| max attainable | 50.00  | 200.00          | 250.00            | 200.00        | 250.00 | 950.00  |
| MODE           | 37.86  | 0.00            | 0.00              | 0.00          | 0.00   | #N/A    |
| MAX            | 41.43  | 101.00          | 30.00             | 30.00         | 25.50  | 209.36  |
| MIN            | 31.79  | 0.00            | 0.00              | 0.00          | 0.00   | 36.07   |
| MEDIAN         | 37.50  | 2.50            | 0.00              | 0.00          | 0.00   | 51.25   |
| AVERAGE        | 37.00  | 24.70           | 6.75              | 4.75          | 5.35   | 78.55   |
| STDEV          | 2.56   | 36.72           | 10.14             | 10.44         | 8.66   | 55.19   |

DSPL Stage 2

|                | Open Challenge | Set a table / tidy up | Restaurant | EE-GPSR | Stage 2 |
|----------------|----------------|-----------------------|------------|---------|---------|
| max attainable | 250.00         | 390.00                | 285.00     | 250.00  | 2125.00 |
| MODE           | #N/A           | 0.00                  | 0.00       | 0.00    | #N/A    |
| MAX            | 183.68         | 10.00                 | 90.00      | 55.00   | 524.08  |
| MIN            | 0.00           | 0.00                  | 0.00       | 0.00    | 107.29  |
| MEDIAN         | 139.58         | 0.00                  | 5.00       | 10.00   | 257.73  |
| AVERAGE        | 114.51         | 4.00                  | 20.00      | 20.00   | 275.33  |
| STDEV          | 72.79          | 5.48                  | 39.21      | 24.24   | 165.85  |
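
For reference, these are plain descriptive statistics over the per-test scores. A minimal sketch of how the same columns could be computed (the score list below is hypothetical, not real data):

```python
import statistics

def describe(scores):
    """Descriptive statistics matching the rows in the tables above."""
    try:
        mode = statistics.mode(scores)
    except statistics.StatisticsError:
        mode = "#N/A"  # before Python 3.8, mode() raises if there is no unique mode
    return {
        "MODE": mode,
        "MAX": max(scores),
        "MIN": min(scores),
        "MEDIAN": statistics.median(scores),
        "AVERAGE": round(statistics.mean(scores), 2),
        "STDEV": round(statistics.stdev(scores), 2),  # sample standard deviation
    }

# Hypothetical per-team scores for a single test:
print(describe([0.0, 0.0, 5.0, 35.0, 70.0, 145.0]))
```
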
balkce commented 7 years ago

Thanks, @LoyVanBeek! I've read through all of your material, and I think there is a lot that I agree with.

First off, I agree with your thought that the overall performance going down "is due to the rulebook not being stable". In this regard, I believe we need to make the 2018 rulebook an evolution of 2017's.

However, this conflicts with another issue that @LoyVanBeek has pointed out: "We want to test too much". It's very tempting to fix this issue by reducing the number of tests (I really don't want to take away repetitions), and reducing the number of tests implies test rework, which in itself implies major changes.

I propose to:

  1. Remove some of the tests, and fine-tune the rest. We can decide which to take away based on an agreed-upon roadmap of skills/behaviors we want to have in 2020. @LoyVanBeek's document at https://github.com/LoyVanBeek/RoboCupAtHome-roadmap/blob/master/future.md provides a good starting point. We can "fine-tune" the remaining tests in such a way that they are an initial version of more complex tests in following years (very much like the Follow Me and Restaurant tests).
  2. For the tests that are removed, their tested skills could be moved to GPSR. Meaning, a "theme" could be set for each round of attempts. For example, in the first round, person recognition is tested; in the second, navigation; in the third, speech understanding.
  3. Rework the remaining tests to be GPSR-based: no points are given for ASR; teams can use a QR code or a vizbox to obtain the command. I don't care how the command is given; I just want to see whether the robot is able to do the test or not. Why GPSR-based? Because then different versions of the command (e.g. "get a coke" instead of "get some chips") can be given, so no team will receive the same version of the command. This is to start testing flexibility (another issue @LoyVanBeek pointed out), and it is only a first step. I'm aware that there will still be a lot of inflexible planning that can solve the test, even more so considering that a static variable-based grammar will need to be provided beforehand for all the tests (so there isn't an issue about command understanding); a sketch of such a grammar follows this list. The idea is that in 2020 I want to test only GPSR; I think designing the current tests in this manner is a step forward in that direction.
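
A minimal sketch of what such a variable-based grammar could look like (the productions and vocabulary here are made up for illustration; this is not any official grammar):

```python
import random

# Hypothetical variable-based grammar: every $variable expands to one of
# several alternatives, so each team receives a different surface form of
# the same underlying command.
GRAMMAR = {
    "$command": ["$fetch $object", "bring me $object"],
    "$fetch": ["get", "grab", "fetch"],
    "$object": ["a coke", "some chips", "an apple"],
}

def expand(symbol: str, rng: random.Random) -> str:
    """Recursively expand a grammar symbol into a concrete command."""
    if symbol not in GRAMMAR:
        return symbol  # terminal word: emit as-is
    production = rng.choice(GRAMMAR[symbol])
    return " ".join(expand(token, rng) for token in production.split())

rng = random.Random()  # could be seeded per team for auditability
print(expand("$command", rng))  # e.g. "grab some chips"
```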

Regarding which tests to take away, and which ones to keep and fine-tune:

That is all for now.

LoyVanBeek commented 7 years ago

As for themes: please stick with the current challenges as a theme for GPSR-like runs, to keep the rulebook stable.

Some other suggestions:

kyordhel commented 7 years ago

Long post @balkce! Thank you.

First, I would like to call attention to the fact that, again, discussions are mainly being held between the EC plus @LoyVanBeek, @moriarty, and me. The TC has nine members, and I would like to see them all reach consensus in #217, #308, #347, #352, #356, #358, #359, #361, and #362. I find Luca's and Sven's suggestion worthy.

Second, the rulebook's instability can't be considered the predominant factor for the low performance, at least not in Stage 1, even less if we consider that the average performance is less than 10% of the maximum score. I would rather blame an ill design of the tests, for the following (non-quantitative) reasons:

Third. Some clarifications:

  1. Thematic GPSR is pointless (it's not GPSR anymore), and this is even more evident when HRI during task acquisition (a critical ability in SSPL) is bypassed. It would be better to have mini retro-tests like go-get-it, who-is-who, and follow-me. In fact, what you are suggesting seems to me like rolling back to the tests of 2011.

  2. I don't think it is wise to remove the Speech Recognition test; rather, it should be upgraded. From the beginning the plan was to move from predefined questions to NLP with action planning (the robot explains the plan, it does not execute it); see the sketch after this list. So far, I've seen very few robots with this ability, and those that have it barely use it due to the linearity of the tests (state-machine-like behavior).

  3. Storing Groceries and Help-me-carry: describing a test in terms of unsorted GPSR actions is one thing; giving those actions to the robot is another. The latter is unnatural and goes against #358, where @iocchi suggests breaking the linearity of the tests and just marking goals, which eases scoring. These tests are atomic tasks that require a lot of local planning, and so must be kept; I can't imagine any mother explaining to you how to help your aunt with the groceries.
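
To make point 2 concrete, here is a minimal sketch of "explain the plan, don't execute it" (the lexicon and step templates are hypothetical, purely for illustration):

```python
# Hypothetical sketch: turn a "<verb> <object>" command into a verbalized
# plan without executing it. Lexicon and templates are illustrative only.
ACTIONS = {
    "get": ["navigate to where {obj} is", "grasp {obj}", "bring {obj} to the operator"],
    "find": ["search each room for {obj}", "announce where {obj} was found"],
    "follow": ["detect {obj}", "keep {obj} in view while moving"],
}

def explain_plan(command: str) -> str:
    """Explain how a command would be accomplished, without executing it."""
    verb, _, obj = command.lower().partition(" ")
    steps = ACTIONS.get(verb)
    if steps is None:
        return f"I do not know how to '{verb}'."
    plan = "; then ".join(step.format(obj=obj or "the target") for step in steps)
    return f"To {command}, I would: {plan}."

print(explain_plan("get a coke"))
# To get a coke, I would: navigate to where a coke is; then grasp a coke;
# then bring a coke to the operator.
```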

Finally, since no counterargument is complete without a suggestion for improvement, here are my 2 cents:

  1. Do major changes in how tests are described and scored, but keep (most) tests as they are.
  2. Describe tests in terms of goals, not actions or sub-tasks to accomplish.
    • Order is (mostly) irrelevant.
    • If the robot meets the goal, scores.
    • No PDF reports, the tasks must be solved with visible results.
    • Remove all abilities that don't lead to the milestones.
    • The goals use keywords related to GPSR commands (e.g. go, follow, deliver, find, etc.).

Some examples

  1. Storing Groceries
    • Goal: Bring all objects from a nearby table to the shelf, grouping them by category.
    • 5 groups are scored.
    • Goal: Open the door of the cupboard.
    • Remark: some of the objects are cutlery or tableware.
  2. Speech & Person Recognition (based on #352)
    • Goal: Find 4 people in the Livingroom
    • Goal: Fetch a command from each person found.
    • Goal: For each command retrieved, explain how it could be accomplished (action planning). There is no need to execute it.
  3. Help me carry
    • Goal: Find the car outside the house (e.g. by following a person).
    • Goal: Bring the bag with groceries back to the house.
    • Goal: Find a volunteer to help carry the groceries in.
    • Goal: Guide the volunteer found to the car.
    • Remark: Note that it is valid for the robot to first find a volunteer and then follow the operator to the car while guiding the volunteer at the same time, burning two transistors with the same spark.
  4. Set a Table:
    • Goal: Place a fork, spoon, knife, or any other cutlery object.
    • Goal: Place a dish, bowl, place mat, or any other tableware object.
    • Goal: Place a third cutlery or tableware object.
    • Goal: Place an object of the food category on a dish or on a bowl; or pour an object of the drinks category in a bowl, mug, cup, or glass.
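
A minimal sketch of how this goal-based scoring could be represented (the class names and point values are made up for illustration):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Goal:
    """A single scorable goal; the order of achievement is irrelevant."""
    description: str
    points: int
    achieved: bool = False

@dataclass
class Test:
    name: str
    goals: List[Goal] = field(default_factory=list)

    def score(self) -> int:
        # Only goals that were met score; no credit for attempted actions.
        return sum(g.points for g in self.goals if g.achieved)

# Hypothetical goals and point values for Storing Groceries:
storing = Test("Storing Groceries", [
    Goal("Group the objects on the shelf by category (per group)", 50),
    Goal("Open the door of the cupboard", 25),
])
storing.goals[1].achieved = True  # the referee marks goals during the test
print(storing.name, storing.score())  # -> Storing Groceries 25
```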

Those are my thoughts so far.

johaq commented 7 years ago

> No PDF reports, the tasks must be solved with visible results.

In my opinion, PDF reports have the advantage that the referee can easily see whether the visible result is actually based on robot perception or just a guess / something hardcoded.

I also thought the reports were a good way to see how other teams performed in a task, one that is more informative than just the score.

kyordhel commented 7 years ago

@johaq

> In my opinion, PDF reports have the advantage that the referee can easily see whether the visible result is actually based on robot perception or just a guess / something hardcoded.

Please provide counterarguments to the following:

  1. The audience couldn't care less about PDF reports.
  2. PDF reports have the annoying side effect of robots doing only object recognition during a manipulation test.
  3. PDF reports have the annoying side effect of delaying score delivery (12 teams × 3 tries × 3 leagues = 108 reports).
  4. PDF reports make scoring harder (all tests should be scored DURING the test IN the arena).

Considering your point, I wouldn't mind checking the PDF report if the robot did something, to check for false positives and, as you say, hard-coding (which violates the fair-play rule). If the robot didn't achieve a single goal, it won't score. Period.

johaq commented 7 years ago

> The audience couldn't care less about PDF reports.

Definitely true for the audience on location. #357 and #355 raise requests to be able to better judge the robots' performance from outside, which I think published task reports can provide.

> PDF reports have the annoying side effect of robots doing only object recognition during a manipulation test.

If you do not want robots to do certain things in a task, don't award points for them.

> PDF reports have the annoying side effect of delaying score delivery. PDF reports make scoring harder.

Having never scored a competition, I don't know very much about this. Maybe a LaTeX template that teams HAVE to use could make this easier; a rough sketch of what that could look like is below.
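
For instance, a minimal sketch of such a mandatory template (the section names are hypothetical, just to show the idea of fixed sections a referee can scan quickly):

```latex
% Hypothetical mandatory report template; section names are illustrative.
\documentclass{article}
\usepackage{graphicx}
\begin{document}

\section*{Team / Test / Attempt}
% e.g. TeamName -- Storing Groceries -- attempt 2

\section*{Recognized objects}
% One line per detection: label, confidence, location.

\section*{Annotated images}
% e.g. \includegraphics[width=\linewidth]{detections.png}

\end{document}
```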

> If the robot didn't achieve a single goal, it won't score. Period.

Agreed. I think this would come with the more goal-driven scoring suggested here.

awesomebytes commented 7 years ago

> October 30th CFP: Qualified Teams Announcement

Is this a fixed date? I haven't seen any reference to it in the CFP (for the SPL leagues the qualification-material deadline is October 15, for OPL November 5, as per the CFP web page).

I'm asking because teams that want to purchase a robot with a 2017 budget will find it very helpful to have this information before 2018 (and as early as possible).

kyordhel commented 7 years ago

@awesomebytes

• DSPL: October 7th (Call for proposals)
• SSPL: October 15th (Qualification materials)
• OPL: November 5th (Qualification materials) / November 26th (Qualification Announcement)

So far only OPL has all deadlines set. The SPLs have a more complex selection procedure, so I can't tell when they will announce something.

kyordhel commented 6 years ago

Closing due to:

  1. No major changes to be done
  2. All deadlines expired