RoboCupAtHome / RuleBook

Rulebook for RoboCup @Home 2024
https://robocupathome.github.io/RuleBook/

Make tests more "scientific" #201

Closed. LoyVanBeek closed this issue 7 years ago.

LoyVanBeek commented 8 years ago

This is difficult, as we also want to have variation and no hard-coded test execution. We'll have to balance repeatability and variation.
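As a minimal sketch of one way to get both (the object and location pools and the `generate_task` helper below are hypothetical, not anything from the rulebook): a seeded pseudo-random task generator lets different seeds give each team a different, non-hard-coded task, while a fixed seed makes any run exactly replayable for the referees.

```python
import random

# Hypothetical sketch: tasks are drawn from controlled pools, so runs
# vary between teams (variation) yet any run can be replayed exactly
# from its seed (repeatability). Pools and phrasing are illustrative.
OBJECTS = ["coke", "apple", "sponge"]
LOCATIONS = ["kitchen table", "shelf", "couch"]

def generate_task(seed: int) -> str:
    rng = random.Random(seed)  # private RNG: a fixed seed reproduces the draw
    obj = rng.choice(OBJECTS)
    loc = rng.choice(LOCATIONS)
    return f"Bring the {obj} from the {loc}."

print(generate_task(7))                       # varies with the seed...
assert generate_task(7) == generate_task(7)   # ...but replays identically
```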

kyordhel commented 8 years ago

To me, this sounds tricky and non-RoboCup. What does "more scientific" mean?

Teams don't come to test in a lab-like environment. Unpredictability and an uncontrolled environment are what we desire. I talked with some trustees about this particular topic and they told me: we intend to bring robots (science) to people, not robots to science.

balkce commented 8 years ago

[strap yourselves in, this is going to be a long one]

Although I agree with Loy and Mau that this topic is tricky (for example: scientific tests are rigorous, which in turn means time, which we don't have in a competition), I think we have to be fair here: "scientific" does not necessarily imply predictable. When I publish results in a scientific journal, I carry out tests that are randomized, to check under which circumstances my algorithm needs improvement and under which it works. And yes, even though these tests are carried out in "controlled" environments, the controls are chosen to compare against standards so the reader has a frame of reference to judge whether the results are good or bad, not to make the test easy for the algorithm.

"Controlled" does not necessarily means "easy". A "controlled" environment may very well be something close to or even more dynamic than a real life environment, if it is chosen properly. In fact, even though the RoboCup@Home arena is difficult and dynamic, it is not at all "uncontrolled": objects are expected to be in certain places, shelves heights are within a given range, commands are given in certain points during tests (ala script), etc. These are all "controlled" variables so that all robots can be compared with one another ala apples to apples.

Again, I'm not saying that I agree completely with the issue (especially since "scientific" may mean a number of things), but I think we can come to a nice middle ground that benefits "science", "robots", and "people". I mean, even repeating the tests in Stage 1 can be considered a step forward towards being "scientific", as well as being welcomed by teams for taking off the pressure of succeeding on the first try. And I'm sure we won't get to this middle ground in the next rulebook version (or ever), but I think some measures can be taken to this effect.

For instance:

TL;DR: the two camps (competition and science) aren't at war with each other, but coming to a compromise is going to be tricky.


LoyVanBeek commented 8 years ago

Just to note: this came up during the TC meeting and I placed it here from my notes. @balkce has some good suggestions; I hope the TC member who brought it up (I'm not sure who) has more.

balkce commented 8 years ago

The TC member who brought it up is Dr. Jose M. Carranza @josemcr0506

Jose, you're up.

josemcr0506 commented 8 years ago

Hi all, I will write some thoughts on this soon. The term is ending this week at my institute, so we are swamped with admin work; apologies if I hold up the conversation.

However, very quickly, I can say I agree with what Caleb has stated. My intention, when I said we should be more scientific, was precisely to propose rules with a clear description of the task, clear restrictions, and a well-understood level of difficulty, rather than ambiguous descriptions of a task whose interpretation is left to "common sense".

If the description of a task or challenge is clear, then the answer to genuine questions about the task should not be "use common sense". I can understand that judges did not want to give away too much information about the task in order to avoid any kind of tailoring or ad-hoc tuning of the robots by the contestant teams; however, the better the task is written, the less is left to interpretation.

I will propose some ideas soon, aiming to provide an example of what I mean and hoping to make clear what I would like to see in terms of task description and evaluation.

Best,

swayshuai commented 8 years ago

I think there is a difference between "scientific in research" and "scientific in rules". "Scientific in rules" can be divided into two parts. First, we need to make sure that the scores accurately reflect the teams' abilities: a team with more basic abilities should always score higher in a task. Second, the tasks we design must accurately test the robots: for instance, using a native speaker in the speech recognition test, allowing more attempts to reduce the role of luck, and so on.

However, I still have my own opinions on scores and tasks for basic abilities. My preliminary idea is that we should encourage the teams to attempt harder actions like opening a door, picking up small items, and so on. The goal of opening a door has been put forward for two years. However, no team has been willing to develop this ability for the Navigation test; only eR@sers used it, in their opening demonstration. If we can't guide the teams towards our long- and mid-term goals, our work on the rulebook is meaningless. I'll open an issue on this as soon as I have clearer ideas.

LoyVanBeek commented 8 years ago

Team ToBi also opened a door, to be clear. It was largely optional though, and to push abilities forward, such actions should not be optional (as opposed to #127).

kyordhel commented 8 years ago

Dear all,

[As Caleb said, strap yourselves in] I'm still not getting what you intend to mean by "scientific", nor do I feel this discussion is getting to the point. Let me elaborate.

  1. It has been claimed that task descriptions should be clear and unambiguous. Please note that the 2015 and 2016 rulebooks were written mostly by Loy, me, Caleb, and Sven. We are all preconditioned by western schools of thought. Yet we had problems trying to establish clear rules because what a German and a Dutchman (as Europeans) find obvious for a home environment, Mexicans conceive differently (e.g. European vs. US standards). Considering Eastern cultures makes it even more challenging. For instance, the sentence "offer a drink to X" is interpreted by us as "ask which drink X wants, if any", and by some Japanese teams as "deliver a drink to X"; therefore the robot asked the operator, not the guest, which drink to bring.
  2. It has been claimed that the rulebook is ambiguous and requires common sense for interpretation. A speaker only says what they know the hearer does not already know about the topic, which makes 80%+ of spoken sentences ambiguous (a known fact, I can tell you, since my research focuses on natural language interfaces). Making a rulebook completely unambiguous would require task descriptions with an infinite number of words, and we are constrained to make the tests as short as possible. Nobody reads anything but the scoresheet, anyway.
  3. On the scientificness of the rules: FIFA and the Olympics have rules that are clear, well defined, and strict, but none of them are scientific in any sense. And still, we require referees and judges who make judgment calls when scoring.
  4. On scoring. I will appeal to the purpose of the league: are we here to score points or to close the gap between humans and robots? Points are one way to measure performance, not the goal. IMO, if I command your robot to bring me a cold beer, I expect the robot to take from the fridge (which is closed) the one that is cold, not the one I stored there half an hour ago for cooling. That's something at least my trained dog can do.
  5. On abilities (this is an opinion I share with Tijn van der Zant and Luca Iocchi). Robotics is not about manipulation, computer vision, navigation, task planning, or speech recognition; it is about integrating all of these to solve complex tasks in dynamic and unpredictable environments. After watching robots solve the Amazon challenge, I don't see the point of having a Manipulation test which practically duplicates that league.
  6. The door. This topic does not fit here, but if ToBI and eR@sers could open the door (and I have evidence that homer@UniKoblenz and Pumas also succeeded), we should expect teams to share knowledge and make it mandatory for all teams.

As a final remark: if the intention is to make it possible to produce publications out of each test run instead of actually solving house chores, we are doing something wrong.

balkce commented 8 years ago

Mau's first two points are basically about unambiguity (having a complete definition of the test) vs. ambiguity (a complete definition is impossible, since there is always something we'll miss describing). This is actually one of the bigger discussions we always have when writing a rulebook, regardless of the "scientificness" behind it. So, in my opinion, this is a separate discussion.

I don't know where the other four points came from, but I agree with them.

The final remark is, I think, one of the things that we need to bridge: for something to be "scientific" does not necessarily mean it yields a publication. Children carry out experiments in a scientific manner in a biology/chemistry lab; that doesn't make their results publishable. When I talk about results for a publication, it's because I follow a certain evaluation methodology from which, if the results are good and worthwhile, a journal paper may come out. For the competition, however, I couldn't care less about having results worthy of a journal paper: I agree, that's not what the competition is for. But having a sensible, repeatable evaluation methodology that gives the teams, as well as the audience, a clear picture of which robot is "better" can also be regarded as "scientific". I think we all agree on this, actually; it's just that the terminology is throwing us off.

kyordhel commented 8 years ago

Caleb,

All points are excerpts from previous comments; I found it unnecessary to cite and finger-point who said what. I also think we agree on that final remark. If someone wants to make publications out of what happened in @Home, well, the recorded data is there, raw and open.