davidjsherman / aseba

Aseba is a set of tools that allow beginners to program robots easily and efficiently.
http://aseba.wikidot.com
GNU Lesser General Public License v3.0
3 stars 2 forks source link

Asebascratch has inconsistent behavior with Thymio RF #23

Open davidjsherman opened 8 years ago

davidjsherman commented 8 years ago

Asebascratch works reliably when connected to the robot over USB, even when daisy-chained behind switches (asebaswitch, asebahttp, or asebaswitch). However, when connected to a robot using the wireless RF connection, its behavior is unreliable.

To separate this problem from job control (issue #22), commits 5914ea26a3cfe4e003c291e397e3b3299f8c4302 and 3fce45484cf4e300f06e5e88cdb07df8d1d5312e introduce message generators that consistently reproduce this behavior.

The attached file traces.tar.gz shows two runs of a newly powered-on Thymio RF robot reacting to the generated messages, either connected using USB (*.wired.*.txt) or connected using the RF dongle (*.rf.*.txt).

stephanemagnenat commented 8 years ago

Analysing the files *.wired.txt, in particular rec.wired.txt, I do not understand while the ASEBA_MESSAGE_LIST_NODES message (TIMESTAMP 0000 a011 2 05 00) happens every 200 to 300 ms. In itself it should not be a drama, but asebascratch is supposed to send them every 1000 ms, right?

stephanemagnenat commented 8 years ago

Note that the format of raw packets is SECOND.MILLISECOND SOURCE_ID TYPE LENGTH DATA_0 ... DATA_N where:

stephanemagnenat commented 8 years ago

Further analysing rec.rf.txt, line 121, it seems that the target answered 8 ASEBA_MESSAGE_NODE_PRESENT (0x900c) at once. The reason is that the ASEBA_MESSAGE_LIST_NODES was sent several times while the target was sending the description (and was blocked doing so) and these requests were queued and processed at once. An interesting element is the delay between these, about 15 ms. It seems quite long to be but maybe it's normal, because these are system messages that require a confirmation.

stephanemagnenat commented 8 years ago

Looking at the round-trip time for the first get variable in rec.rf.txt (line 293) and its answer (line 301), there is a delay of 43 ms, while for the wired (line 316 and 318 of ref.wired.txt) it is only 1 ms.

Moreover, in rec.rf.txt lines 293 to 300 we see that variable at address 0x0092 of size 17 is requested 4 times within an interval of 40 ms, even before the answer of the first request. Note that the message to request variables has identifier 0xa00b.

davidjsherman commented 8 years ago

HttpInterface::HttpInterface pings the network 20 times at 1-second intervals when starting up. Derived class ScratchInterface::ScratchInterface then pings the network once every second.

stephanemagnenat commented 8 years ago

Looking at the raw dump, there seems to be a problem with the timing, see my updated comment just before yours.

stephanemagnenat commented 8 years ago

It seems that the target does not manage to process incoming get variable requests:

while for wired they return the same number:

stephanemagnenat commented 8 years ago

Timing issues aside, a work-around (and probably a safe approach anyway) would be not to request again a variable (set of address, length) that has already been requested.

davidjsherman commented 8 years ago

I agree that there is a lot of latency in the responses, and dropped responses. The target seems not to be able to respond in time. But I reactivated R_state_update broadcasts and do not use any get variable requests. Here are new traces traces_broadcast.tar.gz

The log of the user message from 53191 shows many missing acknowledgment events in the rf trace, even though the outgoing messages (from 0) are the same in both traces. So it looks like the messages never arrive in the VM.

[edit] to make sure that it is not a bandwidth problem, I send R_state_update from the temperature event, that is, only once every second.

stephanemagnenat commented 8 years ago

I will look into these traces as well. I still do not understand why asebascratch is sending list nodes every 300 ms. Sending them while the target is answering its description fills the buffer there.

stephanemagnenat commented 8 years ago

A strange thing is, in rec.rf.txt, after line 158, the target does not answer any more the list nodes messages (0xa011) but still sends the message 0x0018 every second, like in rec.wired.txt.

stephanemagnenat commented 8 years ago

In rec.wired.txt the target only needs 1 ms to answer the outgoing messages (look for 0000 0000 and cfc7 0002), while in rec.rf.txt typically needs 25-30 ms, with a drop rate of 50%, while the bandwidth is about 100 bytes/s! Even worst, when the target does not answer the list nodes ping, the bandwidth is only about 70 bytes/s.

There seems to be something wrong with the target buffers indeed...

stephanemagnenat commented 8 years ago

I am wondering whether there is a bug in the target when it is queuing system messages while it is sending the description. Something interesting to try would be to stop sending list nodes messages once the description was requested until it is completely received, and see if this changes the behaviour later.

davidjsherman commented 8 years ago

OK. If I understand:

  1. During initialization in the constructor, stop sending pingNetwork as soon as descriptionReceived says that target has started to respond with a node description
  2. In run1s, don't ping the network at all (for now)
stephanemagnenat commented 8 years ago

Yes, at least to try. In the long run once we solved the current problem with the rf the ping should be sent, although no more than every 1s, as every node on the network will answer to this request. I just read the source code of the dongle, which is in my emails, and it seems it prints some errors on its serial port if a packet is dropped. Maybe Michael can access it on his development hardware.

stephanemagnenat commented 8 years ago

I found a bug in the function bool HttpInterface::run1s() in switches/http/http.cpp (your branch) at line 309: timeout is decreased too fast because in timeout -= (Aseba::UnifiedTime() - startTime).value; it is always decreased by the difference between the current time and the time of function entry. After this call, startTime should be assigned the current time. Maybe this bug is also the reason why step(timeout) did not work well?

stephanemagnenat commented 8 years ago

Actually this bug is my fault, I apologize. It is corrected in commit aseba-community/aseba@91d6b091fc9454f525d52ad89d6b7e7c5b13a953, excepted in http, because I do not want to create a conflict with your fork.

davidjsherman commented 8 years ago

Timeout bug fixed for PR #502 by commit 997ce4bb9bfb802c7c52c2a5882ea1e2e0c6e200

davidjsherman commented 8 years ago

Here are traces generated from an asebascratch that stops pinging the network as soon as one description starts to arrive. (In wired only 1 ping is necessary, in RF 5 pings are sent and 4 are answered; in both cases only one description is sent.)

sw.wired.txt:                                                              sw.rf.txt:
Found Thymio on port /dev/cu.usbmodem621                                   Found Thymio on port /dev/cu.usbmodem641
a011 list nodes from 0 : protocol version 5                                a011 list nodes from 0 : protocol version 5
a000 presence from 0 : protocol version 5                                  a000 presence from 0 : protocol version 5
900c node present from 53191 : protocol version 5                          a011 list nodes from 0 : protocol version 5
a010 get node description from 0 : dest 53191 protocol version 5           a000 presence from 0 : protocol version 5
9000 description from 53191 : Node thymio-II using protocol version 5      a011 list nodes from 0 : protocol version 5
bytecode: 1534, stack: 32, variables: 620                                  a000 presence from 0 : protocol version 5
named variables: 38                                                        a011 list nodes from 0 : protocol version 5
                                                                           a000 presence from 0 : protocol version 5
local events: 17                                                           900c node present from 53191 : protocol version 5
native functions: 49                                                       a010 get node description from 0 : dest 53191 protocol version 5
9001 named variable from 53191 : _id of size 1                             9000 description from 53191 : Node thymio-II using protocol version 5
9001 named variable from 53191 : event.source of size 1                    bytecode: 1534, stack: 32, variables: 620
9001 named variable from 53191 : event.args of size 32                     named variables: 38
9001 named variable from 53191 : _fwversion of size 2
9001 named variable from 53191 : _productId of size 1                      local events: 17
9001 named variable from 53191 : buttons._raw of size 5                    native functions: 49
9001 named variable from 53191 : button.backward of size 1                 9001 named variable from 53191 : _id of size 1

Traces in short_ping.tar.gz. Messages still seem to be dropped: Q_motion_added 1 100 163 163 0 and Q_motion_added 13 126 137 -137 1 are missing from the RF traces.

Average response time between 0000 0000 and cfc7 0002 is 1.25 ms for wired and 28.2 ms for rf.

stephanemagnenat commented 8 years ago

I added list node and get node description commands to asebacmd, and the dump of time to switch in commit aseba-community/aseba@a4cf68fadffb95370eddb792f6213de49e90f8c4. This should allow to reproduce the lack of answer from the node to the list node command outside the scratch context.

davidjsherman commented 8 years ago

Changing the frequency of the generated R_state_update user events doesn't have a significant effect. The traces in Hz.tar.gz compare wired and rf for updates emitted with temperature (1 Hz), prox (10 Hz), buttons (20 Hz), and motor (100Hz). Messages are still dropped more or less randomly. It doesn't seem to be a bandwidth issue.

         wired:                                  rf:
  1Hz:   Q_motion_added 1 100 163 163 0       |  Q_motion_added 11 122 137 -137 0
  1Hz:   Q_motion_added 11 122 137 -137 1     <
  1Hz:   Q_motion_added 2 100 166 166 2       |  Q_motion_added 2 100 166 166 1
  1Hz:   Q_motion_added 12 124 137 -137 3     |  Q_motion_added 12 124 137 -137 2
  1Hz:   Q_motion_added 3 100 169 169 0       |  Q_motion_added 3 100 169 169 3
  1Hz:   Q_motion_added 13 126 137 -137 1     <
  1Hz:   Q_motion_added 4 100 172 172 2       |  Q_motion_added 4 100 172 172 0
  1Hz:   Q_motion_added 14 128 137 -137 3     |  Q_motion_added 14 128 137 -137 1

 10Hz:   Q_motion_added 1 100 163 163 0          Q_motion_added 1 100 163 163 0
 10Hz:   Q_motion_added 11 122 137 -137 1     <
 10Hz:   Q_motion_added 2 100 166 166 2       |  Q_motion_added 2 100 166 166 1
 10Hz:   Q_motion_added 12 124 137 -137 3     <
 10Hz:   Q_motion_added 3 100 169 169 0       <
 10Hz:   Q_motion_added 13 126 137 -137 1     <
 10Hz:   Q_motion_added 4 100 172 172 2       |  Q_motion_added 4 100 172 172 3
 10Hz:   Q_motion_added 14 128 137 -137 3     |  Q_motion_added 14 128 137 -137 0

 20Hz:   Q_motion_added 1 100 163 163 0          Q_motion_added 1 100 163 163 0
 20Hz:   Q_motion_added 11 122 137 -137 1        Q_motion_added 11 122 137 -137 1
 20Hz:   Q_motion_added 2 100 166 166 2       <
 20Hz:   Q_motion_added 12 124 137 -137 3     |  Q_motion_added 12 124 137 -137 2
 20Hz:   Q_motion_added 3 100 169 169 0       <
 20Hz:   Q_motion_added 13 126 137 -137 1     <
 20Hz:   Q_motion_added 4 100 172 172 2       |  Q_motion_added 4 100 172 172 3
 20Hz:   Q_motion_added 14 128 137 -137 3     <

100Hz:   Q_motion_added 1 100 163 163 0          Q_motion_added 1 100 163 163 0
100Hz:   Q_motion_added 11 122 137 -137 1        Q_motion_added 11 122 137 -137 1
100Hz:   Q_motion_added 2 100 166 166 2       <
100Hz:   Q_motion_added 12 124 137 -137 3     <
100Hz:   Q_motion_added 3 100 169 169 0       <
100Hz:   Q_motion_added 13 126 137 -137 1     <
100Hz:   Q_motion_added 4 100 172 172 2          Q_motion_added 4 100 172 172 2
100Hz:   Q_motion_added 14 128 137 -137 3     <
stephanemagnenat commented 8 years ago

Michael and I cannot reproduce the lack of initial answer to list nodes. We tried this setup:

In our test setup the robot answers directly. Could you try this with yours?

davidjsherman commented 8 years ago

OK. If you want to try mine, I am generating with run-record.sh from 25f7755dd76642e2a578db5c3319f6e842c1f732 in branch test-rf-noping. It runs:

asebaswitch -d 'ser:name=Thymio' > ${DIR}/sw.${SET}.txt &
asebarec 'tcp:;port=33333' > ${DIR}/rec.${SET}.txt &
../Debug/asebascratch --http 3002 --aesl ${AESL} 'tcp:;port=33333' &
sleep 30
curl --silent --output ${DIR}/ev.${SET}.txt http://localhost:3002/events &
perl run-generate-messages.perl
davidjsherman commented 8 years ago

Is this what you mean?

% ./switches/switch/Debug/asebaswitch "ser:name=Thymio" -d -v --rawtime & ./clients/cmd/Debug/asebacmd listnodes
Found Thymio on port /dev/cu.usbmodem621
1463148767.264 * Incoming connection from ser:name=Thymio;baud=115200;stop=1;parity=none;fc=none;bits=8
1463148767.266 * Incoming connection from tcp:host=localhost;port=62485;connectionPort=33333
1463148767.266   a011 list nodes from 0 : protocol version 5
[Fri May 13 16:12:47 2016 267] * Normal connection closed to tcp:host=localhost;port=62485;connectionPort=33333
1463148767.308   900c node present from 53191 : protocol version 5
stephanemagnenat commented 8 years ago

Yes, in some cases the robot stops answering list nodes. We managed to trigger this with studio connecting to an RF Thymio, but still do not manage to reproduce it repeatedly.

davidjsherman commented 8 years ago

Is there a way to see whether the robot is dropping packets? The best that I can do, is immediately send an acknowledgment message when I receive one (see for example the Aesl program thymio_motion_ack.aesl). Because I was wondering whether there was a bug in my Aesl program, that was exposed by RF latency. But the ack_Q_ message is always followed by the corresponding Q_ message. So I don't think my program has put the VM in a loop where it can't receive messages.

stephanemagnenat commented 8 years ago

We have managed to reproduce a stop of description reception if a robot is "far" from an RF point of view. We are now trying with two robots, one near one far, to see if the problem is in the robot or in the dongle.

davidjsherman commented 8 years ago

Are you able to play back any of the recording that I have made?

stephanemagnenat commented 8 years ago

I am working remotely from home (with a switch on Michael's computer) and have to leave in 5 minutes. We unfortunately had no time to replay your records. I'll have a look at the source code of the dongle during the week-end. There might be more than one bug leading to the strange global behaviour we observe, but it is hard to say. I hope we can find minimal reproducible test cases soon.

stephanemagnenat commented 8 years ago

The only current way to see if the dongle looses packets is an UART pin available on its programming connector. I do not know if a similar pin is available in the robot RF module itself. @mbonani?

In parallel, for safety I think asebahttp (and all other aseba programs having a similar behaviour) should ideally be modified not to request variables that have already been requested, to avoid queuing such requests as an effect of the latency of these queries.

stephanemagnenat commented 8 years ago

From the source code of the dongle, list nodes messages are throated as user messages, so the bug preventing them to be received (if it's the case) might be similar to the one leading your Q_ messages to be dropped.

It would be interesting to see if the dropping of user messages still happen if the dongle is unplugged and replugged, (but scratch not relaunched) that is, if it has not received a large number of system messages (the node description).

davidjsherman commented 8 years ago

OK for completeness I will add a pending flag and a last update time to every variable known to asebahttp/scratch, but remember that the traces generated above, except the very first set, do not request any variables (0xa00b get variables) nor does the robot send any variables (0x9005 variables)...

stephanemagnenat commented 8 years ago

I will add a pending flag and a last update time to every variable known to asebahttp/scratch

I think indeed for safety it is good, by variable I mean the set (start_addr, length). But I agree that there is another problem with the RF.

stephanemagnenat commented 8 years ago

Do you have a minimalist aesl code that shows the drop of user messages, but does not use asebahttp or asebascratch? I would like to build a minimal test case that reproduces the problem, so that we can test it in a variety of conditions.

I imagine some aesl code that does some form of ping that we could trigger with a simple client, either a shell script and asebacmd, or medulla and python, or a simple C++ program.

stephanemagnenat commented 8 years ago

I checked the source code of the I2C RF module inside the Thymio, it also outputs information about dropped packets on an UART. So I suggest to build a debug setup in which both the dongle and the module inside a Thymio are connected by UART to a computer, and try to build a minimal test case that illustrates the problem with a good repeatability. With that we should be able to identify, and hopefully fix, the probable bug.

davidjsherman commented 8 years ago

I did a lot of testing, and conclude that there are three separate issues here.

  1. Flow control between Scratch, asebascratch, and thymio_motion.aesl. The _busy list of threads can get out of sync when communications are slow (with RF) and Scratch can overrun the Qid thread queue in thymio_motion.aesl. Queue overruns look like dropped packets! To do: make thread control more robust in asebascratch and thymio_motion.aesl.
  2. Constraints arising from the RF communications protocol, such delaying after node description is sent, and not re-requesting variables. Respecting these constraints avoids dropped messages. To do: discover these constraints, then enforce them in the client, although possibly could be addressed by flow control on the RF dongle.
  3. Other dropped messages in RF communication. Seems dependent on the frequency of message transmission. To do: produce traces for asebaplay.

To my mind these issues are listed in decreasing order of urgency. For (2), I think some constraints are known, but I am writing test code to discover them empirically.

stephanemagnenat commented 8 years ago

Regarding 3, the experiments we made with @mbonani last Friday seem to highlight a bug in the dongle, or the RF module of a Thymio (but less likely). As far as we have observed, this bug is that after a specific condition, no system messages are received from a given node. This condition seems to be related to a bad transmission and therefore probably a packet loss in the physical layer. It is not clear yet:

@mbonani is currently building a setup with the logging of the UART of the dongle, in addition I hope we can soon find a minimal test case, so that we can clearly understand what is happening.

davidjsherman commented 8 years ago

Master branch (75dd027eff6bf6343e368faf74c3c76111d67525 ) now has mock-asebascratch, a simple C++ program that interacts with a Thymio to generate a predictable sequence of Aseba messages reproducing those generated by Scratch through Asebascratch. The generator draws a simple geometric figure, using a mix of motion operations, which block until completed, and instantaneous operations (lights and sounds). These operations and the protocol used for blocking operations are defined in thymio_motion.aesl. It needs to be running on the robot – I use a vmcode.abo on the SD card so that I can reset the robot before each run.

This is not designed as a stress test for Aseba, but as a faithful reproduction of a dialogue between Scratch and the Thymio. The message traces in inirobot/tests/aesl-wired-rf compare the timings of a real Scratch program talking to the robot, with those produced by mock-asebascratch.

Message traces in inirobot/tests/aesl-wired-rf show 13% of the blocking operations time out when using RF. If it is a bug in the Aesl code, then it is one that depends on some property of RF communications.

Mock-scratch, imitate Scratch protocol with thymio_motion.aesl, usage:
mock-asebascratch [options] [targets]*
Options:
-k ITERATIONS   : run for ITERATIONS steps (default: 13)
-t TIMEOUT      : maximum wait for motion event (default: 20000 ms)
-w WAIT         : wait between iterations (default: 0 ms)

The original Scratch program whose messages I traced is: spiro small x52

stephanemagnenat commented 8 years ago

Here is a simple test for detecting user messages dropping between two robots in RF. First define two global events, PingA and PingB, each with one argument. Optionally, these events may have more arguments filled with zeros, to stress test the bandwidth. For instance, using 2 arguments and doing emit PingB [counterB, 0] sends messages of 10 bytes long, allowing an easy computation of the bandwidth.

Then, write this code in the first robot:

var counterA = 0
var counterB = 0

onevent PingA
if event.args[0] == counterA + 1 then
    call leds.top(0, 32, 0)
else
    call leds.top(32, 0, 0)
end
counterA = event.args[0]
call leds.circle(
     ((counterA >> 7) & 1) * 32,
     ((counterA >> 6) & 1) * 32,
     ((counterA >> 5) & 1) * 32,
     ((counterA >> 4) & 1) * 32,
     ((counterA >> 3) & 1) * 32,
     ((counterA >> 2) & 1) * 32,
     ((counterA >> 1) & 1) * 32,
     ((counterA >> 0) & 1) * 32
)

onevent timer0
emit PingB counterB
counterB += 1

onevent button.left
if button.left == 1 then
    timer.period[0] -= 5
end
onevent button.right
if button.right == 1 then
    timer.period[0] += 5
end
onevent button.forward
if button.forward == 1 then
    timer.period[0] = 50
end
onevent button.backward
if  button.backward == 1 then
    timer.period[0] = 0
end

and this code in the second robot:

var counterA = 0
var counterB = 0

onevent PingB
if event.args[0] == counterB + 1 then
    call leds.top(0, 32, 0)
else
    call leds.top(32, 0, 0)
end
counterB = event.args[0]
call leds.circle(
     ((counterB >> 7) & 1) * 32,
     ((counterB >> 6) & 1) * 32,
     ((counterB >> 5) & 1) * 32,
     ((counterB >> 4) & 1) * 32,
     ((counterB >> 3) & 1) * 32,
     ((counterB >> 2) & 1) * 32,
     ((counterB >> 1) & 1) * 32,
     ((counterB >> 0) & 1) * 32
)

onevent timer0
emit PingA counterA
counterA += 1

onevent button.left
if button.left == 1 then
    timer.period[0] -= 5
end
onevent button.right
if button.right == 1 then
    timer.period[0] += 5
end
onevent button.forward
if button.forward == 1 then
    timer.period[0] = 50
end
onevent button.backward
if  button.backward == 1 then
    timer.period[0] = 0
end

EDIT: tests using two Thymios show similar usable bandwidth as in the original RF paper using user messages (http://infoscience.epfl.ch/record/188402/files/thymiorf-icia2013-final.pdf). Also, it seems that the drop rate of user messages increase a lot when using debug messages, such as the "GetVariable" through Studio's "auto refresh" feature.

stephanemagnenat commented 8 years ago

Regarding your last message @davidjsherman and our latest tests:

mbonani commented 8 years ago

some update after testing: I test communication between two robots using previous code making motor turning or not I see no real influence on the radio if motor was turning or not. But I see strange case where robot was no receiving data at all meantime other was sending. I then increase datarate and it was working... I will made more test

mbonani commented 8 years ago

hello, I made some test directly with last scratch I found on the VM today. I take asebascratch and asebahttp and thymiomotion. Nothing really works. Probably I have copy something that is not compatible.

In last thymiomotion, I see also is that you make the compressing in onvent "buttons" and send R_state with envent temperature at 1 Hz.

I make some simple test see this code

var time

timer.period[0]=1

onevent timer0
time++

onevent buttons
emit event time
time=0

I discover that the event buttons is fired at 50Hz. It is comment at 10Hz in your code and it is 20Hz in our wiki (that is wrong). So I wonder if you mismatch the emitting frequencies of the data and that explain why you saturate the wireless and began to lose datas

Between a thymio and dummy node I have no problem until 33Hz for data (exchange of buffer size 27) at 40 Hz it begins to loose. Between two robots it is a little better could work fine until 40Hz.

I suggest to be sure to be sure to not emit more than 25Hz and see if it is enough for scratch using only user message.

davidjsherman commented 8 years ago

It does not seem to be interference from the motors. Running the spirograph example with fake motors (so that odometery works) still leads to many dropped messages. fake motion

davidjsherman commented 8 years ago

On the other hand, blindly resending Q_* messages 200 ms later leads to only 1 dropped message out of 120. Resending is all right because thymio_motion.aesl ignores duplicate tasks (davidjsherman/inirobot-scratch-thymioII@a6724c4a960ecc9b0cbed0f519b939e677240d91). So adding a simple handshaking protocol seems like the surest bet. drop 1 out of 120

davidjsherman commented 8 years ago

I have implemented in 81675caf020f0301314fb825532a09498c290d68 an ad hoc handshaking scheme for the motion events in asebascratch, that resends messages until they are acknowledged or time out. It seems to work. If this seems reliable, perhaps it can be adapted into a more general mechanism, at least for a subset of messages that are deemed important. (Which ones?) spiro success

mbonani commented 8 years ago

Hi, this is a very good news. I do not think flow of sensor have to be handshaked. Only message that are mandatory; in the thymio-motion I think only qmotion have to. Otherwise make something more general is not necessary before we handle more deeply the wireless module. We know the limitation of it, we have a working solution for asebascratch on wireless and we have now to make it tested by users. I will make some test next week. Thank you for your hard work!

mbonani commented 8 years ago

I just see the package Scratch to Thymio, this help a lot for testing. Thanks

ghost commented 8 years ago

I can test too. Where is the package ? Thanks, Didier

davidjsherman commented 8 years ago

I have put up pre-release 0.7.2-alpha with the very latest corrections. In addition to the Scratch2-ThymioII folders I have added unofficial packages of the complete Aseba system including Scratch. If you want the nightly builds I can give you access to our build platform at http://ci.inria.fr.

ghost commented 8 years ago

Ok. Can you give me access to ci.inria.fr ? Thank you !