Open davidjsherman opened 8 years ago
Analysing the files *.wired.txt
, in particular rec.wired.txt
, I do not understand while the ASEBA_MESSAGE_LIST_NODES
message (TIMESTAMP 0000 a011 2 05 00
) happens every 200 to 300 ms. In itself it should not be a drama, but asebascratch is supposed to send them every 1000 ms, right?
Note that the format of raw packets is SECOND.MILLISECOND SOURCE_ID TYPE LENGTH DATA_0 ... DATA_N
where:
SECOND
is number of second since epochMILLISECOND
is the number of millisecond in the current secondSOURCE_ID
is the uint16
source node identifier in hexadecimalTYPE
is the uint16
message type in hexadecimalLENGTH
is the length of payload in number of bytesDATA_x
is the x-th byte of the payloadFurther analysing rec.rf.txt
, line 121, it seems that the target answered 8 ASEBA_MESSAGE_NODE_PRESENT
(0x900c
) at once. The reason is that the ASEBA_MESSAGE_LIST_NODES
was sent several times while the target was sending the description (and was blocked doing so) and these requests were queued and processed at once. An interesting element is the delay between these, about 15 ms. It seems quite long to be but maybe it's normal, because these are system messages that require a confirmation.
Looking at the round-trip time for the first get variable in rec.rf.txt
(line 293) and its answer (line 301), there is a delay of 43 ms, while for the wired (line 316 and 318 of ref.wired.txt
) it is only 1 ms.
Moreover, in rec.rf.txt
lines 293 to 300 we see that variable at address 0x0092 of size 17 is requested 4 times within an interval of 40 ms, even before the answer of the first request. Note that the message to request variables has identifier 0xa00b
.
HttpInterface::HttpInterface pings the network 20 times at 1-second intervals when starting up. Derived class ScratchInterface::ScratchInterface then pings the network once every second.
Looking at the raw dump, there seems to be a problem with the timing, see my updated comment just before yours.
It seems that the target does not manage to process incoming get variable requests:
cat rec.rf.txt | grep "0000 a00b" | wc -l
returns 534cat rec.rf.txt | grep "cfc7 9005" | wc -l
returns 261while for wired they return the same number:
cat rec.wired.txt | grep "0000 a00b" | wc -l
returns 572cat rec.wired.txt | grep "cfc7 9005" | wc -l
returns 572Timing issues aside, a work-around (and probably a safe approach anyway) would be not to request again a variable (set of address, length) that has already been requested.
I agree that there is a lot of latency in the responses, and dropped responses. The target seems not to be able to respond in time. But I reactivated R_state_update broadcasts and do not use any get variable requests. Here are new traces traces_broadcast.tar.gz
The log of the user message from 53191
shows many missing acknowledgment events in the rf trace, even though the outgoing messages (from 0) are the same in both traces. So it looks like the messages never arrive in the VM.
[edit] to make sure that it is not a bandwidth problem, I send R_state_update from the temperature event, that is, only once every second.
I will look into these traces as well. I still do not understand why asebascratch is sending list nodes every 300 ms. Sending them while the target is answering its description fills the buffer there.
A strange thing is, in rec.rf.txt
, after line 158, the target does not answer any more the list nodes messages (0xa011
) but still sends the message 0x0018
every second, like in rec.wired.txt
.
In rec.wired.txt
the target only needs 1 ms to answer the outgoing messages (look for 0000 0000
and cfc7 0002
), while in rec.rf.txt
typically needs 25-30 ms, with a drop rate of 50%, while the bandwidth is about 100 bytes/s! Even worst, when the target does not answer the list nodes ping, the bandwidth is only about 70 bytes/s.
There seems to be something wrong with the target buffers indeed...
I am wondering whether there is a bug in the target when it is queuing system messages while it is sending the description. Something interesting to try would be to stop sending list nodes messages once the description was requested until it is completely received, and see if this changes the behaviour later.
OK. If I understand:
Yes, at least to try. In the long run once we solved the current problem with the rf the ping should be sent, although no more than every 1s, as every node on the network will answer to this request. I just read the source code of the dongle, which is in my emails, and it seems it prints some errors on its serial port if a packet is dropped. Maybe Michael can access it on his development hardware.
I found a bug in the function bool HttpInterface::run1s()
in switches/http/http.cpp
(your branch) at line 309: timeout
is decreased too fast because in timeout -= (Aseba::UnifiedTime() - startTime).value;
it is always decreased by the difference between the current time and the time of function entry. After this call, startTime
should be assigned the current time. Maybe this bug is also the reason why step(timeout)
did not work well?
Actually this bug is my fault, I apologize. It is corrected in commit aseba-community/aseba@91d6b091fc9454f525d52ad89d6b7e7c5b13a953, excepted in http, because I do not want to create a conflict with your fork.
Timeout bug fixed for PR #502 by commit 997ce4bb9bfb802c7c52c2a5882ea1e2e0c6e200
Here are traces generated from an asebascratch that stops pinging the network as soon as one description starts to arrive. (In wired only 1 ping is necessary, in RF 5 pings are sent and 4 are answered; in both cases only one description is sent.)
sw.wired.txt: sw.rf.txt:
Found Thymio on port /dev/cu.usbmodem621 Found Thymio on port /dev/cu.usbmodem641
a011 list nodes from 0 : protocol version 5 a011 list nodes from 0 : protocol version 5
a000 presence from 0 : protocol version 5 a000 presence from 0 : protocol version 5
900c node present from 53191 : protocol version 5 a011 list nodes from 0 : protocol version 5
a010 get node description from 0 : dest 53191 protocol version 5 a000 presence from 0 : protocol version 5
9000 description from 53191 : Node thymio-II using protocol version 5 a011 list nodes from 0 : protocol version 5
bytecode: 1534, stack: 32, variables: 620 a000 presence from 0 : protocol version 5
named variables: 38 a011 list nodes from 0 : protocol version 5
a000 presence from 0 : protocol version 5
local events: 17 900c node present from 53191 : protocol version 5
native functions: 49 a010 get node description from 0 : dest 53191 protocol version 5
9001 named variable from 53191 : _id of size 1 9000 description from 53191 : Node thymio-II using protocol version 5
9001 named variable from 53191 : event.source of size 1 bytecode: 1534, stack: 32, variables: 620
9001 named variable from 53191 : event.args of size 32 named variables: 38
9001 named variable from 53191 : _fwversion of size 2
9001 named variable from 53191 : _productId of size 1 local events: 17
9001 named variable from 53191 : buttons._raw of size 5 native functions: 49
9001 named variable from 53191 : button.backward of size 1 9001 named variable from 53191 : _id of size 1
Traces in short_ping.tar.gz. Messages still seem to be dropped: Q_motion_added 1 100 163 163 0
and Q_motion_added 13 126 137 -137 1
are missing from the RF traces.
Average response time between 0000 0000
and cfc7 0002
is 1.25 ms for wired and 28.2 ms for rf.
I added list node and get node description commands to asebacmd
, and the dump of time to switch in commit aseba-community/aseba@a4cf68fadffb95370eddb792f6213de49e90f8c4. This should allow to reproduce the lack of answer from the node to the list node command outside the scratch context.
Changing the frequency of the generated R_state_update
user events doesn't have a significant effect.
The traces in Hz.tar.gz compare wired and rf for updates emitted with temperature
(1 Hz), prox
(10 Hz), buttons
(20 Hz), and motor
(100Hz). Messages are still dropped more or less randomly.
It doesn't seem to be a bandwidth issue.
wired: rf:
1Hz: Q_motion_added 1 100 163 163 0 | Q_motion_added 11 122 137 -137 0
1Hz: Q_motion_added 11 122 137 -137 1 <
1Hz: Q_motion_added 2 100 166 166 2 | Q_motion_added 2 100 166 166 1
1Hz: Q_motion_added 12 124 137 -137 3 | Q_motion_added 12 124 137 -137 2
1Hz: Q_motion_added 3 100 169 169 0 | Q_motion_added 3 100 169 169 3
1Hz: Q_motion_added 13 126 137 -137 1 <
1Hz: Q_motion_added 4 100 172 172 2 | Q_motion_added 4 100 172 172 0
1Hz: Q_motion_added 14 128 137 -137 3 | Q_motion_added 14 128 137 -137 1
10Hz: Q_motion_added 1 100 163 163 0 Q_motion_added 1 100 163 163 0
10Hz: Q_motion_added 11 122 137 -137 1 <
10Hz: Q_motion_added 2 100 166 166 2 | Q_motion_added 2 100 166 166 1
10Hz: Q_motion_added 12 124 137 -137 3 <
10Hz: Q_motion_added 3 100 169 169 0 <
10Hz: Q_motion_added 13 126 137 -137 1 <
10Hz: Q_motion_added 4 100 172 172 2 | Q_motion_added 4 100 172 172 3
10Hz: Q_motion_added 14 128 137 -137 3 | Q_motion_added 14 128 137 -137 0
20Hz: Q_motion_added 1 100 163 163 0 Q_motion_added 1 100 163 163 0
20Hz: Q_motion_added 11 122 137 -137 1 Q_motion_added 11 122 137 -137 1
20Hz: Q_motion_added 2 100 166 166 2 <
20Hz: Q_motion_added 12 124 137 -137 3 | Q_motion_added 12 124 137 -137 2
20Hz: Q_motion_added 3 100 169 169 0 <
20Hz: Q_motion_added 13 126 137 -137 1 <
20Hz: Q_motion_added 4 100 172 172 2 | Q_motion_added 4 100 172 172 3
20Hz: Q_motion_added 14 128 137 -137 3 <
100Hz: Q_motion_added 1 100 163 163 0 Q_motion_added 1 100 163 163 0
100Hz: Q_motion_added 11 122 137 -137 1 Q_motion_added 11 122 137 -137 1
100Hz: Q_motion_added 2 100 166 166 2 <
100Hz: Q_motion_added 12 124 137 -137 3 <
100Hz: Q_motion_added 3 100 169 169 0 <
100Hz: Q_motion_added 13 126 137 -137 1 <
100Hz: Q_motion_added 4 100 172 172 2 Q_motion_added 4 100 172 172 2
100Hz: Q_motion_added 14 128 137 -137 3 <
Michael and I cannot reproduce the lack of initial answer to list nodes. We tried this setup:
asebaswitch "ser:name=Thymio-II" -d -v --rawtime
asebacmd listnodes
In our test setup the robot answers directly. Could you try this with yours?
OK. If you want to try mine, I am generating with run-record.sh from 25f7755dd76642e2a578db5c3319f6e842c1f732 in branch test-rf-noping. It runs:
asebaswitch -d 'ser:name=Thymio' > ${DIR}/sw.${SET}.txt &
asebarec 'tcp:;port=33333' > ${DIR}/rec.${SET}.txt &
../Debug/asebascratch --http 3002 --aesl ${AESL} 'tcp:;port=33333' &
sleep 30
curl --silent --output ${DIR}/ev.${SET}.txt http://localhost:3002/events &
perl run-generate-messages.perl
Is this what you mean?
% ./switches/switch/Debug/asebaswitch "ser:name=Thymio" -d -v --rawtime & ./clients/cmd/Debug/asebacmd listnodes
Found Thymio on port /dev/cu.usbmodem621
1463148767.264 * Incoming connection from ser:name=Thymio;baud=115200;stop=1;parity=none;fc=none;bits=8
1463148767.266 * Incoming connection from tcp:host=localhost;port=62485;connectionPort=33333
1463148767.266 a011 list nodes from 0 : protocol version 5
[Fri May 13 16:12:47 2016 267] * Normal connection closed to tcp:host=localhost;port=62485;connectionPort=33333
1463148767.308 900c node present from 53191 : protocol version 5
Yes, in some cases the robot stops answering list nodes. We managed to trigger this with studio connecting to an RF Thymio, but still do not manage to reproduce it repeatedly.
Is there a way to see whether the robot is dropping packets?
The best that I can do, is immediately send an acknowledgment message when I receive one (see for example the Aesl program thymio_motion_ack.aesl). Because I was wondering whether there was a bug in my Aesl program, that was exposed by RF latency. But the ack_Q_
message is always followed by the corresponding Q_
message. So I don't think my program has put the VM in a loop where it can't receive messages.
We have managed to reproduce a stop of description reception if a robot is "far" from an RF point of view. We are now trying with two robots, one near one far, to see if the problem is in the robot or in the dongle.
Are you able to play back any of the recording that I have made?
I am working remotely from home (with a switch on Michael's computer) and have to leave in 5 minutes. We unfortunately had no time to replay your records. I'll have a look at the source code of the dongle during the week-end. There might be more than one bug leading to the strange global behaviour we observe, but it is hard to say. I hope we can find minimal reproducible test cases soon.
The only current way to see if the dongle looses packets is an UART pin available on its programming connector. I do not know if a similar pin is available in the robot RF module itself. @mbonani?
In parallel, for safety I think asebahttp (and all other aseba programs having a similar behaviour) should ideally be modified not to request variables that have already been requested, to avoid queuing such requests as an effect of the latency of these queries.
From the source code of the dongle, list nodes messages are throated as user messages, so the bug preventing them to be received (if it's the case) might be similar to the one leading your Q_ messages to be dropped.
It would be interesting to see if the dropping of user messages still happen if the dongle is unplugged and replugged, (but scratch not relaunched) that is, if it has not received a large number of system messages (the node description).
OK for completeness I will add a pending flag and a last update time to every variable known to asebahttp/scratch, but remember that the traces generated above, except the very first set, do not request any variables (0xa00b get variables) nor does the robot send any variables (0x9005 variables)...
I will add a pending flag and a last update time to every variable known to asebahttp/scratch
I think indeed for safety it is good, by variable I mean the set (start_addr, length). But I agree that there is another problem with the RF.
Do you have a minimalist aesl code that shows the drop of user messages, but does not use asebahttp or asebascratch? I would like to build a minimal test case that reproduces the problem, so that we can test it in a variety of conditions.
I imagine some aesl code that does some form of ping that we could trigger with a simple client, either a shell script and asebacmd, or medulla and python, or a simple C++ program.
I checked the source code of the I2C RF module inside the Thymio, it also outputs information about dropped packets on an UART. So I suggest to build a debug setup in which both the dongle and the module inside a Thymio are connected by UART to a computer, and try to build a minimal test case that illustrates the problem with a good repeatability. With that we should be able to identify, and hopefully fix, the probable bug.
I did a lot of testing, and conclude that there are three separate issues here.
_busy
list of threads can get out of sync when communications are slow (with RF) and Scratch can overrun the Qid thread queue in thymio_motion.aesl. Queue overruns look like dropped packets! To do: make thread control more robust in asebascratch and thymio_motion.aesl.To my mind these issues are listed in decreasing order of urgency. For (2), I think some constraints are known, but I am writing test code to discover them empirically.
Regarding 3, the experiments we made with @mbonani last Friday seem to highlight a bug in the dongle, or the RF module of a Thymio (but less likely). As far as we have observed, this bug is that after a specific condition, no system messages are received from a given node. This condition seems to be related to a bad transmission and therefore probably a packet loss in the physical layer. It is not clear yet:
@mbonani is currently building a setup with the logging of the UART of the dongle, in addition I hope we can soon find a minimal test case, so that we can clearly understand what is happening.
Master branch (75dd027eff6bf6343e368faf74c3c76111d67525 ) now has mock-asebascratch
, a simple C++ program that interacts with a Thymio to generate a predictable sequence of Aseba messages reproducing those generated by Scratch through Asebascratch. The generator draws a simple geometric figure, using a mix of motion operations, which block until completed, and instantaneous operations (lights and sounds). These operations and the protocol used for blocking operations are defined in thymio_motion.aesl. It needs to be running on the robot – I use a vmcode.abo on the SD card so that I can reset the robot before each run.
This is not designed as a stress test for Aseba, but as a faithful reproduction of a dialogue between Scratch and the Thymio. The message traces in inirobot/tests/aesl-wired-rf compare the timings of a real Scratch program talking to the robot, with those produced by mock-asebascratch.
Message traces in inirobot/tests/aesl-wired-rf show 13% of the blocking operations time out when using RF. If it is a bug in the Aesl code, then it is one that depends on some property of RF communications.
Mock-scratch, imitate Scratch protocol with thymio_motion.aesl, usage:
mock-asebascratch [options] [targets]*
Options:
-k ITERATIONS : run for ITERATIONS steps (default: 13)
-t TIMEOUT : maximum wait for motion event (default: 20000 ms)
-w WAIT : wait between iterations (default: 0 ms)
The original Scratch program whose messages I traced is:
Here is a simple test for detecting user messages dropping between two robots in RF. First define two global events, PingA
and PingB
, each with one argument. Optionally, these events may have more arguments filled with zeros, to stress test the bandwidth. For instance, using 2 arguments and doing emit PingB [counterB, 0]
sends messages of 10 bytes long, allowing an easy computation of the bandwidth.
Then, write this code in the first robot:
var counterA = 0
var counterB = 0
onevent PingA
if event.args[0] == counterA + 1 then
call leds.top(0, 32, 0)
else
call leds.top(32, 0, 0)
end
counterA = event.args[0]
call leds.circle(
((counterA >> 7) & 1) * 32,
((counterA >> 6) & 1) * 32,
((counterA >> 5) & 1) * 32,
((counterA >> 4) & 1) * 32,
((counterA >> 3) & 1) * 32,
((counterA >> 2) & 1) * 32,
((counterA >> 1) & 1) * 32,
((counterA >> 0) & 1) * 32
)
onevent timer0
emit PingB counterB
counterB += 1
onevent button.left
if button.left == 1 then
timer.period[0] -= 5
end
onevent button.right
if button.right == 1 then
timer.period[0] += 5
end
onevent button.forward
if button.forward == 1 then
timer.period[0] = 50
end
onevent button.backward
if button.backward == 1 then
timer.period[0] = 0
end
and this code in the second robot:
var counterA = 0
var counterB = 0
onevent PingB
if event.args[0] == counterB + 1 then
call leds.top(0, 32, 0)
else
call leds.top(32, 0, 0)
end
counterB = event.args[0]
call leds.circle(
((counterB >> 7) & 1) * 32,
((counterB >> 6) & 1) * 32,
((counterB >> 5) & 1) * 32,
((counterB >> 4) & 1) * 32,
((counterB >> 3) & 1) * 32,
((counterB >> 2) & 1) * 32,
((counterB >> 1) & 1) * 32,
((counterB >> 0) & 1) * 32
)
onevent timer0
emit PingA counterA
counterA += 1
onevent button.left
if button.left == 1 then
timer.period[0] -= 5
end
onevent button.right
if button.right == 1 then
timer.period[0] += 5
end
onevent button.forward
if button.forward == 1 then
timer.period[0] = 50
end
onevent button.backward
if button.backward == 1 then
timer.period[0] = 0
end
EDIT: tests using two Thymios show similar usable bandwidth as in the original RF paper using user messages (http://infoscience.epfl.ch/record/188402/files/thymiorf-icia2013-final.pdf). Also, it seems that the drop rate of user messages increase a lot when using debug messages, such as the "GetVariable" through Studio's "auto refresh" feature.
Regarding your last message @davidjsherman and our latest tests:
rec.mock-0ms-rf.txt
, why are there still list nodes messages (0000 a011
) every 300 ms?stream->write(...)
to block. When using a switch, this blocking is done in the switch, but the packets sent by Scratch/Studio might be buffered in the computer's network layer. I do not know how much this happens, but we have observed that when using Studio through a switch, GetVariable messages are still sent 10 seconds after Studio has disconnected.some update after testing: I test communication between two robots using previous code making motor turning or not I see no real influence on the radio if motor was turning or not. But I see strange case where robot was no receiving data at all meantime other was sending. I then increase datarate and it was working... I will made more test
hello, I made some test directly with last scratch I found on the VM today. I take asebascratch and asebahttp and thymiomotion. Nothing really works. Probably I have copy something that is not compatible.
In last thymiomotion, I see also is that you make the compressing in onvent "buttons" and send R_state with envent temperature at 1 Hz.
I make some simple test see this code
var time
timer.period[0]=1
onevent timer0
time++
onevent buttons
emit event time
time=0
I discover that the event buttons is fired at 50Hz. It is comment at 10Hz in your code and it is 20Hz in our wiki (that is wrong). So I wonder if you mismatch the emitting frequencies of the data and that explain why you saturate the wireless and began to lose datas
Between a thymio and dummy node I have no problem until 33Hz for data (exchange of buffer size 27) at 40 Hz it begins to loose. Between two robots it is a little better could work fine until 40Hz.
I suggest to be sure to be sure to not emit more than 25Hz and see if it is enough for scratch using only user message.
It does not seem to be interference from the motors. Running the spirograph example with fake motors (so that odometery works) still leads to many dropped messages.
On the other hand, blindly resending Q_* messages 200 ms later leads to only 1 dropped message out of 120. Resending is all right because thymio_motion.aesl ignores duplicate tasks (davidjsherman/inirobot-scratch-thymioII@a6724c4a960ecc9b0cbed0f519b939e677240d91). So adding a simple handshaking protocol seems like the surest bet.
I have implemented in 81675caf020f0301314fb825532a09498c290d68 an ad hoc handshaking scheme for the motion events in asebascratch, that resends messages until they are acknowledged or time out. It seems to work. If this seems reliable, perhaps it can be adapted into a more general mechanism, at least for a subset of messages that are deemed important. (Which ones?)
Hi, this is a very good news. I do not think flow of sensor have to be handshaked. Only message that are mandatory; in the thymio-motion I think only qmotion have to. Otherwise make something more general is not necessary before we handle more deeply the wireless module. We know the limitation of it, we have a working solution for asebascratch on wireless and we have now to make it tested by users. I will make some test next week. Thank you for your hard work!
I just see the package Scratch to Thymio, this help a lot for testing. Thanks
I can test too. Where is the package ? Thanks, Didier
I have put up pre-release 0.7.2-alpha with the very latest corrections. In addition to the Scratch2-ThymioII folders I have added unofficial packages of the complete Aseba system including Scratch. If you want the nightly builds I can give you access to our build platform at http://ci.inria.fr.
Ok. Can you give me access to ci.inria.fr ? Thank you !
Asebascratch works reliably when connected to the robot over USB, even when daisy-chained behind switches (asebaswitch, asebahttp, or asebaswitch). However, when connected to a robot using the wireless RF connection, its behavior is unreliable.
To separate this problem from job control (issue #22), commits 5914ea26a3cfe4e003c291e397e3b3299f8c4302 and 3fce45484cf4e300f06e5e88cdb07df8d1d5312e introduce message generators that consistently reproduce this behavior.
The attached file traces.tar.gz shows two runs of a newly powered-on Thymio RF robot reacting to the generated messages, either connected using USB (
*.wired.*.txt
) or connected using the RF dongle (*.rf.*.txt
).