Release 2019.01 - RC2

aabadie commented 5 years ago

This issue lists the status of all tests for the Release Candidate 2 of the 2019.01 release.

Specs tested:

[ ] 01-ci
- [ ] Task #01 - Compile test
- [x] Task #02 - Unittests on native @aabadie
- [x] Task #03 - Unittests on native separated @aabadie
- [x] Task #04 - Unittests on iotlab-m3 @aabadie
[x] 02-tests @aabadie
[ ] 03-single-hop-ipv6-icmp
- [x] Task #01 - ICMPv6 multicast echo on native @aabadie
- [x] Task #02 - ICMPv6 link-local echo on native @aabadie
- [x] Task #03 - ICMPv6 link-local echo on native (1 hour)
- [ ] Task #04 - ICMPv6 stress test on native (1 hour)
[ ] 04-single-hop-6lowpan-icmp @aabadie
[x] 05-single-hop-route @aabadie
[x] 06-single-hop-udp
[ ] 07-multi-hop
[ ] 08-interop
[x] 09-coap @kb2ma
- [x] Task #01 - CORD Endpoint
- [x] Task #02 - Confirmable retries
- [x] Task #03 - Block1
- [x] Task #04 - Block2
- [x] Task #05 - Observe registration and notification
[x] 10-icmpv6-error @miri64
[ ] 99-compile-and-test-one-board
- [ ] Task #01 - Run tests on different hardwares
- [x] iotlab-m3 with BUILD_IN_DOCKER=1: same results as the normal build, plus a failed compilation for tests/riotboot
- [ ] arduino-mega2560: issues and details are listed below. Full error list: https://github.com/RIOT-OS/Release-Specs/issues/98#issuecomment-461403025
  - Firmware failing
    - tests/bitarithm_timings/: the firmware does not get past the line "bitarithm_msb: 102096 iterations per second"
    - tests/periph_eeprom: the EEPROM_CLEAR_BYTE API is not handled in the test, see https://github.com/RIOT-OS/RIOT/pull/11005
    - tests/periph_gpio: both the firmware and the test script, I think. Some values look like they overflow (4294930872us), and the test times out waiting for one line at 02-bench.py:22. For bench 0 4 it works, at least, with a 30-second timeout for expect.
    - tests/pipe: the output is completely wrong and mixed up
    - tests/pkg_jsmn: the output is wrong; no value seems to be read
    - tests/pkg_libb2: blake2_tests.test_blake2s (/data/riotbuild/riotproject/tests/pkg_libb2/main.c 56) memcmp(b2s, hash, sizeof hash) == 0
    - tests/pkg_lora-serialization:
```
2019-02-12 12:33:16,709 - INFO # Test 2
2019-02-12 12:33:16,734 - INFO # Coordinates and unix time
2019-02-12 12:33:16,771 - INFO # ---------------------------------
2019-02-12 12:33:16,816 - INFO # - Writing coordinates: -33.905052, 151.26641
2019-02-12 12:33:16,849 - INFO # - Writing unix time: 1467632413
2019-02-12 12:33:16,902 - INFO # - Encoded:  64 a6 ff ff 60 24 00 00 1d 4b 00 00
2019-02-12 12:33:16,951 - INFO # - Expected: 64 a6 fa fd 6a 24 04 09 1d 4b 7a 57
2019-02-12 12:33:16,984 - INFO # ---------------------------------
2019-02-12 12:33:16,992 - INFO # FAILED
```
    - tests/posix_semaphore:
```
2019-02-12 12:35:09,967 - INFO # ######################### TEST4:
2019-02-12 12:35:09,983 - INFO # first: sem_init s1
2019-02-12 12:35:10,012 - INFO # first: wait 1 sec for s1
2019-02-12 12:35:14,026 - INFO # first: timed out
2019-02-12 12:35:14,075 - INFO # first: waited too long 3999748 usec => FAILED
2019-02-12 12:35:14,108 - INFO # ######################### DONE
```
    - tests/pkg_micro-ecc: fails in the middle, but micro-ecc does not have 8-bit/16-bit support.
    - tests/pkg_tiny-asn1: ERROR: Could not allocate the memory for the ASN.1 objects.
    - tests/ps_schedstatistics: the main thread cannot read stdin. Changing the threads' priority to a lower one (THREAD_PRIORITY_MAIN + 1) makes the test pass. (BTW, the test is wrongly written, as | has a meaning for pexpect.)
    - tests/rng: float printing issue: https://github.com/RIOT-OS/RIOT/pull/10999
    - tests/trickle: it stops after [TRICKLE RESET] (known issue, I think)
  - Test script failing
    - tests/evtimer_msg/: the margin is not big enough for the platform: At 770 ms received msg 0: "supposed to be 659"
  - Non-reproducible error
    - tests/isr_yield_higher: the firmware was printing TEST FAILED, but I cannot reproduce it anymore
  - Unchecked issue
    - tests/libfixmath: the output is flowing correctly. I see some output containing letters (2.H532), but I am not sure whether that is the issue. Not checked yet.
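
One possible reading of the tests/pkg_lora-serialization log above: the "Encoded" bytes are what you would get if the 32-bit values were squeezed through a 16-bit int somewhere on the way. The Python sketch below only checks that arithmetic against the logged bytes. The scaling by 1e6 and the little-endian 32-bit layout are inferred from the "Expected" line, and the helper names are made up for illustration; none of this is taken from the package source.

```python
import struct

def encode_le32(value):
    """Encode a 32-bit value (signed or unsigned) as 4 little-endian bytes."""
    return struct.pack("<I", value & 0xFFFFFFFF)

def truncate_to_int16(value):
    """Mimic a value passed through a signed 16-bit int (assumed failure mode)."""
    low = value & 0xFFFF
    return low - 0x10000 if low & 0x8000 else low

def hex_bytes(data):
    """Format bytes the way the test log prints them."""
    return " ".join(f"{b:02x}" for b in data)

lat = -33905052         # -33.905052 scaled by 1e6 (scaling inferred from the log)
unix_time = 1467632413  # value written in the log

# The full 32-bit encoding reproduces the "Expected" bytes.
print(hex_bytes(encode_le32(lat)))        # 64 a6 fa fd
print(hex_bytes(encode_le32(unix_time)))  # 1d 4b 7a 57

# Truncation to 16 bits plus sign extension reproduces the "Encoded" bytes.
print(hex_bytes(encode_le32(truncate_to_int16(lat))))        # 64 a6 ff ff
print(hex_bytes(encode_le32(truncate_to_int16(unix_time))))  # 1d 4b 00 00
```

The longitude shows the same pattern in its upper two bytes (00 00 instead of 04 09). Its low byte also differs (60 instead of 6a), which this sketch does not explain.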

miri64 commented 5 years ago

As a bonus I rebased https://github.com/RIOT-OS/RIOT/pull/10908 to 2019.01-RC2 and ran the test. It passed ;-).

miri64 commented 5 years ago

8.3 still works on 2019.01-RC2 as well.

miri64 commented 5 years ago

8.4 still works. BTW also without the compile flag (which I accidentally forgot to set), since the Raspberry Pi I'm using has the ABRO configured in its radvd.conf ;-).

miri64 commented 5 years ago

10 still works; however, I still found 2 bugs in the testing procedures (see #100 and #101).

cladmi commented 5 years ago

I will run the automated tests on boards, at least iotlab-m3 and samr21-xpro, and for the next step see which other boards I have here that I can run them on.

cladmi commented 5 years ago

For both iotlab-m3 and samr21-xpro, all automated tests ran, with these failures:

Failures during test:
- tests/gnrc_ipv6_ext
- tests/gnrc_rpl_srh
- tests/pkg_fatfs_vfs (this one needs an SD card, so it cannot really run)

The gnrc_ipv6_ext and gnrc_rpl_srh tests require root when running the test. I could run them manually. Note, arm-gcc is not in my normal path so not seen when run with sudo, so we currently have printed errors. However, it is using my locally installed python packages so no need for a special setup there.

BOARD=samr21-xpro make -C tests/gnrc_ipv6_ext flash

sudo BOARD=samr21-xpro make --no-print-directory -C tests/gnrc_ipv6_ext/ test
/bin/sh: 1: arm-none-eabi-gcc: not found
/home/harter/work/git/worktree/riot_release/makefiles/toolchain/gnu.inc.mk:18: objcopy not found. Hex file will not be created.
...................SUCCESS

And for gnrc_rpl_srh

BOARD=samr21-xpro make -C tests/gnrc_rpl_srh flash

sudo BOARD=samr21-xpro make --no-print-directory -C tests/gnrc_rpl_srh test
/bin/sh: 1: arm-none-eabi-gcc: not found
/home/harter/work/git/worktree/riot_release/makefiles/toolchain/gnu.inc.mk:18: objcopy not found. Hex file will not be created.
..............SUCCESS

I think these tests should be somehow defined as ADMIN_TESTS or something in RIOT. This would allow special handling for these ones like having an admin-test target or something.

miri64 commented 5 years ago

> Note, arm-gcc is not in my normal path so not seen when run with sudo, so we currently have printed errors.

You don't need to build and you don't need to flash with sudo (unless you did not configure your udev rules of course). Just the execution of the test script requires root.

miri64 commented 5 years ago

> I think these tests should be somehow defined as ADMIN_TESTS or something in RIOT. This would allow special handling for these ones like having an admin-test target or something.

I think the name admin-test might be misleading. There is a difference between the root user and an admin user (though they might be the same person in some cases) ;-)

cladmi commented 5 years ago

> Note, arm-gcc is not in my normal path so not seen when run with sudo, so we currently have printed errors.

> You don't need to build and you don't need to flash with sudo (unless you did not configure your udev rules of course). Just the execution of the test script requires root.

Yes, but it still tries to evaluate arm-none-eabi-gcc even when running tests. It is an unrelated issue; I just noted the error message.

> I think these tests should be somehow defined as ADMIN_TESTS or something in RIOT. This would allow special handling for these ones like having an admin-test target or something.

> I think the name admin-test might be misleading. There is a difference between the root user and an admin user (though they might be the same person in some cases) ;-)

It was more about the concept than the name. I was not confident about root-test either; maybe privileged-tests or something. But that would be a dedicated discussion in an issue/PR.
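
To illustrate the concept, a test script that needs root could check its privileges up front and stop with a clear message instead of failing somewhere in the middle. This is only a minimal Python sketch: the function name and the message are hypothetical and not part of RIOT's testrunner.

```python
import os

def check_privileges(euid=None):
    """Return (ok, message). euid can be injected so the check is testable."""
    if euid is None:
        euid = os.geteuid()  # effective user ID of the running process (Unix only)
    if euid == 0:
        return True, "running as root"
    return False, "this test script needs root; re-run only the test step with sudo"

# Example with a made-up, non-root user ID.
ok, message = check_privileges(euid=1000)
print(ok, message)
```

A script could call check_privileges() first and exit with a non-zero status when it returns False, so that only the test step, and not build or flash, has to run with sudo.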

cladmi commented 5 years ago

Compiling and running the tests for iotlab-m3 using docker, on a machine with no toolchain, also has tests/riotboot failing to compile:

make RIOT_CI_BUILD=1 CC_NOCOLOR=1 --no-print-directory -C ./tests/riotboot clean all --jobs
make: *** No rule to make target '/srv/ilab-builds/workspace/git/riot_release/tests/riotboot/bin/iotlab-m3/tests_riotboot-slot0.bin', needed by '/srv/ilab-builds/workspace/git/riot_release/tests/riotboot/bin/iotlab-m3/tests_riotboot-slot0.hdr'.  Stop.
make: *** Waiting for unfinished jobs....
compiling /srv/ilab-builds/workspace/git/riot_release/dist/tools/riotboot_gen_hdr/bin/genhdr...

Return value: 2

aabadie commented 5 years ago

@cgundogan, @jia200x, @leandrolanzieri, you checked some items for the RC1. Can you try again on RC2? That would help a lot, thanks!

cladmi commented 5 years ago

I just re-ran the gnrc_ipv6_ext example with samr21-xpro after rebooting, and the test fails:

BOARD=samr21-xpro  make -C tests/gnrc_ipv6_ext/ flash
...
sudo BOARD=samr21-xpro PATH=${PATH} make -C tests/gnrc_ipv6_ext/ test
make: Entering directory '/home/harter/work/git/worktree/riot_release/tests/gnrc_ipv6_ext'

Traceback (most recent call last):
  File "/home/harter/work/git/worktree/riot_release/tests/gnrc_ipv6_ext/tests/01-run.py", line 646, in <module>
    sys.exit(run(testfunc, timeout=1, echo=False))
  File "/home/harter/work/git/worktree/riot_release/dist/pythonlibs/testrunner/__init__.py", line 56, in run
    testfunc(child)
  File "/home/harter/work/git/worktree/riot_release/tests/gnrc_ipv6_ext/tests/01-run.py", line 596, in testfunc
    lladdr_src = get_host_lladdr(tap)
  File "/home/harter/work/git/worktree/riot_release/tests/gnrc_ipv6_ext/tests/01-run.py", line 587, in get_host_lladdr
    "Can't find host link-local address on interface {}".format(tap)
AssertionError: Can't find host link-local address on interface tap0
/home/harter/work/git/worktree/riot_release/tests/gnrc_ipv6_ext/../../Makefile.include:568: recipe for target 'test' failed
make: *** [test] Error 1
make: Leaving directory '/home/harter/work/git/worktree/riot_release/tests/gnrc_ipv6_ext'

cladmi commented 5 years ago

As the test is running ethos alone, without pre-creating the interface, it stays in the down state.

ip link show dev tap0
27: tap0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 36:21:8b:bf:9a:1f brd ff:ff:ff:ff:ff:ff

Doing as in start_network.sh, i.e. creating the interface and bringing it up before the test, it works.

It also works when dist/tools/tapsetup/tapsetup has been run beforehand; this should have been my state on the previous run.
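
A test script could detect this state before it starts. The Python sketch below parses the flag list from `ip link show` output, using the tap0 output above as sample input. The function names are made up for illustration, and this is not what 01-run.py currently does.

```python
import re

def link_flags(ip_link_output):
    """Extract the flag set between '<' and '>' from `ip link show` output."""
    match = re.search(r"<([^>]*)>", ip_link_output)
    return set(match.group(1).split(",")) if match else set()

def link_is_up(ip_link_output):
    """The interface is administratively up when the UP flag is present."""
    return "UP" in link_flags(ip_link_output)

sample = (
    "27: tap0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN "
    "mode DEFAULT group default qlen 1000\n"
    "    link/ether 36:21:8b:bf:9a:1f brd ff:ff:ff:ff:ff:ff\n"
)
print(link_is_up(sample))  # False
```

In a real script the input could come from `subprocess.check_output(["ip", "link", "show", "dev", tap])`, decoded to text.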

cladmi commented 5 years ago

Note on the tests with arduino-mega2560: there are several test failures, including some that look important:

Failures during test:
- [tests/bitarithm_timings](tests/bitarithm_timings/test.failed)
- [tests/evtimer_msg](tests/evtimer_msg/test.failed)
- [tests/isr_yield_higher](tests/isr_yield_higher/test.failed)
- [tests/libfixmath](tests/libfixmath/test.failed)
- [tests/periph_eeprom](tests/periph_eeprom/test.failed)
- [tests/periph_gpio](tests/periph_gpio/test.failed)
- [tests/pipe](tests/pipe/test.failed)
- [tests/pkg_fatfs_vfs](tests/pkg_fatfs_vfs/test.failed)
- [tests/pkg_jsmn](tests/pkg_jsmn/test.failed)
- [tests/pkg_libb2](tests/pkg_libb2/test.failed)
- [tests/pkg_lora-serialization](tests/pkg_lora-serialization/test.failed)
- [tests/pkg_micro-ecc](tests/pkg_micro-ecc/test.failed)
- [tests/pkg_tiny-asn1](tests/pkg_tiny-asn1/test.failed)
- [tests/posix_semaphore](tests/posix_semaphore/test.failed)
- [tests/ps_schedstatistics](tests/ps_schedstatistics/test.failed)
- [tests/rng](tests/rng/test.failed)
- [tests/trickle](tests/trickle/test.failed)

I will look into them and give details.

aabadie commented 5 years ago

@miri64, I manually ran the 3.4 task and the packet buffer of the receiving node is not empty, even several seconds after the end of the test. I put the content in this gist. I forgot to check the pktbuf of the sending nodes, but they were still reachable.

For my setup, I created 11 tap interfaces using tapsetup and attached 11 native instances to each tap. Then 10 of the nodes started to send pings (I set count to 1000) to a single one (on tap0).

miri64 commented 5 years ago

> For my setup, I created 11 tap interfaces using tapsetup and attached 11 native instances to each tap. Then 10 of the nodes started to send pings (I set count to 1000) to a single one (on tap0).

Why not use the ping6 command of your host machine?

miri64 commented 5 years ago

I can't really reproduce your results :-/. I tried

sudo true; for _ in $(seq 10); do sudo ping -c 1000 -s 1452 -f fe80::9494:71ff:fe6c:d5a%tapbr0 & done
sudo true; for _ in $(seq 10); do sudo ping -c 1000 -s 1452 -i .01 fe80::9494:71ff:fe6c:d5a%tapbr0 & done
sudo true; for _ in $(seq 10); do sudo ping -c 1000 -s 1452 -i .001 fe80::9494:71ff:fe6c:d5a%tapbr0 & done

None caused any leaks.

miri64 commented 5 years ago

I'll see what happens, when I run a long-term experiment tonight.

aabadie commented 5 years ago

A few tests:

I ran your commands (only with -i 0) from my host and the pktbuf on the native node is never filled even during the flood. The RIOT node is always active, no error message is displayed
Then I pinged the RIOT node from different shells and got a lot of "gnrc_netif: possibly lost interrupt." messages on the native node. It remains active, even during the flood, and the pktbuf is always empty.

My guess is that in these 2 cases, the src address of the ping is always the same (the one of the Linux host), so maybe a lot of them could be dropped during the flood? When I tried from 10 RIOT native nodes, the src addresses were all different because each node was attached to a different interface.

miri64 commented 5 years ago

> I ran your commands (only with -i 0) from my host and the pktbuf on the native node is never filled even during the flood. The RIOT node is always active, no error message is displayed

-f implies -i 0 ;-)

miri64 commented 5 years ago

I ran it during the night. I also had possibly lost interrupts (which just means that the gnrc_netif message queue was full), but the packet buffer is empty.

miri64 commented 5 years ago

I will try again from different addresses though.

miri64 commented 5 years ago

@aabadie can you share your test script? I was also not able to reproduce the issue you described using https://gist.github.com/miri64/fac4df86be36f0a65d9bdb4d2f09d5c7 (the check of the packet buffer fails to match for some reason, but it is empty).

aabadie commented 5 years ago

> can you share your test script?

Not possible: I did the setup (tapsetup -c 11), started the native instances and launched the pings manually in different terminals. I can try your script.

miri64 commented 5 years ago

But how is this a stress test? With count 1000 the ping is faster done than you can copy the ping command to all terminals.

aabadie commented 5 years ago

> With count 1000 the ping is faster done than you can copy the ping command to all terminals.

Sure, but prepare each terminal with the ping commands and then switch between them and launch them. Using keyboard shortcuts, this can be done fast enough to trigger a lot of ping timeouts, slowing down everything. I have no idea what is going on during this test and how it is supposed to behave. Maybe the python script is introducing side effects because of the GIL? Are we sure the pings are performed in parallel?
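
On the GIL question: the GIL only serializes Python bytecode inside one interpreter. Child processes started with subprocess.Popen are separate OS processes and run concurrently. Whether the gist script launches its pings this way is not checked here. The sketch below shows the pattern in Python, with short sleeping children as stand-ins for the real ping commands.

```python
import subprocess
import sys

def run_in_parallel(commands):
    """Start every command first, then wait for all of them to finish."""
    children = [subprocess.Popen(cmd) for cmd in commands]
    return [child.wait() for child in children]

# Stand-in for the ping commands: four children that each sleep briefly.
stand_in = [sys.executable, "-c", "import time; time.sleep(0.2)"]
print(run_in_parallel([stand_in] * 4))  # [0, 0, 0, 0]
```

All children are started before the first wait(), so the waiting order does not delay any of them.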

miri64 commented 5 years ago

> With count 1000 the ping is faster done than you can copy the ping command to all terminals.

> Sure, but prepare each terminal with the ping commands and then switch between them and launch them. Using keyboard shortcuts, this can be done fast enough to trigger a lot of ping timeouts, slowing down everything. I have no idea what is going on during this test and how it is supposed to behave. Maybe the python script is introducing side effects because of the GIL? Are we sure the pings are performed in parallel?

No. But since I was already quite annoyed by doing this with two terminals when I analyzed https://github.com/RIOT-OS/RIOT/issues/10672, I'm going to write some script that does the same thing in bash and come back to you.

miri64 commented 5 years ago

Ok, I was able to reproduce with this script https://gist.github.com/miri64/fac4df86be36f0a65d9bdb4d2f09d5c7#file-03-4-test-sh. Since it is working for one neighbor but having problems with 10, I suspect something to go wrong in the neighbor discovery.

miri64 commented 5 years ago

I even was able to produce a segmentation fault now :o

miri64 commented 5 years ago

Though I'd still like to find out how exactly it happens, I opened https://github.com/RIOT-OS/RIOT/pull/10975 to fix the segfault for now. With that fix I also wasn't able to reproduce the leak with count 1000 and 10000, though I don't understand why either. I will investigate further.

miri64 commented 5 years ago

https://github.com/RIOT-OS/RIOT/pull/10975 makes the occurrence of the leak harder to reproduce, but I already saw one again. My suspicion is that the following (from https://github.com/RIOT-OS/RIOT/pull/10975#issuecomment-461917017)

> When just the last element is removed the situation "fixes" itself, since the entry is still referred to by the first position, so re-adding it just leads to a loop of one (breaking the list, but not the system ;-)).

led to a number of leaks that now don't occur anymore.

miri64 commented 5 years ago

In your gist, I'm still very confused about the start

00000000  02  01  42  B9  98  78  C8  00  02  01  42  B9  98  78  C8  00
00000010  02  01  42  B9  98  78  C8  00  02  01  42  B9  98  78  C8  00
00000020  02  01  42  B9  98  78  C8  00  00  00  00  00  50  AD  64  56

apart from the last 8 byte (a start of a packet snip) I'm not really sure what the repeating sequence is... :-/

miri64 commented 5 years ago

@aabadie With https://github.com/RIOT-OS/RIOT/pull/10978 I can't produce any leaks at the moment.

I had one occurrence while debugging this where I had a gnrc_netif_hdr stuck in the packet buffer (all packets that are released with the fix in https://github.com/RIOT-OS/RIOT/pull/10978 should not have a netif header, since it is removed here), so I'm not 100% confident that it removes all leaks for the case you described.

cgundogan commented 5 years ago

I will re-do the multihop tests once I have stable Internet connectivity. Probably around noon.

miri64 commented 5 years ago

> In your gist, I'm still very confused about the start
> 00000000  02  01  42  B9  98  78  C8  00  02  01  42  B9  98  78  C8  00
> 00000010  02  01  42  B9  98  78  C8  00  02  01  42  B9  98  78  C8  00
> 00000020  02  01  42  B9  98  78  C8  00  00  00  00  00  50  AD  64  56
> apart from the last 8 byte (a start of a packet snip) I'm not really sure what the repeating sequence is... :-/

Ah, those are target link-layer address options for the address 42:b9:98:78:c8:00.
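
For reference, each 8-byte group decodes as a neighbor discovery option in the type/length/value layout of RFC 4861: type 2 (target link-layer address), length 1 (in units of 8 octets), followed by the 6-byte address. A small Python sketch of that decoding, applied to the first 40 bytes of the dump; the function name is made up for illustration.

```python
def parse_ndp_options(data):
    """Split bytes into (type, payload) neighbor discovery options (RFC 4861)."""
    options = []
    offset = 0
    while offset + 2 <= len(data):
        opt_type, opt_len = data[offset], data[offset + 1]
        if opt_len == 0:            # length 0 is invalid; stop instead of looping forever
            break
        end = offset + 8 * opt_len  # length counts units of 8 octets, header included
        if end > len(data):
            break
        options.append((opt_type, data[offset + 2:end]))
        offset = end
    return options

# The repeating sequence from the dump, five times (offsets 0x00 to 0x27).
dump = bytes.fromhex("02 01 42 B9 98 78 C8 00" * 5)
for opt_type, payload in parse_ndp_options(dump):
    print(opt_type, ":".join(f"{b:02x}" for b in payload))
```

Each of the five lines printed shows type 2 with the address 42:b9:98:78:c8:00.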

cladmi commented 5 years ago

I added the details of arduino-mega2560 failures in the main post. I will see what I can fix.

cladmi commented 5 years ago

It looks like the overflow in tests/periph_gpio comes from the fact that the benchmark takes its timing from within masked interrupts, and xtimer_now_usec does not seem to be implemented to work from within masked interrupts on arduino-mega2560.
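
The reported value 4294930872 is just below 2^32, which is what an unsigned 32-bit subtraction yields when the end timestamp reads earlier than the start timestamp. The Python sketch below shows only that arithmetic. The start and end values are hypothetical, and it says nothing about why the timer would read that way.

```python
UINT32_MASK = 0xFFFFFFFF

def elapsed_us(start, end):
    """Difference of two 32-bit microsecond timestamps, wrapped like uint32_t."""
    return (end - start) & UINT32_MASK

def as_signed32(value):
    """Reinterpret a uint32_t value as int32_t."""
    return value - (1 << 32) if value & 0x80000000 else value

print(as_signed32(4294930872))  # -36424
print(elapsed_us(36424, 0))     # 4294930872
```

Read as a signed value, the logged number is -36424 us, i.e. the end timestamp appears 36424 us before the start.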

aabadie commented 5 years ago

Closing in favor of #105

aabadie commented 5 years ago

Closing now that there's #105

RIOT-OS / Release-Specs

Release 2019.01 - RC2 #98