Linaro / lite-lava-docker-compose

LITE Team LAVA docker dispatcher
MIT License
5 stars 10 forks source link

LAVA tests may get output from the previous test binary #143

Open pfalcon opened 3 years ago

pfalcon commented 3 years ago

I've seen this before, but didn't record cases, and now I do. https://staging.validation.linaro.org/scheduler/job/282676 shows that a test may get output from the previous binary flashed on the board (instead of the binary which was flashed by the current test job running). Output parsing then may start based on this old output. And we even may imagine that it may stop based on this old output. Then we may get false positive (or negative, too) test result, where old testsuite passed and that's what we parsed, but later real testsuite actually failed. I never saw such a case. But the matter, we aren't protect from such cases.

For a long time, my "usual suspect" of where the problem lies was ser2net, hence https://github.com/Linaro/lite-lava-docker-compose/issues/129. But more or less detailed investigation I undertook in that ticket showed that ser2net itself doesn't do any particular magic, beyond sending to board UART BREAK condition when a network client connects to it. I didn't see any particular "caching" to be documented (or visible) on its side.

So, the "caching" problem would be somewhere else. And we should consider any part of the connection pipeline: LAVA Lab additional hardware (hubs/multiplexors) -> Linux kernel -> ser2net -> expect. Of particular interest is that so far I saw such cases only with the LAVA Lab jobs, but it can be easily explained by the fact that when running locally, I usually flash the same test binary over and over again, while in the Lab, there's true variety.

Or it can be not the case of "caching" after all. Maybe we truly connect before the new test binary is fully flashed, and the old one still runs. But that's even more strange explanation, with the job linked above - there, we make sure that we follow the proper CMSIS DAP flashing procedure, and wait for CMSIS DAP to report "flashing complete" event. Still, maybe it would make sense to try ol' good delay between flashing and actual serial connection.

pfalcon commented 3 years ago

And we should consider any part of the connection pipeline: LAVA Lab additional hardware (hubs/multiplexors) -> Linux kernel -> ser2net -> expect.

I reproduce this in local setup with RasPi. So, we can strike out "LAVA Lab additional hardware" from the list.

Trying 127.0.0.1...
Connected to ser2net.
Escape character is '^]'.
*** Booting Zephyr OS build zephyr-v2.4.0-2668-g201a8d29cdb2  ***
Running test suite poll_api
ok: start string found, lava test monitoring started
test monitoring timeout: 60 seconds
=========================================================�*** Booting Zephyr OS build zephyr-v2.4.0-2668-g201a8d29cdb2  ***
Running test suite test_log_list
===================================================================
START - test_log_list
 PASS - test_log_list
pfalcon commented 3 years ago

pyocd boot method is in particular known to run target binary immediately after flashing (unlike cmsis-dap). To avoid that, need to pass -O resume_on_disconnect=false to pyocd.