Linaro / lite-lava-docker-compose

LITE Team LAVA docker dispatcher
MIT License

Known ser2net issues related to the integrity of the received data #129

Open pfalcon opened 3 years ago

pfalcon commented 3 years ago

These issues are attributed to ser2net, but are most likely rooted in the underlying UART connection (which involves a particular USB-to-UART adapter, its driver, etc.). However, some of the issues could be amplified by ser2net, and could potentially be alleviated or worked around on the ser2net side (with suitable configuration and/or patching).

I've seen 2 issues:

  1. Some initial characters received after the connection starts in LAVA (supposedly, after the board is reset/powered on) may be cut off or garbled. Examples:
Trying 172.21.0.9...
Connected to ser2net.
Escape character is '^]'.
���@ooting Zephyr OS build zephyr-v1.14.0-5799-g42c5b0a7fafa  ***
[00:00:00.005,000] [0m<inf> net_config: Initializing network[0m
Trying 172.21.0.9...
Connected to ser2net.
Escape character is '^]'.
������ting Zephyr OS build zephyr-v2.2.0-1442-g4215968b7466  ***
[00:00:00.006,000] [0m<inf> net_config: Initializing network[0m
Trying 172.21.0.9...
Connected to ser2net.
Escape character is '^]'.
�** Booting Zephyr OS build zephyr-v2.2.0-1442-g4215968b7466  ***

A (not so) obvious workaround here is to not match from the very beginning of the "boot" message; e.g., in the examples above, "ing Zephyr OS build" would work. But it's unclear how much the number of garbled/cut bytes depends on the particular hardware and other variables (like system load). It's also "not so obvious" because it doesn't happen all the time (or on all hardware?), so it's easy to forget about this trick.
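A minimal Python sketch of this workaround, using the three garbled banners above as test data. The pattern and function names are made up for illustration; this is not how LAVA itself matches output, only a demonstration that a pattern anchored a few characters into the banner survives the damage, while a strict one does not.

```python
import re

# Strict pattern: expects the banner to arrive intact from its first byte.
STRICT_RE = re.compile(r"\*\*\* Booting Zephyr OS build (\S+)")

# Tolerant pattern (illustrative): skips the first few characters of the
# banner, which are the ones observed to be cut off or garbled.
TOLERANT_RE = re.compile(r"ing Zephyr OS build (\S+)")

# Banner lines from the logs above; "\ufffd" stands for each garbled byte.
SAMPLES = [
    "\ufffd\ufffd\ufffd@ooting Zephyr OS build zephyr-v1.14.0-5799-g42c5b0a7fafa  ***",
    "\ufffd\ufffd\ufffd\ufffd\ufffd\ufffdting Zephyr OS build zephyr-v2.2.0-1442-g4215968b7466  ***",
    "\ufffd** Booting Zephyr OS build zephyr-v2.2.0-1442-g4215968b7466  ***",
]


def find_boot_banner(console_text, pattern=TOLERANT_RE):
    """Return the build string from the boot banner, or None if not found."""
    match = pattern.search(console_text)
    return match.group(1) if match else None


for sample in SAMPLES:
    strict = find_boot_banner(sample, STRICT_RE)
    tolerant = find_boot_banner(sample)
    print("strict:", strict, "| tolerant:", tolerant)
```

How many leading characters to skip is a guess based on these three samples only; a board that loses more bytes would still defeat the tolerant pattern.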

  2. The second issue is the opposite. I don't have handy evidence, but for some time I've had a growing suspicion that on some setups, under some circumstances, ser2net instead caches old UART content from the previous firmware run. I.e., some output gets cached, then new firmware is programmed, then the board resets, but the test gets the stale cached content first. It can be matched by a LAVA test, either producing a false positive (it's the old test that succeeded; the current one may very well fail) or confusing the flow of the test, so the result is not reliable.

This issue needs further investigation. An "obvious" workaround would be to require each test/sample to print its (unique) name, and match on that, but is that enforceable, e.g. in Zephyr? And it definitely doesn't work with arbitrary software out there (we can't patch everything for the deficiencies of our test infra).
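To make the "match on the unique name" idea concrete, here is a rough Python sketch. It assumes the firmware prints a line naming the test (like the "Running test suite ..." line Zephyr's ztest prints) and only trusts output that follows that line. The function names and the second test name are hypothetical.

```python
def output_since_marker(console_text, marker):
    """Return console text from the last occurrence of marker, or None."""
    index = console_text.rfind(marker)
    if index < 0:
        return None
    return console_text[index:]


def run_succeeded(console_text, test_marker,
                  success_text="PROJECT EXECUTION SUCCESSFUL"):
    """True only if success_text appears after the expected test's marker."""
    fresh = output_since_marker(console_text, test_marker)
    return fresh is not None and success_text in fresh


# Stale output from a previous firmware run, followed by the start of the
# current (hypothetical) test, which has not reported success yet.
console = (
    "Running test suite test_log_list\n"
    "PROJECT EXECUTION SUCCESSFUL\n"
    "Running test suite test_other\n"
    "START - test_other_case\n"
)

# A naive check is fooled by the stale success line.
print("naive:", "PROJECT EXECUTION SUCCESSFUL" in console)
# Matching on the current test's name first is not.
print("marker-based:", run_succeeded(console, "Running test suite test_other"))
```

This only helps when consecutive runs print different names; stale output from an earlier run of the same test would still match.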

pfalcon commented 3 years ago

@galak: I wonder, if you saw issues like that and can add/comment anything?

In all fairness, these issues are kind of a baseline for testing. You first pose them, then provide good, reliable, robust answers, and only then can there be similarly robust automated testing. And I don't remember anybody talking or writing about it. Maybe it's just me, but then, well, we have a chance to talk about these issues and the intended solutions to them, make sure they're recorded properly (e.g. in the wiki), and applied consistently and properly (and kept under scrutiny in case something goes wrong).

pfalcon commented 3 years ago

The second issue is the opposite. I don't have handy evidence, but for some time I've had a growing suspicion that on some setups, under some circumstances, ser2net instead caches old UART content

Ok, the foundation for this suspicion (besides the belief that I cursorily saw false positives which can be attributed to it) is easy to reproduce: run telnet localhost 5001. You'll immediately get some output:

Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
*** Booting Zephyr OS build v2.4.0-rc1-55-g6fe1fde9dcb1  ***
Running test suite test_log_list
===================================================================
START - test_log_list
 PASS - test_log_list
===================================================================
START - test_log_list_multiple_items
 PASS - test_log_list_multiple_items
===================================================================
Test suite test_log_list succeeded
===================================================================
PROJECT EXECUTION SUCCESSFUL

I.e. ser2net caches some output, and then serves it at once when you connect.

But not only that: connecting again gets the same cached output served again, and again! This is completely contrary to how a UART port works. With it, you connect and start to receive only the new data which appears after you connected. One would expect the same from any faithful "UART proxy", but nope, ser2net does it differently.
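The reconnect experiment can be scripted instead of running telnet by hand. Below is a rough Python sketch that connects twice and checks whether the boot banner shows up both times. The function names are made up, the banner text is taken from the log above, and the script only observes that the output repeats, it says nothing about why.

```python
import socket


def read_initial_output(host, port, quiet=2.0):
    """Connect and collect data until the line is quiet for `quiet` seconds."""
    chunks = []
    with socket.create_connection((host, port), timeout=quiet) as sock:
        sock.settimeout(quiet)
        try:
            while True:
                data = sock.recv(4096)
                if not data:  # peer closed the connection
                    break
                chunks.append(data)
        except socket.timeout:
            pass  # nothing more arrived within the quiet period
    return b"".join(chunks)


def banner_on_every_connect(host, port, banner=b"Booting Zephyr OS build",
                            attempts=2, quiet=2.0):
    """True if the banner is received on each of `attempts` fresh connections."""
    # In telnet mode the raw stream also carries telnet negotiation bytes,
    # so look for the banner rather than comparing whole captures.
    return all(
        banner in read_initial_output(host, port, quiet)
        for _ in range(attempts)
    )


# Example (port 5001 as in the telnet command above):
#   print(banner_on_every_connect("localhost", 5001))
```

A plain UART would be expected to return False here unless the board happened to reboot between connections.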

pfalcon commented 3 years ago

Ok, if nothing else helps, read the man page: https://linux.die.net/man/8/ser2net :

The program comes up normally as a daemon, opens the TCP ports specified in the configuration file, and waits for connections. Once a connection occurs, the program attempts to set up the connection and open the serial port.

Aha, so it works in yet another way, different from intuition and/or suspicions. And connecting with picocom before running telnet in another window shows that picocom also receives some input (characters are split between the 2 sinks, picocom and ser2net). So that output is not cached, it's actually regenerated: ser2net somehow "resets" the board on connection, in a way that picocom doesn't.

The only thing left is to figure out how it does that. Playing in picocom with DTR, etc. solves the mystery:

C-\ Generate a break sequence on the serial line. A break sequence is usually generated by marking (driving to logical one) the serial Tx line for an amount of time corresponding to several character durations.

Ok, so there's logic in it all, it's just not everyone knows it ;-).

pfalcon commented 3 years ago

Ok, confirmed that adding NOBREAK to ser2net.conf disables this behavior. It's described in the man page as:

NOBREAK Disables automatic clearing of the break setting of the port.

Not exactly the clearest of descriptions.
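For reference, a small Python sketch of where NOBREAK goes. It assumes the old-style ser2net.conf line format from the man page linked above (<TCP port>:<state>:<timeout>:<device>:<options>). Port 5001 is the one used earlier in this thread; the device path and serial settings are made-up examples, and the helper name is hypothetical.

```python
def add_nobreak(conf_line):
    """Append NOBREAK to the options of an old-style ser2net.conf port line."""
    # Old-style format: <TCP port>:<state>:<timeout>:<device>:<options>
    # Assumes a plain port number in the first field (no embedded colons).
    port, state, timeout, device, options = conf_line.strip().split(":", 4)
    option_list = options.split()
    if "NOBREAK" not in option_list:
        option_list.append("NOBREAK")
    return ":".join([port, state, timeout, device, " ".join(option_list)])


# Device path and serial parameters below are illustrative assumptions.
line = "5001:telnet:0:/dev/ttyACM0:115200 8DATABITS NONE 1STOPBIT"
print(add_nobreak(line))
```

Newer ser2net releases use a YAML configuration instead, so this line format applies only to the version documented by that man page.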

With this setting, nothing resets the board any more (LAVA doesn't do it itself), so tests fail.

In other words, in our local setup, we rely on ser2net to boot the board for us, not doing that explicitly in LAVA.