Closed phil-opp closed 1 week ago
How is this related to #542 ?
Should we merge 542 first?
And if so should i make change within 542?
This PR isn't finished yet, but it uses a different approach than #542. The idea is that we report all node errors to the CLI in their original form and include some additional metadata (e.g. timestamp). This way, we can do the printing in the CLI based on all the available information. So the CLI could print a short summary by default and provide options to print more details. Also, we can then format the errors in a nicer way because the CLI is aware of the terminal size and can use colors and text decorations (e.g. bold text).
@haixuanTao I updated the PR description with some example output. Please let me know what you think!
As follow-up changes we can implement your suggestion from https://github.com/dora-rs/dora/pull/548#issuecomment-2167964016 and maybe some nicer terminal formatting (colors, text decoration, proper lines, etc).
It also seems to be no way to see other errors reason whether they are similar to the first one or not related at all.
019026ac-8bd8-7b67-bec4-62b0a9f3fca1
^Cdataflow failed: Node `reachy-vision-client` failed: exited because of signal SIGKILL
There are 3 more errors. Check the `out/019026ac-8bd8-7b67-bec4-62b0a9f3fca1` folder for full details.
For example in this example, I have probably an error about the grace duration on the first node, but there is no way for me to know what was the others errors.
In the past we used to have a grace duration information, and now we onty have SIGKILL
.
I'm slightly confused as this only print out one error, I would expect to print all of them if they happen at the same time.
Sure, we can print them all. The error filtering and formatting now happens on the CLI side, so we can be as verbose as we like. We probably still want to skip cascading errors (i.e. "node failed because another node exited before init"), right?
Also, I don't like that the default
dora start
does not return anything in case of error. Can we make the attach the default behaviour because otherwise people starting with dora are going to be confused.
Fine with me! We could add a --detach
argument to run it in the background and make dora start
equivalent to dora start --attach
. I'll prepare a separate PR for that.
In the past we used to have a grace duration information, and now we onty have
SIGKILL
.
The warning is still printed by the daemon, it's just not visible if the daemon is spawned in background (e.g. using dora up
) or on a remote machine. The proper solution for this would be to add a new exit cause for this and communicate it back to the CLI. I'll try to implement something like this tomorrow.
I'm slightly confused as this only print out one error, I would expect to print all of them if they happen at the same time.
I pushed 94fecd6 to print all node errors that are not consequential errors (e.g. "node X exited too early").
I think it is important to have errors when they happen in this PR as currently if a node fails but the others keep on running. The dataflow can run forever before an error message is emitted.
The only way to know there is an error is to use ctrl+c
Could we take the times to write some test of the error message?
Could we take the times to write some test of the error message?
Which part do you want to test?
I think it is important to have errors when they happen in this PR as currently if a node fails but the others keep on running. The dataflow can run forever before an error message is emitted.
I prepared a PR with a basic log framework in https://github.com/dora-rs/dora/pull/559. This PR enables daemons to send log messages to the CLI (via the coordinator). The CLI can then print daemon warnings/error as they happen.
In the past we used to have a grace duration information, and now we onty have
SIGKILL
.The warning is still printed by the daemon, it's just not visible if the daemon is spawned in background (e.g. using
dora up
) or on a remote machine. The proper solution for this would be to add a new exit cause for this and communicate it back to the CLI. I'll try to implement something like this tomorrow.
I pushed ee394361 to handle grace duration kills properly. edit: I shortened the error message in ece3f72a.
The output now is (the ^C
indicates a ctrl-c
event):
❯ dora start dataflow.yml --attach
01904986-3d33-702e-9ee5-bb54a3a5c3be
^CDataflow 01904986-3d33-702e-9ee5-bb54a3a5c3be failed:
Node `rust-node` failed: node was killed by dora because it didn't react to a stop message in time (SIGKILL)
Allows us to mark certain node errors as cascading, which will deprioritize them when printed.
Also updates the error printing code to only print the error that happened first.
Example Output:
Node
rust-sink
exits with a success exit status before initializing dora connection:Node
rust-sink
errored before initialization:TODO:
--detach
arg todora start
and make--attach
the default