Print only first node error and report more metadata in dataflow results

phil-opp commented 3 weeks ago

Allows us to mark certain node errors as cascading, which will deprioritize them when printed.

Also updates the error printing code to only print the error that happened first.

Example Output:

Node rust-sink exits with a success exit status before initializing dora connection:

❯ dora start dataflow.yml --attach
01901beb-7fd7-7183-8584-8e6284c84031
dataflow failed: Node `dora-record` failed: exited with code: 1

This error occurred because node `rust-sink` exited before connecting to dora.

There are 2 more errors. Check the `out/01901beb-7fd7-7183-8584-8e6284c84031` folder for full details.

Node rust-sink errored before initialization:

❯ dora start dataflow.yml --attach
01902594-d2e4-7b4f-a2d5-0d3e059a52fa
dataflow failed: Node `rust-sink` failed: exited with code: 1

Stderr output:
---------------------------------------------------------------------------------
Error: foo

Location:
    examples/rust-dataflow/sink/src/main.rs:5:5

[...]
---------------------------------------------------------------------------------

There are 3 more errors. Check the `out/01902594-d2e4-7b4f-a2d5-0d3e059a52fa` folder for full details.

TODO:

[x] mark node failures as cascading when initialization fails because other node exited too early
[x] testing
[x] load stderr
[x] Only skip cascading errors
[x] Add error cause for grace period kills
[x] Separate PR: Add --detach arg to dora start and make --attach the default
- Opened separate PR in https://github.com/dora-rs/dora/pull/561

haixuanTao commented 2 weeks ago

How is this related to #542 ?

Should we merge 542 first?

And if so should i make change within 542?

phil-opp commented 2 weeks ago

This PR isn't finished yet, but it uses a different approach than #542. The idea is that we report all node errors to the CLI in their original form and include some additional metadata (e.g. timestamp). This way, we can do the printing in the CLI based on all the available information. So the CLI could print a short summary by default and provide options to print more details. Also, we can then format the errors in a nicer way because the CLI is aware of the terminal size and can use colors and text decorations (e.g. bold text).

phil-opp commented 2 weeks ago

@haixuanTao I updated the PR description with some example output. Please let me know what you think!

As follow-up changes we can implement your suggestion from https://github.com/dora-rs/dora/pull/548#issuecomment-2167964016 and maybe some nicer terminal formatting (colors, text decoration, proper lines, etc).

haixuanTao commented 2 weeks ago

It also seems to be no way to see other errors reason whether they are similar to the first one or not related at all.

019026ac-8bd8-7b67-bec4-62b0a9f3fca1
^Cdataflow failed: Node `reachy-vision-client` failed: exited because of signal SIGKILL

There are 3 more errors. Check the `out/019026ac-8bd8-7b67-bec4-62b0a9f3fca1` folder for full details.

For example in this example, I have probably an error about the grace duration on the first node, but there is no way for me to know what was the others errors.

In the past we used to have a grace duration information, and now we onty have SIGKILL.

phil-opp commented 2 weeks ago

I'm slightly confused as this only print out one error, I would expect to print all of them if they happen at the same time.

Sure, we can print them all. The error filtering and formatting now happens on the CLI side, so we can be as verbose as we like. We probably still want to skip cascading errors (i.e. "node failed because another node exited before init"), right?

Also, I don't like that the default dora start does not return anything in case of error. Can we make the attach the default behaviour because otherwise people starting with dora are going to be confused.

Fine with me! We could add a --detach argument to run it in the background and make dora start equivalent to dora start --attach. I'll prepare a separate PR for that.

In the past we used to have a grace duration information, and now we onty have SIGKILL.

The warning is still printed by the daemon, it's just not visible if the daemon is spawned in background (e.g. using dora up) or on a remote machine. The proper solution for this would be to add a new exit cause for this and communicate it back to the CLI. I'll try to implement something like this tomorrow.

phil-opp commented 2 weeks ago

I'm slightly confused as this only print out one error, I would expect to print all of them if they happen at the same time.

I pushed 94fecd6 to print all node errors that are not consequential errors (e.g. "node X exited too early").

haixuanTao commented 2 weeks ago

I think it is important to have errors when they happen in this PR as currently if a node fails but the others keep on running. The dataflow can run forever before an error message is emitted.

The only way to know there is an error is to use ctrl+c

haixuanTao commented 2 weeks ago

Could we take the times to write some test of the error message?

phil-opp commented 1 week ago

Could we take the times to write some test of the error message?

Which part do you want to test?

phil-opp commented 1 week ago

I think it is important to have errors when they happen in this PR as currently if a node fails but the others keep on running. The dataflow can run forever before an error message is emitted.

I prepared a PR with a basic log framework in https://github.com/dora-rs/dora/pull/559. This PR enables daemons to send log messages to the CLI (via the coordinator). The CLI can then print daemon warnings/error as they happen.

phil-opp commented 1 week ago

In the past we used to have a grace duration information, and now we onty have SIGKILL.

The warning is still printed by the daemon, it's just not visible if the daemon is spawned in background (e.g. using dora up) or on a remote machine. The proper solution for this would be to add a new exit cause for this and communicate it back to the CLI. I'll try to implement something like this tomorrow.

I pushed ee394361 to handle grace duration kills properly. edit: I shortened the error message in ece3f72a.

The output now is (the ^C indicates a ctrl-c event):

❯ dora start dataflow.yml --attach
01904986-3d33-702e-9ee5-bb54a3a5c3be
^CDataflow 01904986-3d33-702e-9ee5-bb54a3a5c3be failed:

Node `rust-node` failed: node was killed by dora because it didn't react to a stop message in time (SIGKILL)

dora-rs / dora

Print only first node error and report more metadata in dataflow results #552

Example Output:

TODO: