Agoric / agoric-sdk

monorepo for the Agoric Javascript smart contract platform
Apache License 2.0

test flake: sharing-service #8122

Closed turadg closed 1 year ago

turadg commented 1 year ago

Describe the bug

The sharing-service test sometimes fails spuriously.

To Reproduce

It's intermittent but here's a spuriously failing job: https://github.com/Agoric/agoric-sdk/actions/runs/5718923826/job/15495865362?pr=8102

and its re-run that passed: https://github.com/Agoric/agoric-sdk/actions/runs/5718923826/job/15497860801?pr=8102

Expected behavior

true passes and true failures

Chris-Hibbert commented 1 year ago

I don't think this package is being used for anything. I think it would be fine for it to disappear.

It's received recent enough attention to be included in .../vats/, so maybe there's more to it than I'm aware of. @dckc do you know of anything that's using this functionality?

turadg commented 1 year ago

I have a branch deleting it. Then I thought to check the docs site and it is documented: https://docs.agoric.com/reference/repl/sharingService.html

I don't know how that fits into any platform or product requirements though.

dckc commented 1 year ago

@FUDCo the test that's flaking is swingsetTests › sharingService › sharing › run sharing Demo --Two Party handoff. It looks like a pretty straightforward swingset test. I don't see any reason why it should flake. Do you?

cc @mhofman

dckc commented 1 year ago

> do you know of anything that's using this functionality?

Nothing critical.

It's a pretty nifty pattern. But that's not enough reason to spend significant maintenance effort on it at this point.

warner commented 1 year ago

We can't find any obvious reason why this should be hanging. It feels like maybe GitHub CI is having problems spawning xsnap child processes? To learn more we'd probably want to record a slogfile (which is easy enough), and arrange for CI to upload the slogfile as an "artifact" (which requires more work, and raises the question of how general we should make the mechanism).

How frequently is this one failing?

turadg commented 1 year ago

> How frequently is this one failing?

I don't know but it's about to be moot :) Removal is in the merge queue.

Chris-Hibbert commented 1 year ago

Given our current tooling, I think the only related question we could answer is "How frequently have we noticed this failing?" At ${job-1}, the CI infrastructure tracked flaky tests by recording all test failures at a fine grain, and making it easy to click through from a failing test to the recent pass/fail history for that test case.

This was very helpful in allowing us to see tests that were failing more than 5% of the time, and target those to be fixed. What would it take for us to build tooling like that?
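A minimal sketch of the core calculation such tooling would need, assuming per-test results have already been collected from CI somehow. The record shape `{ test, passed }`, the function names `failureRates` and `flakyTests`, and the `minRuns` cutoff are all hypothetical, chosen for illustration; none of this reflects an existing agoric-sdk script or the DataDog API. Only the 5% threshold comes from the comment above.

```javascript
// Hypothetical input: one record per test execution, e.g. parsed from CI logs.
// Shape assumed for illustration: { test: string, passed: boolean }

/**
 * Group run records by test name and compute each test's failure rate.
 * @param {{ test: string, passed: boolean }[]} runs
 * @returns {Map<string, { total: number, failed: number, rate: number }>}
 */
function failureRates(runs) {
  const stats = new Map();
  for (const { test, passed } of runs) {
    const entry = stats.get(test) ?? { total: 0, failed: 0, rate: 0 };
    entry.total += 1;
    if (!passed) entry.failed += 1;
    entry.rate = entry.failed / entry.total;
    stats.set(test, entry);
  }
  return stats;
}

/**
 * Return the names of tests that fail intermittently more often than
 * `threshold`. Tests with fewer than `minRuns` executions are skipped
 * (too little data), and tests that fail every time are skipped too:
 * those are broken rather than flaky.
 */
function flakyTests(runs, threshold = 0.05, minRuns = 20) {
  return [...failureRates(runs)]
    .filter(
      ([, s]) =>
        s.total >= minRuns && s.rate > threshold && s.failed < s.total,
    )
    .map(([name]) => name)
    .sort();
}
```

Feeding this the last few weeks of results would give a short list of candidates to fix first. The harder part, which this sketch does not cover, is collecting fine-grained per-test results from every CI run in the first place.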

turadg commented 1 year ago

re: Tooling, we have DataDog's Flaky Test Management enabled but it's not detecting anything. Maybe we don't have it configured properly.