fedimint / fedimint

Federated E-Cash Mint
https://fedimint.org/
MIT License

Intermittent CI failure tracking issue #222

Closed justinmoon closed 1 year ago

justinmoon commented 1 year ago

We've had a handful of CI failures that seem to disappear when you re-run them. Making an issue to keep track of them.

jkitman commented 1 year ago

Basically this occurs when the listfunds command doesn't return the updated balances right away. It started when we added lightningd --dev-fast-gossip.

I added a sleep and FIXME to that line of code. I believe this will make the CI reliable, but hopefully we can create a better long-term fix.

I was debating removing this assertion; maybe there is a better way of verifying that actual LN funds have moved.

elsirion commented 1 year ago

Maybe just wait/poll until it has the expected value and time out otherwise?
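A minimal sketch of that wait/poll idea, using only the standard library. The helper name `poll_until` and its parameters are hypothetical and not taken from the fedimint code base; the closure stands in for whatever query (for example a `listfunds` call) the test needs to repeat.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Calls `check` every `interval` until it returns `Some(value)` or
/// `timeout` has elapsed. Returns `None` on timeout so the caller can
/// fail the test with a clear message instead of a fixed sleep.
fn poll_until<T>(
    timeout: Duration,
    interval: Duration,
    mut check: impl FnMut() -> Option<T>,
) -> Option<T> {
    let deadline = Instant::now() + timeout;
    loop {
        // Always check at least once, even with a zero timeout.
        if let Some(value) = check() {
            return Some(value);
        }
        if Instant::now() >= deadline {
            return None;
        }
        sleep(interval);
    }
}

/// Hypothetical usage: `query_balance` is a placeholder for the real call
/// that reads the node's balance.
fn wait_for_balance(expected: u64, mut query_balance: impl FnMut() -> u64) -> bool {
    poll_until(Duration::from_secs(10), Duration::from_millis(50), || {
        (query_balance() == expected).then_some(())
    })
    .is_some()
}
```

Compared with a fixed sleep, this returns as soon as the value shows up and only costs the full timeout when something is actually wrong.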

justinmoon commented 1 year ago

Here's another one https://github.com/fedimint/minimint/runs/7290083028?check_suite_focus=true#step:7:2246

NicolaLS commented 1 year ago

> Here's another one https://github.com/fedimint/minimint/runs/7290083028?check_suite_focus=true#step:7:2246

I wanted to post the same here. I get this a lot on my local setup, especially on this branch.

elsirion commented 1 year ago

https://github.com/fedimint/minimint/runs/7360992558?check_suite_focus=true

NicolaLS commented 1 year ago

https://github.com/fedimint/fedimint/runs/7881383484?check_suite_focus=true

NicolaLS commented 1 year ago

https://github.com/fedimint/fedimint/runs/7881960910?check_suite_focus=true

justinmoon commented 1 year ago

> https://github.com/fedimint/fedimint/runs/7881960910?check_suite_focus=true

Port 8080 already bound, again https://github.com/fedimint/fedimint/runs/7881960910?check_suite_focus=true#step:5:6101

justinmoon commented 1 year ago

https://github.com/fedimint/fedimint/actions/runs/3049065296/jobs/4914765743

jkitman commented 1 year ago

@justinmoon Did using port picker resolve these issues?

justinmoon commented 1 year ago

I don't know because it only happens very infrequently. Do you want to just add it? https://github.com/fedimint/fedimint/pull/427

NicolaLS commented 1 year ago

@jkitman could this be the same issue I had (which got fixed by running the tests with --release)? I also got the port-already-in-use error (for some reason) when that happened.

jkitman commented 1 year ago

@NicolaLS I don't think so, because we removed the timeouts, which were the issue you were hitting. Still, it shouldn't hurt to add port-picker to the integration tests as well...
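For context on how "port already bound" errors are usually avoided: one common approach is to let the operating system pick a free port by binding to port 0. The sketch below uses only the standard library and is a stand-in for illustration, not the port picker discussed in this thread; `bind_free_port` is a hypothetical helper name.

```rust
use std::net::{SocketAddr, TcpListener};

/// Binds to port 0 on localhost so the OS assigns a currently unused port,
/// and returns the listener together with the address it was given.
fn bind_free_port() -> std::io::Result<(TcpListener, SocketAddr)> {
    let listener = TcpListener::bind("127.0.0.1:0")?;
    let addr = listener.local_addr()?;
    Ok((listener, addr))
}
```

Handing the still-open listener to the server under test, rather than only the port number, avoids the window in which another process could grab the port between picking it and binding it.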

elsirion commented 1 year ago

https://github.com/fedimint/fedimint/actions/runs/3112194306/jobs/5045353221#step:11:6632

justinmoon commented 1 year ago

> https://github.com/fedimint/fedimint/actions/runs/3112194306/jobs/5045353221#step:11:6632

Exact same error here https://github.com/fedimint/fedimint/actions/runs/3123610337/jobs/5066377856#step:11:12449

In this case re-running CI fixed it. I'm inclined to believe this is just core-lightning being flaky.

justinmoon commented 1 year ago

Asked about this in core-lightning discord https://discord.com/channels/899980449231814676/899989729183940629/1023743433543782470

Rusty's response:

Yes, you are mining too fast, and we're freaking out. Generally this means your tests need to make sure everything is fully settled (listpeers channels has an empty htlcs array) before mining more. Or, make sure nodes have digested current blocks before making payment (see getinfo blockheight). Finally, note the dev-bitcoin-poll option for developer builds, which can reduce the 60 second polling interval
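The two checks from that response could be expressed as test-side gates, roughly as sketched below. `NodeState` is a hypothetical, simplified struct that a test would fill in from the `listpeers` and `getinfo` output; it is not a core-lightning or fedimint type, and the function names are made up for illustration.

```rust
/// Hypothetical snapshot of one lightning node, loosely modelled on the
/// `htlcs` arrays from `listpeers` and `blockheight` from `getinfo`.
struct NodeState {
    /// Block height the node reports having processed.
    blockheight: u64,
    /// Number of in-flight HTLCs on each of the node's channels.
    pending_htlcs_per_channel: Vec<usize>,
}

/// Gate before mining more blocks: every channel on every node must have
/// an empty HTLC list.
fn htlcs_settled(nodes: &[NodeState]) -> bool {
    nodes
        .iter()
        .all(|n| n.pending_htlcs_per_channel.iter().all(|&count| count == 0))
}

/// Gate before making a payment: every node must have digested the
/// current chain tip.
fn caught_up(nodes: &[NodeState], chain_tip: u64) -> bool {
    nodes.iter().all(|n| n.blockheight >= chain_tip)
}
```

A test would poll these conditions with a timeout rather than checking them once, since both become true only after the nodes have had time to process.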

justinmoon commented 1 year ago

https://github.com/fedimint/fedimint/actions/runs/3165116609/jobs/5153897116

elsirion commented 1 year ago

Client loses money. https://github.com/fedimint/fedimint/actions/runs/3543886489/jobs/5950840882

jkitman commented 1 year ago

> Client loses money. https://github.com/fedimint/fedimint/actions/runs/3543886489/jobs/5950840882

@elsirion Any insight into how this happens or whether it's a real bug?

elsirion commented 1 year ago

> > Client loses money. https://github.com/fedimint/fedimint/actions/runs/3543886489/jobs/5950840882
>
> @elsirion Any insight into how this happens or whether it's a real bug?

The spend_ecash function seems faulty: let mut tx = TransactionBuilder::default(); should be inside the loop imo, so we don't issue the same e-cash token twice if the DB tx fails for some reason (why it would, idk):

https://github.com/fedimint/fedimint/blob/13a0e24e902f63fae682c4e4af6e02069621086a/client/client-lib/src/lib.rs#L460-L499

EDIT: on second thought, the tx should fail because of the input-side double spend in that case …

m1sterc001guy commented 1 year ago

Hitting a non-deterministic issue with lightning_gateway_pays_internal_invoice

https://github.com/fedimint/fedimint/actions/runs/3808388883/jobs/6478925802

Looks like something is timing out

last 10 log lines:

---- lightning_gateway_pays_internal_invoice stdout ----
thread 'lightning_gateway_pays_internal_invoice' panicked at 'called Result::unwrap() on an Err value: ClientError(MintApiError(Timeout))', integrationtests/tests/tests.rs:456:14

failures: lightning_gateway_pays_internal_invoice

test result: FAILED. 27 passed; 1 failed; 0 ignored; 0 measured; 0 filtered out; finished in 42.64s

justinmoon commented 1 year ago

Just saw ^^ in CI https://github.com/fedimint/fedimint/actions/runs/3870489521/jobs/6597452580

justinmoon commented 1 year ago

lightning_gateway_pays_internal_invoice just hung and caused timeout https://github.com/fedimint/fedimint/actions/runs/3934580638/jobs/6729458243

maan2003 commented 1 year ago

thread 'lightning_gateway_claims_refund_for_internal_invoice' panicked at 'called Result::unwrap() on an Err value: NoGateways' https://github.com/fedimint/fedimint/actions/runs/3941469567/jobs/6743919007

maan2003 commented 1 year ago

> lightning_gateway_pays_internal_invoice just hung and caused timeout https://github.com/fedimint/fedimint/actions/runs/3934580638/jobs/6729458243

This is very common; it happened to me twice today.

elsirion commented 1 year ago

https://github.com/fedimint/fedimint/actions/runs/3953265674/attempts/1

maan2003 commented 1 year ago

https://github.com/fedimint/fedimint/actions/runs/3998730908/jobs/6861744016

elsirion commented 1 year ago

https://github.com/fedimint/fedimint/actions/runs/4087609870/jobs/7048326190#step:5:7789

justinmoon commented 1 year ago

https://github.com/fedimint/fedimint/actions/runs/4165714395/jobs/7209097454

maan2003 commented 1 year ago

https://github.com/fedimint/fedimint/actions/runs/4304534238/jobs/7505698626

jkitman commented 1 year ago

> https://github.com/fedimint/fedimint/actions/runs/4304534238/jobs/7505698626

@Maan2003 I think this is the same issue we patched for bitcoin rpc, now appearing in electrum since we have tests for it. See https://github.com/fedimint/fedimint/pull/675/files#diff-fbf8716fe493eff59dfa648a1b7b393abd944c21095df0947b597a068158bedaR200-R203

justinmoon commented 1 year ago

https://github.com/fedimint/fedimint/actions/runs/4548773748/jobs/8020139933 (can't get a link to that line for some reason)

Should fix that panic.

maan2003 commented 1 year ago

https://github.com/fedimint/fedimint/actions/runs/4617485262/jobs/8163798744?pr=2125#step:10:40716 rust test fail

jkitman commented 1 year ago

https://github.com/fedimint/fedimint/actions/runs/4630289726/jobs/8191671471?pr=2145


```
fedimint-test-all> 00:03:45 thread 'lightning_gateway_pays_outgoing_invoice' panicked at 'failed to get info: Status { code: Unknown, message: "unable to sync PoV of the wallet with current best block in the main chain: status code: 503, response: \"Work queue depth exceeded\"", metadata: MetadataMap { headers: {"content-type": "application/grpc"} }, source: None }', integrationtests/tests/fixtures/real.rs:168:14
```

justinmoon commented 1 year ago

Also just saw ^^ https://github.com/fedimint/fedimint/actions/runs/4635132634/jobs/8201925467

dpc commented 1 year ago

I've recently fixed a bunch of flakes, and I think it's better now (though there's still something failing every now and then).

This issue is not very useful if people keep reporting different problems in it over the course of a year.

In the future, when you find a flake, please find the most plausible/distinct root error or failure message and open a new issue for each different type of flake.