fedimint / fedimint

Federated E-Cash Mint
https://fedimint.org/
MIT License
536 stars 209 forks source link

mint_client_sanity flake: losing 1msat balance #5195

Closed dpc closed 4 days ago

dpc commented 2 weeks ago
 fedimint-test-all-wasm32-unknown-unknown-ci> 00:00:43 2024-05-02T20:26:22.963236Z  INFO mint_client_sanity: receiver info info={"federation_id":"7267eec085ffc1b2c3289a043741bc39e8036930679e71c387d9a1011b6eb151","network":"regtest","meta":{"federation_name":"testfed"},"total_amount_msat":19,"total_num_notes":10,"denominations_msat":{"1":5,"2":3,"4":2}}
fedimint-test-all-wasm32-unknown-unknown-ci> 00:00:43 thread 'main' panicked at modules/fedimint-mint-tests/src/bin/mint-client-sanity.rs:75:5:
fedimint-test-all-wasm32-unknown-unknown-ci> 00:00:43 assertion `left == right` failed
fedimint-test-all-wasm32-unknown-unknown-ci> 00:00:43   left: 19
fedimint-test-all-wasm32-unknown-unknown-ci> 00:00:43  right: 20
fedimint-test-all-wasm32-unknown-unknown-ci> 00:00:43 note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
fedimint-test-all-wasm32-unknown-unknown-ci> ## FAIL: mint_client_sanity

https://github.com/fedimint/fedimint/actions/runs/8929917822/job/24528829288

It could be related to another flake we've seen where we are sometimes losing a note.

dpc commented 2 weeks ago

Ping @elsirion for awareness.

dpc commented 2 weeks ago
> retry-success.sh ./scripts/tests/mint-client-sanity.sh

reproduces on my system after a long while

dpc commented 2 weeks ago

Interesting observation: if you restore the client from passphrase it has correct balance. So it's definitely the client that lost something.

fedimint devimint> ./target-nix/debug/fedimint-cli --data-dir /tmp/devimint-2927716-596/clients/receiver-0/ info
{
  "federation_id": "45666e719bfe330b5d56d2796a5ad37e456801146fdd02e48a48ee69606775ae",
  "network": "regtest",
  "meta": {
    "federation_name": "testfed"
  },
  "total_amount_msat": 18,
  "total_num_notes": 12,
  "denominations_msat": {
    "1": 8,
    "2": 3,
    "4": 1
  }
}
fedimint devimint> ./target-nix/debug/fedimint-cli --data-dir /tmp/devimint-2927716-596/clients/receiver-restore info
{
  "federation_id": "45666e719bfe330b5d56d2796a5ad37e456801146fdd02e48a48ee69606775ae",
  "network": "regtest",
  "meta": {
    "federation_name": "testfed"
  },
  "total_amount_msat": 20,
  "total_num_notes": 13,
  "denominations_msat": {
    "1": 8,
    "2": 4,
    "4": 1
  }
}
elsirion commented 2 weeks ago

Do you have any logs from the run that failed? EDIT: was just running it myself to get some logs, is there a way to get the log output of fedimint-cli when called inside devimint?

maan2003 commented 2 weeks ago

there is Cmd::run_with_logging, you can't use out_string then

maan2003 commented 2 weeks ago

oh there is also FM_DEVIMINT_CMD_INHERIT_STDERR which will have stderr inherit

dpc commented 2 weeks ago

Yeah, I run with retry-success.sh env RUST_LOG=fm=debug,info FM_DEVIMINT_CMD_INHERIT_STDERR=1 ./scripts/tests/mint-client-sanity.sh, and also wrote a test in which a client remints over and over 1msat.

Edit: I spoke too soon and reminting 1msat is yet to repro the issue.

dpc commented 2 weeks ago

Argh. Yesterday I was able to repro 2 times, and today I'm running it for 2 hours already and nothing.

dpc commented 2 weeks ago

Ok, I think I've found something.

So one of the transactions was rejected:

2024-05-03T19:03:40.933046Z DEBUG task{name="state machine executor"}:state_machine_transition{operation_id=0f60d05db2e0ae76f7411ec01823809ee7fc339bafb7d26848608b94aad7d505}: fm::client::reactor: Finished executing state transition operation_id=0f60d05db2e0ae76f7411ec01823809ee7fc339bafb7d26848608b94aad7d505 terminal=true outcome=Inactive { dyn_state: OperationState { operation_id: OperationId(0f60d05db2e0ae76f7411ec01823809ee7fc339bafb7d26848608b94aad7d505), state: Rejected(64f6dfaeef2f6c4f50c800b888a0b52569f5bf957068580358bf0a170a7e0a5f, "The transaction had an invalid input") } }

and later the refund transactions seem to also get rejected:

2024-05-03T19:03:40.960222Z DEBUG task{name="state machine executor"}:state_machine_transition{operation_id=0f60d05db2e0ae76f7411ec01823809ee7fc339bafb7d26848608b94aad7d505}: fm::client::reactor: Executing state transition operation_id=0f60d05db2e0ae76f7411ec01823809ee7fc339bafb7d26848608b94aad7d505
2024-05-03T19:03:40.960325Z DEBUG task{name="state machine executor"}:state_machine_transition{operation_id=0f60d05db2e0ae76f7411ec01823809ee7fc339bafb7d26848608b94aad7d505}: fm::client::reactor: Finished executing state transition operation_id=0f60d05db2e0ae76f7411ec01823809ee7fc339bafb7d26848608b94aad7d505 terminal=true outcome=Inactive { dyn_state: Input(MintInputStateMachine { common: MintInputCommon { operation_id: OperationId(0f60d05db2e0ae76f7411ec01823809ee7fc339bafb7d26848608b94aad7d505), txid: 45b20314ea996f24e28df4737c3ed5467c2413dbfae2f96f124abccecbd35cd1, input_idx: 5 }, state: Error(MintInputStateError { error: "Refund transaction 64f6dfaeef2f6c4f50c800b888a0b52569f5bf957068580358bf0a170a7e0a5f was rejected" }) }) }
2024-05-03T19:03:40.960360Z DEBUG task{name="state machine executor"}: fm::client::reactor: State transition complete operation_id=0f60d05db2e0ae76f7411ec01823809ee7fc339bafb7d26848608b94aad7d505 outcome_active=false total=29
2024-05-03T19:03:40.960369Z DEBUG task{name="state machine executor"}:state_machine_transition{operation_id=0f60d05db2e0ae76f7411ec01823809ee7fc339bafb7d26848608b94aad7d505}: fm::client::reactor: Executing state transition operation_id=0f60d05db2e0ae76f7411ec01823809ee7fc339bafb7d26848608b94aad7d505
2024-05-03T19:03:40.960447Z DEBUG task{name="state machine executor"}:state_machine_transition{operation_id=0f60d05db2e0ae76f7411ec01823809ee7fc339bafb7d26848608b94aad7d505}: fm::client::reactor: Finished executing state transition operation_id=0f60d05db2e0ae76f7411ec01823809ee7fc339bafb7d26848608b94aad7d505 terminal=true outcome=Inactive { dyn_state: Input(MintInputStateMachine { common: MintInputCommon { operation_id: OperationId(0f60d05db2e0ae76f7411ec01823809ee7fc339bafb7d26848608b94aad7d505), txid: 45b20314ea996f24e28df4737c3ed5467c2413dbfae2f96f124abccecbd35cd1, input_idx: 4 }, state: Error(MintInputStateError { error: "Refund transaction dcca71e514a0bc6be10655b0b4f200164e60c0ab767c0980fd393f4a6b369eb4 was rejected" }) }) }
2024-05-03T19:03:40.960474Z DEBUG task{name="state machine executor"}: fm::client::reactor: State transition complete operation_id=0f60d05db2e0ae76f7411ec01823809ee7fc339bafb7d26848608b94aad7d505 outcome_active=false total=28
2024-05-03T19:03:40.960481Z DEBUG task{name="state machine executor"}:state_machine_transition{operation_id=0f60d05db2e0ae76f7411ec01823809ee7fc339bafb7d26848608b94aad7d505}: fm::client::reactor: Executing state transition operation_id=0f60d05db2e0ae76f7411ec01823809ee7fc339bafb7d26848608b94aad7d505
2024-05-03T19:03:40.960550Z DEBUG task{name="state machine executor"}:state_machine_transition{operation_id=0f60d05db2e0ae76f7411ec01823809ee7fc339bafb7d26848608b94aad7d505}: fm::client::reactor: Finished executing state transition operation_id=0f60d05db2e0ae76f7411ec01823809ee7fc339bafb7d26848608b94aad7d505 terminal=true outcome=Inactive { dyn_state: Input(MintInputStateMachine { common: MintInputCommon { operation_id: OperationId(0f60d05db2e0ae76f7411ec01823809ee7fc339bafb7d26848608b94aad7d505), txid: 45b20314ea996f24e28df4737c3ed5467c2413dbfae2f96f124abccecbd35cd1, input_idx: 3 }, state: Error(MintInputStateError { error: "Refund transaction 5bf1403a7b0e8023d0a6f940c59ad60904f9c81132a02baaa1310a0f8e7254ae was rejected" }) }) }
2024-05-03T19:03:40.960578Z DEBUG task{name="state machine executor"}: fm::client::reactor: State transition complete operation_id=0f60d05db2e0ae76f7411ec01823809ee7fc339bafb7d26848608b94aad7d505 outcome_active=false total=27
2024-05-03T19:03:40.960586Z DEBUG task{name="state machine executor"}:state_machine_transition{operation_id=0f60d05db2e0ae76f7411ec01823809ee7fc339bafb7d26848608b94aad7d505}: fm::client::reactor: Executing state transition operation_id=0f60d05db2e0ae76f7411ec01823809ee7fc339bafb7d26848608b94aad7d505
2024-05-03T19:03:40.960685Z DEBUG task{name="state machine executor"}:state_machine_transition{operation_id=0f60d05db2e0ae76f7411ec01823809ee7fc339bafb7d26848608b94aad7d505}: fm::client::reactor: Finished executing state transition operation_id=0f60d05db2e0ae76f7411ec01823809ee7fc339bafb7d26848608b94aad7d505 terminal=true outcome=Inactive { dyn_state: Output(MintOutputStateMachine { common: MintOutputCommon { operation_id: OperationId(0f60d05db2e0ae76f7411ec01823809ee7fc339bafb7d26848608b94aad7d505), out_point: OutPoint { txid: 5bf1403a7b0e8023d0a6f940c59ad60904f9c81132a02baaa1310a0f8e7254ae, out_idx: 0 } }, state: Aborted(MintOutputStatesAborted) }) }
2024-05-03T19:03:40.960712Z DEBUG task{name="state machine executor"}: fm::client::reactor: State transition complete operation_id=0f60d05db2e0ae76f7411ec01823809ee7fc339bafb7d26848608b94aad7d505 outcome_active=false total=26
2024-05-03T19:03:40.960719Z DEBUG task{name="state machine executor"}:state_machine_transition{operation_id=0f60d05db2e0ae76f7411ec01823809ee7fc339bafb7d26848608b94aad7d505}: fm::client::reactor: Executing state transition operation_id=0f60d05db2e0ae76f7411ec01823809ee7fc339bafb7d26848608b94aad7d505
2024-05-03T19:03:40.960825Z DEBUG task{name="state machine executor"}:state_machine_transition{operation_id=0f60d05db2e0ae76f7411ec01823809ee7fc339bafb7d26848608b94aad7d505}: fm::client::reactor: Finished executing state transition operation_id=0f60d05db2e0ae76f7411ec01823809ee7fc339bafb7d26848608b94aad7d505 terminal=true outcome=Inactive { dyn_state: Output(MintOutputStateMachine { common: MintOutputCommon { operation_id: OperationId(0f60d05db2e0ae76f7411ec01823809ee7fc339bafb7d26848608b94aad7d505), out_point: OutPoint { txid: 64f6dfaeef2f6c4f50c800b888a0b52569f5bf957068580358bf0a170a7e0a5f, out_idx: 0 } }, state: Aborted(MintOutputStatesAborted) }) }
2024-05-03T19:03:40.960851Z DEBUG task{name="state machine executor"}: fm::client::reactor: State transition complete operation_id=0f60d05db2e0ae76f7411ec01823809ee7fc339bafb7d26848608b94aad7d505 outcome_active=false total=25
2024-05-03T19:03:40.960858Z DEBUG task{name="state machine executor"}:state_machine_transition{operation_id=0f60d05db2e0ae76f7411ec01823809ee7fc339bafb7d26848608b94aad7d505}: fm::client::reactor: Executing state transition operation_id=0f60d05db2e0ae76f7411ec01823809ee7fc339bafb7d26848608b94aad7d505
2024-05-03T19:03:40.960960Z DEBUG task{name="state machine executor"}:state_machine_transition{operation_id=0f60d05db2e0ae76f7411ec01823809ee7fc339bafb7d26848608b94aad7d505}: fm::client::reactor: Finished executing state transition operation_id=0f60d05db2e0ae76f7411ec01823809ee7fc339bafb7d26848608b94aad7d505 terminal=true outcome=Inactive { dyn_state: Output(MintOutputStateMachine { common: MintOutputCommon { operation_id: OperationId(0f60d05db2e0ae76f7411ec01823809ee7fc339bafb7d26848608b94aad7d505), out_point: OutPoint { txid: 927ffd0c048e358e7ce70a623525a51a42e55a8e4611658b74a5816fd731cfe3, out_idx: 0 } }, state: Aborted(MintOutputStatesAborted) }) }
2024-05-03T19:03:40.960987Z DEBUG task{name="state machine executor"}: fm::client::reactor: State transition complete operation_id=0f60d05db2e0ae76f7411ec01823809ee7fc339bafb7d26848608b94aad7d505 outcome_active=false total=24
2024-05-03T19:03:40.960995Z DEBUG task{name="state machine executor"}:state_machine_transition{operation_id=0f60d05db2e0ae76f7411ec01823809ee7fc339bafb7d26848608b94aad7d505}: fm::client::reactor: Executing state transition operation_id=0f60d05db2e0ae76f7411ec01823809ee7fc339bafb7d26848608b94aad7d505
2024-05-03T19:03:40.961096Z DEBUG task{name="state machine executor"}:state_machine_transition{operation_id=0f60d05db2e0ae76f7411ec01823809ee7fc339bafb7d26848608b94aad7d505}: fm::client::reactor: Finished executing state transition operation_id=0f60d05db2e0ae76f7411ec01823809ee7fc339bafb7d26848608b94aad7d505 terminal=true outcome=Inactive { dyn_state: Output(MintOutputStateMachine { common: MintOutputCommon { operation_id: OperationId(0f60d05db2e0ae76f7411ec01823809ee7fc339bafb7d26848608b94aad7d505), out_point: OutPoint { txid: dcca71e514a0bc6be10655b0b4f200164e60c0ab767c0980fd393f4a6b369eb4, out_idx: 0 } }, state: Aborted(MintOutputStatesAborted) }) }
2024-05-03T19:03:40.961124Z DEBUG task{name="state machine executor"}: fm::client::reactor: State transition complete operation_id=0f60d05db2e0ae76f7411ec01823809ee7fc339bafb7d26848608b94aad7d505 outcome_active=false total=23
2024-05-03T19:03:40.961131Z DEBUG task{name="state machine executor"}:state_machine_transition{operation_id=0f60d05db2e0ae76f7411ec01823809ee7fc339bafb7d26848608b94aad7d505}: fm::client::reactor: Executing state transition operation_id=0f60d05db2e0ae76f7411ec01823809ee7fc339bafb7d26848608b94aad7d505
2024-05-03T19:03:40.961226Z DEBUG task{name="state machine executor"}:state_machine_transition{operation_id=0f60d05db2e0ae76f7411ec01823809ee7fc339bafb7d26848608b94aad7d505}: fm::client::reactor: Finished executing state transition operation_id=0f60d05db2e0ae76f7411ec01823809ee7fc339bafb7d26848608b94aad7d505 terminal=true outcome=Inactive { dyn_state: Output(MintOutputStateMachine { common: MintOutputCommon { operation_id: OperationId(0f60d05db2e0ae76f7411ec01823809ee7fc339bafb7d26848608b94aad7d505), out_point: OutPoint { txid: 057bc08733c1467c2ce734640315cd33789eae17d8011f7dace76c931ca50612, out_idx: 0 } }, state: Aborted(MintOutputStatesAborted) }) }
2024-05-03T19:03:40.961251Z DEBUG task{name="state machine executor"}: fm::client::reactor: State transition complete operation_id=0f60d05db2e0ae76f7411ec01823809ee7fc339bafb7d26848608b94aad7d505 outcome_active=false total=22
2024-05-03T19:03:40.961258Z DEBUG task{name="state machine executor"}:state_machine_transition{operation_id=0f60d05db2e0ae76f7411ec01823809ee7fc339bafb7d26848608b94aad7d505}: fm::client::reactor: Executing state transition operation_id=0f60d05db2e0ae76f7411ec01823809ee7fc339bafb7d26848608b94aad7d505
2024-05-03T19:03:40.961326Z DEBUG task{name="state machine executor"}:state_machine_transition{operation_id=0f60d05db2e0ae76f7411ec01823809ee7fc339bafb7d26848608b94aad7d505}: fm::client::reactor: Finished executing state transition operation_id=0f60d05db2e0ae76f7411ec01823809ee7fc339bafb7d26848608b94aad7d505 terminal=true outcome=Inactive { dyn_state: Input(MintInputStateMachine { common: MintInputCommon { operation_id: OperationId(0f60d05db2e0ae76f7411ec01823809ee7fc339bafb7d26848608b94aad7d505), txid: 45b20314ea996f24e28df4737c3ed5467c2413dbfae2f96f124abccecbd35cd1, input_idx: 0 }, state: Error(MintInputStateError { error: "Refund transaction e7bf13a0153efadc1f6224ad87120d05d8bfc02fc2605823e7b4fa580d62eaa9 was rejected" }) }) }
2024-05-03T19:03:40.961354Z DEBUG task{name="state machine executor"}: fm::client::reactor: State transition complete operation_id=0f60d05db2e0ae76f7411ec01823809ee7fc339bafb7d26848608b94aad7d505 outcome_active=false total=21
2024-05-03T19:03:40.961360Z DEBUG task{name="state machine executor"}:state_machine_transition{operation_id=0f60d05db2e0ae76f7411ec01823809ee7fc339bafb7d26848608b94aad7d505}: fm::client::reactor: Executing state transition operation_id=0f60d05db2e0ae76f7411ec01823809ee7fc339bafb7d26848608b94aad7d505
2024-05-03T19:03:40.961458Z DEBUG task{name="state machine executor"}:state_machine_transition{operation_id=0f60d05db2e0ae76f7411ec01823809ee7fc339bafb7d26848608b94aad7d505}: fm::client::reactor: Finished executing state transition operation_id=0f60d05db2e0ae76f7411ec01823809ee7fc339bafb7d26848608b94aad7d505 terminal=true outcome=Inactive { dyn_state: Output(MintOutputStateMachine { common: MintOutputCommon { operation_id: OperationId(0f60d05db2e0ae76f7411ec01823809ee7fc339bafb7d26848608b94aad7d505), out_point: OutPoint { txid: e7bf13a0153efadc1f6224ad87120d05d8bfc02fc2605823e7b4fa580d62eaa9, out_idx: 0 } }, state: Aborted(MintOutputStatesAborted) }) }
2024-05-03T19:03:40.961485Z DEBUG task{name="state machine executor"}: fm::client::reactor: State transition complete operation_id=0f60d05db2e0ae76f7411ec01823809ee7fc339bafb7d26848608b94aad7d505 outcome_active=false total=20
2024-05-03T19:03:40.961493Z DEBUG task{name="state machine executor"}:state_machine_transition{operation_id=0f60d05db2e0ae76f7411ec01823809ee7fc339bafb7d26848608b94aad7d505}: fm::client::reactor: Executing state transition operation_id=0f60d05db2e0ae76f7411ec01823809ee7fc339bafb7d26848608b94aad7d505
2024-05-03T19:03:40.961561Z DEBUG task{name="state machine executor"}:state_machine_transition{operation_id=0f60d05db2e0ae76f7411ec01823809ee7fc339bafb7d26848608b94aad7d505}: fm::client::reactor: Finished executing state transition operation_id=0f60d05db2e0ae76f7411ec01823809ee7fc339bafb7d26848608b94aad7d505 terminal=true outcome=Inactive { dyn_state: Input(MintInputStateMachine { common: MintInputCommon { operation_id: OperationId(0f60d05db2e0ae76f7411ec01823809ee7fc339bafb7d26848608b94aad7d505), txid: 45b20314ea996f24e28df4737c3ed5467c2413dbfae2f96f124abccecbd35cd1, input_idx: 2 }, state: Error(MintInputStateError { error: "Refund transaction 057bc08733c1467c2ce734640315cd33789eae17d8011f7dace76c931ca50612 was rejected" }) }) }
2024-05-03T19:03:40.961588Z DEBUG task{name="state machine executor"}: fm::client::reactor: State transition complete operation_id=0f60d05db2e0ae76f7411ec01823809ee7fc339bafb7d26848608b94aad7d505 outcome_active=false total=19
2024-05-03T19:03:40.961595Z DEBUG task{name="state machine executor"}:state_machine_transition{operation_id=0f60d05db2e0ae76f7411ec01823809ee7fc339bafb7d26848608b94aad7d505}: fm::client::reactor: Executing state transition operation_id=0f60d05db2e0ae76f7411ec01823809ee7fc339bafb7d26848608b94aad7d505
2024-05-03T19:03:40.961660Z DEBUG task{name="state machine executor"}:state_machine_transition{operation_id=0f60d05db2e0ae76f7411ec01823809ee7fc339bafb7d26848608b94aad7d505}: fm::client::reactor: Finished executing state transition operation_id=0f60d05db2e0ae76f7411ec01823809ee7fc339bafb7d26848608b94aad7d505 terminal=true outcome=Inactive { dyn_state: Input(MintInputStateMachine { common: MintInputCommon { operation_id: OperationId(0f60d05db2e0ae76f7411ec01823809ee7fc339bafb7d26848608b94aad7d505), txid: 45b20314ea996f24e28df4737c3ed5467c2413dbfae2f96f124abccecbd35cd1, input_idx: 1 }, state: Error(MintInputStateError { error: "Refund transaction 927ffd0c048e358e7ce70a623525a51a42e55a8e4611658b74a5816fd731cfe3 was rejected" }) }) }

The reason why we might be missing why the transaction rejection was missed on a higher level is using reissue --no-wait and then dev wait-completion.

However it's unclear why would it get rejected in the first place (possibly we're trying to spend the output too fast?), and then why would refund fail (something retried?). I'm guessing

dpc commented 2 weeks ago

With more logging #5206:

e5e23}: fm::client::reactor: Finished executing state transition operation_id=372704d93438d666c8a84d3084a15a99c111d8b6fed204b8e61c3dfe9a3e5e23 terminal=true outcome=In
active { dyn_state: OperationState { operation_id: OperationId(372704d93438d666c8a84d3084a15a99c111d8b6fed204b8e61c3dfe9a3e5e23), state: Rejected(0x6d8df69ecbdd3daf022
c10d422b071a0352891ba179df6a2c1fdf3628f147ac9, "The transaction had an invalid input: 1-The note is already spent") } }

and then:

2024-05-03T23:33:28.356161Z  WARN task{name="state machine executor"}:state_machine_transition{operation_id=372704d93438d666c8a84d3084a15a99c111d8b6fed204b8e61c3dfe9a3
e5e23}: fm::client::module::mint: Refund transaction rejected. Notes probably lost. err=The transaction had an invalid input: 1-The note is already spent refund_txid=1
1ea6763ec3feeec06ca1c0c28232410b90e8e28b23ecda42679b5a4be957a3a

OK, makes sense. Input was already spent, trying to refund it fails the same way.

Now ... why is it spent?

dpc commented 2 weeks ago

How am I going to figure out at which stage the input became spent...

It could be that receiver on spend somehow produced a faulty note.

Or reissue either spent a note and somehow left it as spendable, or produced an already spent note.

bradleystachurski commented 1 week ago

@dpc I wasn't sure if the this is the same root issue that caused this wasm flake, but feel free to close as a dupe if you think it is.

https://github.com/fedimint/fedimint/issues/5210

justinmoon commented 1 week ago

dev call: dpc and eric both can reproduce (but it takes hours to hit)

dpc commented 1 week ago

OK, so I think I have a clue:

We submit a tx 35b5c2d3d99215baae9f71...:

2024-05-07T06:13:10.251926Z  WARN fm::client: Submitting transaction txid=35b5c2d3d99215baae9f71ce0d2dbaf8d194ddb366c32a9364b4121492c89a36 transaction=Transaction { inputs: [V0(MintInputV0 { amount: Amount { msats: 1 }, note: Note {
 nonce: Nonce(PublicKey(bdc6118ac1e9eb8256280a64d0fed75055a539ca0e03f22f182094ed517b3a5b2c599be048b101f731da233b60ebe78e266ba231facc1d029ff5ccdd4fed688b)), signature: Signature(G1Affine { x: 0x070263239e67d52a687e9ea9f278e496568a645
051a1988661bd1049330f892ac2efb170be1d25190c771bd8eec1f242, y: 0x138c65c09827904b677760e9aa63827b4ac94bf7266af5a8a16bd6c9303ec9a15e8b774c5f2a317366f7779676d7e555, infinity: Choice(0) }) } })], outputs: [V0(MintOutputV0 { amount: Amou
nt { msats: 1 }, blind_nonce: BlindNonce(BlindedMessage(G1Affine { x: 0x08ad882bd7dba54672ac3f3b1959053cbea841a9138813233d888641d0430e176cd358ff31b5229b5c9fe49d62aebe9d, y: 0x08a00ad7b821de685f43f005a60966ee43129ab8b685c0762ccb962b5
77069391b9434e463d6c5bf9e34e166cbb9e1eb, infinity: Choice(0) })) })], nonce: [94, 92, 27, 204, 71, 80, 103, 154], signatures: NaiveMultisig([Signature(85f260e71f598539b74bac6b377ab3f6414d9f50bef5e1ae616359f5d2aea1fa965eb9de7e4e5f67e
7d3139f37a6de118254fbbd624eaf96f611aa4515983108)]) }

The output state machine says it succeeded:

2024-05-07T06:13:10.494432Z DEBUG task{name="state machine executor"}:state_machine_transition{operation_id=611aee7b8f9bb7772fbdcac1139d91eaa679f9776c1fe51907ebe278023cc474}: fm::client::reactor: Finished executing state transition
operation_id=611aee7b8f9bb7772fbdcac1139d91eaa679f9776c1fe51907ebe278023cc474 terminal=true outcome=Inactive { dyn_state: Output(MintOutputStateMachine { common: MintOutputCommon { operation_id: OperationId(611aee7b8f9bb7772fbdcac11
39d91eaa679f9776c1fe51907ebe278023cc474), out_point: OutPoint { txid: 0x35b5c2d3d99215baae9f71ce0d2dbaf8d194ddb366c32a9364b4121492c89a36, out_idx: 0 } }, state: Succeeded(MintOutputStatesSucceeded { amount: Amount { msats: 1 } }) })
 }

But the input state machine fails:

2024-05-07T06:13:10.494471Z DEBUG task{name="state machine executor"}: fm::client::reactor: State transition complete operation_id=611aee7b8f9bb7772fbdcac1139d91eaa679f9776c1fe51907ebe278023cc474 outcome_active=false total=7
2024-05-07T06:13:10.494521Z DEBUG task{name="state machine executor"}: fm::client::reactor: Started new active state machine. operation_id=85e9b85c0b8d6839ea54e8c019a3cd476c5c4da50997923b4f5d45d41c95694b total=8 transitions_num=2
2024-05-07T06:13:10.494527Z DEBUG task{name="state machine executor"}:state_machine_transition{operation_id=611aee7b8f9bb7772fbdcac1139d91eaa679f9776c1fe51907ebe278023cc474}: fm::client::reactor: Executing state transition operation
_id=611aee7b8f9bb7772fbdcac1139d91eaa679f9776c1fe51907ebe278023cc474
2024-05-07T06:13:10.494592Z DEBUG task{name="state machine executor"}:state_machine_transition{operation_id=611aee7b8f9bb7772fbdcac1139d91eaa679f9776c1fe51907ebe278023cc474}: fm::client::reactor: Finished executing state transition
operation_id=611aee7b8f9bb7772fbdcac1139d91eaa679f9776c1fe51907ebe278023cc474 terminal=true outcome=Inactive { dyn_state: OperationState { operation_id: OperationId(611aee7b8f9bb7772fbdcac1139d91eaa679f9776c1fe51907ebe278023cc474),
state: Rejected(0x35b5c2d3d99215baae9f71ce0d2dbaf8d194ddb366c32a9364b4121492c89a36, "The transaction 35b5c2d3d99215baae9f71ce0d2dbaf8d194ddb366c32a9364b4121492c89a36 had an invalid input at index 0: 1-The note is already spent (Nonc
e(PublicKey(bdc6118ac1e9eb8256280a64d0fed75055a539ca0e03f22f182094ed517b3a5b2c599be048b101f731da233b60ebe78e266ba231facc1d029ff5ccdd4fed688b)))") } }

Output state machines wait only for AWAIT_OUTPUT_OUTCOME_ENDPOINT. But input state machines await on on transaction_update_stream.

Which suggests that our idempotency of tx submission is borked. Or I messed up something and confused myself.

@elsirion

dpc commented 1 week ago

:thinking:

diff --git a/fedimint-server/src/net/api.rs b/fedimint-server/src/net/api.rs
index 8fc49260f1..4a1cccfcc4 100644
--- a/fedimint-server/src/net/api.rs
+++ b/fedimint-server/src/net/api.rs
@@ -114,11 +114,10 @@ impl ConsensusApi {

         debug!(%txid, "Received mint transaction");

+        // Create read-only DB tx so that the read state is consistent
+        let mut dbtx = self.db.begin_transaction_nc().await;
         // we already processed the transaction before
-        if self
-            .db
-            .begin_transaction()
-            .await
+        if dbtx
             .get_value(&AcceptedTransactionKey(txid))
             .await
             .is_some()
@@ -126,9 +125,6 @@ impl ConsensusApi {
             return Ok(txid);
         }

-        // Create read-only DB tx so that the read state is consistent
-        let mut dbtx = self.db.begin_transaction_nc().await;
-
         // We ignore any writes, as we only verify if the transaction is valid here
         dbtx.ignore_uncommitted();

Will leave it overnight and see how it went.

elsirion commented 1 week ago

Great catch! That could very well be the culprit, starting the test script on my server too.

dpc commented 1 week ago

5226 but I'm still able to repro, so either red herring or didn't help. Will keep debugging. Bummer, looked like a perfect candidate.

elsirion commented 1 week ago

Taking a step back https://github.com/fedimint/fedimint/issues/5195#issuecomment-2097657170 doesn't explain losing money imo since the output was processed successfully. There's probably another anomaly that causes the loss, what I'd expect is that output processing failed since you noticed earlier that money reappears when recovering from seed.

dpc commented 1 week ago

The good part is that I'm improving the debugability of all the things here, so it should be easier to figure out issues like this in the future.

dpc commented 1 week ago

Why does everything looks always OK?

The note 523bf09c2d77add20b0a957477e6 get created as an output to reissue of other notes in the sender:

2024-05-08T07:24:59.365868Z DEBUG task{name="state machine executor"}:state_machine_transition{operation_id=fd6e3bb89c30af2526bd513e57a16af5a2b6d08d14682b35996eb1ebd43e14a4}: fm::client::module::mint: Adding new note from transactio
n output amount=1 msat note=SpendableNote { signature: Signature(G1Affine { x: 0x0fb47ea115f2ec53a20bdc67350b31b4d6ce475da8f2440b8dd466e824ecb27c081b22eb7e6c5b08981d977af6666a31, y: 0x15ab63eb442bdc80fea2c00d78ffbc6fb83ab7d1d44b1ffd
813d88eba0bd4e204d500b61c8c105349d155b26615ebc36, infinity: Choice(0) }), spend_key: KeyPair(#19268168bfca36e0), nonce: Nonce(PublicKey(523bf09c2d77add20b0a957477e6d95a2d3a00a014a225ea31bf50be13da876f08d9e6523b895912c00c7c4ad8c32f60
c1623ea6437d0ddc427051ed76b5a87e)) }

Then it got spent as OOB as AgEEMOTxLABTAQEBr7R+oRXy7FOiC9xnNQsxtNbOR12o8kQLjdRm6CTssnwIGyLrfmxbCJgdl3r2ZmoxoRvISk5SvtAr8owWHSf6NjNOKLOI7GwA4r/WdTXO+U8=:

2024-05-08T07:24:59.694719Z DEBUG fm::client::module::mint: Spending note as oob amount=1 msat note=SpendableNote { signature: Signature(G1Affine { x: 0x0fb47ea115f2ec53a20bdc67350b31b4d6ce475da8f2440b8dd466e824ecb27c081b22eb7e6c5b0
8981d977af6666a31, y: 0x15ab63eb442bdc80fea2c00d78ffbc6fb83ab7d1d44b1ffd813d88eba0bd4e204d500b61c8c105349d155b26615ebc36, infinity: Choice(0) }), spend_key: KeyPair(#19268168bfca36e0), nonce: Nonce(PublicKey(523bf09c2d77add20b0a9574
77e6d95a2d3a00a014a225ea31bf50be13da876f08d9e6523b895912c00c7c4ad8c32f60c1623ea6437d0ddc427051ed76b5a87e)) }
...
2024-05-08T07:24:59.708802Z  INFO mint_client_sanity: sender adding to notes for the receiver info={"federation_id":"30e4f12c5b44533480e725c920356f1fa8a12d26387eeef9775fec701053b33f","network":"regtest","meta":{"federation_name":"te
stfed"},"total_amount_msat":9999988,"total_num_notes":67,"denominations_msat":{"1":2,"2":9,"4":8,"8":2,"16":5,"32":3,"64":2,"128":2,"256":2,"512":3,"1024":3,"2048":2,"4096":3,"8192":2,"16384":2,"32768":3,"65536":2,"131072":2,"262144
":2,"524288":3,"1048576":3,"2097152":2}} notes=AgEEMOTxLABTAQEBr7R+oRXy7FOiC9xnNQsxtNbOR12o8kQLjdRm6CTssnwIGyLrfmxbCJgdl3r2ZmoxoRvISk5SvtAr8owWHSf6NjNOKLOI7GwA4r/WdTXO+U8=

Then the receiver reissues it in tx 77bde4dec3fa430c67d0c34c0f030f810f57d6df89515d439534895f285c89ff :

2024-05-08T07:25:05.255727Z DEBUG fm::devimint: > fedimint-cli --data-dir=/tmp/devimint-657359-466/clients/receiver-0 reissue --no-wait AgEEMOTxLABTAQEBr7R+oRXy7FOiC9xnNQsxtNbOR12o8kQLjdRm6CTssnwIGyLrfmxbCJgdl3r2ZmoxoRvISk5SvtAr8owW
HSf6NjNOKLOI7GwA4r/WdTXO+U8=
...
2024-05-08T07:25:05.349580Z  WARN fm::client: Submitting transaction txid=77bde4dec3fa430c67d0c34c0f030f810f57d6df89515d439534895f285c89ff transaction=Transaction { inputs: [V0(MintInputV0 { amount: Amount { msats: 1 }, note: Note {
 nonce: Nonce(PublicKey(523bf09c2d77add20b0a957477e6d95a2d3a00a014a225ea31bf50be13da876f08d9e6523b895912c00c7c4ad8c32f60c1623ea6437d0ddc427051ed76b5a87e)), signature: Signature(G1Affine { x: 0x0fb47ea115f2ec53a20bdc67350b31b4d6ce475
da8f2440b8dd466e824ecb27c081b22eb7e6c5b08981d977af6666a31, y: 0x15ab63eb442bdc80fea2c00d78ffbc6fb83ab7d1d44b1ffd813d88eba0bd4e204d500b61c8c105349d155b26615ebc36, infinity: Choice(0) }) } })], outputs: [V0(MintOutputV0 { amount: Amou
nt { msats: 1 }, blind_nonce: BlindNonce(BlindedMessage(G1Affine { x: 0x0e14a6c2bdc546dd90a2956a461c88279fb092069a21198b49ca43c6ac6584e641583869480a4886270af20b1b1f2251, y: 0x058020fec2fc4f03813c3e9878585ac4954d7013ab57d0025c0c4f4a4
7202ab2404c3ced021d8956598d655fce827161, infinity: Choice(0) })) })], nonce: [203, 129, 244, 181, 108, 225, 25, 238], signatures: NaiveMultisig([Signature(07c5350d92b169bf7f69ce612ef823532e8362f1f3fc4538d37855e50f9ead8a3cad9c40198cd
3f6ffd3369ee2938af56073efac21e4f457e363f25782e2f845)]) }

but it fails:

2024-05-08T07:25:05.668673Z DEBUG task{name="state machine executor"}:state_machine_transition{operation_id=449a45443466a6ec5b0ef60d60b23b274eed9dd1f3ab2d13fef7c3d6cd5ea1b4}: fm::client::reactor: Finished executing state transition
operation_id=449a45443466a6ec5b0ef60d60b23b274eed9dd1f3ab2d13fef7c3d6cd5ea1b4 terminal=true outcome=Inactive { dyn_state: OperationState { operation_id: OperationId(449a45443466a6ec5b0ef60d60b23b274eed9dd1f3ab2d13fef7c3d6cd5ea1b4),
state: Rejected(0x77bde4dec3fa430c67d0c34c0f030f810f57d6df89515d439534895f285c89ff, "The transaction 77bde4dec3fa430c67d0c34c0f030f810f57d6df89515d439534895f285c89ff had an invalid input at index 0: 1-The note is already spent (Nonc
e(PublicKey(523bf09c2d77add20b0a957477e6d95a2d3a00a014a225ea31bf50be13da876f08d9e6523b895912c00c7c4ad8c32f60c1623ea6437d0ddc427051ed76b5a87e)))")

Adding more server-side logging to track when exactly this note got spent.

dpc commented 1 week ago

It seems the issue is server side:

First the server does the process_input check, without commiting:

2024-05-10T03:56:55.091881Z DEBUG fm::module::mint: Marking note as spent nonce=02143e021511766d280b0cacdd15a2f19dd6851aef1d21e055299b5fc9ecc8e8a9

Then the consensus processes the tx and everything commits:

2024-05-10T03:56:55.196421Z DEBUG fm::module::mint: Marking note as spent nonce=02143e021511766d280b0cacdd15a2f19dd6851aef1d21e055299b5fc9ecc8e8a9

But on the next submit_tx with the exact same transaction:

2024-05-10T03:56:55.319340Z DEBUG fm::module::mint: Marking note as spent nonce=02143e021511766d280b0cacdd15a2f19dd6851aef1d21e055299b5fc9ecc8e8a9
2024-05-10T03:56:55.319368Z DEBUG fm::module::mint: Note was already spent nonce=02143e021511766d280b0cacdd15a2f19dd6851aef1d21e055299b5fc9ecc8e8a9

which seems the code failed to detect that transaction was processed, but somehow (in the same dbtx) sees the note was marked as spent, which happened on the same write tx when tx was processed by the consensus.

Either we experience dirty-reads, or we somehow failed to write accepted tx, but marked inputs as spent, or something very weird is going on.

Edit: Oh, actually I'm not sure if the tx is exactly the same because debug output is truncated. I'm wondering if maybe client is not recalculating nonce each time or something. Edit2: Looking at the client code, I don't see a way the tx would change. Everything suggests that somehow on re-submitting, we sometimes fail to see the record marking it as accepted already.

dpc commented 6 days ago

@elsirion @m1sterc001guy

OMG.

There's nothing guaranteeing that the db view does not change from under us, since we are not running with "serializable" isolation level. Just read through https://stackoverflow.com/a/4036063/134409 .

By default we are only getting "read committed". We could update to "repeatable reads", but it doesn't really matter here.

The sequence is as follows:

For some reason I never really internalized that the db can't guarantee consistent read-view of the database in the transaction. Only after seeing it step by step in the log now, with all the needed data it clicked.

Please confirm if this makes sense, or maybe it's just too late and I lost the plot.

The simple solution here, would be to change the code to check for AcceptedTransactionKey both before processing on submit_transaction and after, if there was any error. Or wrap epoch processing in a big RWLock, where consensus item processing would lock everything out via write.

elsirion commented 4 days ago

@dpc @m1sterc001guy I think we (at least me) always operated under the assumption that RocksDB provides snapshot isolation :grimacing: Didn't we even have a whole test suite for that? Imo we should just get the closest to that we can, dealing with lower isolation levels manually isn't worth it at all.

EDIT: @dpc I don't think this is the problem. We do set the snapshot when creating the DBTX and even have a test for read repeatability:

https://github.com/fedimint/fedimint/blob/757c5ff569f481e1c22ae38e6958ec68dde36d35/fedimint-rocksdb/src/lib.rs#L452-L458

EDIT2: The one thing that we don't prevent on the DB layer is:

Fixing that would probably require serializability?

m1sterc001guy commented 4 days ago

Yeah I believe we have snapshot isolation. When creating the rocksdb transaction, we set the snapshot like this. Here was the commit.

async fn begin_transaction<'a>(&'a self) -> RocksDbTransaction {
        let mut optimistic_options = OptimisticTransactionOptions::default();
        optimistic_options.set_snapshot(true);

        let mut write_options = WriteOptions::default();
        // Make sure we never lose data on unclean shutdown
        write_options.set_sync(true);

        let mut rocksdb_tx =
            RocksDbTransaction(self.0.transaction_opt(&write_options, &optimistic_options));
        rocksdb_tx
            .set_tx_savepoint()
            .await
            .expect("setting tx savepoint failed");

        rocksdb_tx
    }

As Eric pointed out, we have a set of tests for detecting the other anomalies present at lower isolation levels.

Fixing that would probably require serializability?

Yes I believe so. Snapshot isolation provides a read-consistent view of the database at the time of the transaction creation. (here's a good example post about the differences between SI and Serializable) Since in your example above, the data that tx1 is operating over changes during the transaction, it will not have an up to date version of key1 at step 4.

I've been moving this week so I haven't been able to keep up with this whole thread, but do we think our isolation expectations have changed because of this bug? Perhaps we can discuss on the dev call.

dpc commented 4 days ago

BTW. https://github.com/fedimint/fedimint/issues/5271 is another place confirm something is of. Otherwise - how could you in the same db tx selected a note, and then fail to return it when you're deleting it.

Check https://github.com/fedimint/fedimint/actions/runs/9041274494/job/24860479614?pr=5259

 00:08:31     2024-05-12T01:46:37.008223Z DEBUG fm::client::module::mint: Will consolidate excessive notes note_num=8 denominations_msats=[32768, 32768, 32768, 65536, 65536, 65536, 65536, 65536]

So consolidation code selected notes

00:08:31     2024-05-12T01:46:37.008513Z DEBUG fm::client::module::mint: Consolidating note amount=32768 msat note=029dba9ae7bc2ebeb94fc015e202c6767d7d91b44fff78490566fd3c948ff9bb4c

This repeats 8 times for each note being deleted.

00:08:31     2024-05-12T01:46:37.033450Z DEBUG timing: Operation timing name=autocommit - commit_tx duration_ms=0

And it commited.

But then right afterwards the commit there's another task that selected the same notes:

00:08:31     2024-05-12T01:46:37.034757Z DEBUG spawn{task="sm-executor"}:sm_transition{operation_id=7b85301d0ad8e2c4c7a384d23f1432bebbb7f44d2d18ad51f5f5a6c52e7157b8}: fm::client::module::mint: Will consolidate excessive notes note_num=8 denominations_msats=[32768, 32768, 32768, 65536, 65536, 65536, 65536, 65536]

which I believe is because the that dbtx was opened before the previous one commited, as these functions are parts of bigger and longer dbtx-es.

and it inevitably fails on trying to delete a missing note:

00:08:31     2024-05-12T01:46:37.034886Z DEBUG spawn{task="sm-executor"}:sm_transition{operation_id=7b85301d0ad8e2c4c7a384d23f1432bebbb7f44d2d18ad51f5f5a6c52e7157b8}: fm::client::module::mint: Consolidating note amount=32768 msat note=02207c5f48aee14a24c5ed153765ce7796d48301c99b23ddfa238069c5aa16cf89
00:08:31     thread 'tokio-runtime-worker' panicked at modules/fedimint-mint-client/src/lib.rs:1577:10:
00:08:31     Must deleted existing spendable note00:08:31     2024-05-12T01:46:37.034886Z ```

As you can see one of the consolidations was part of a sm_transition, the other one was not.

But if everything was working the second transaction should either not see these notes anymore, or still see them an successfully deleted them, right?

@m1sterc001guy @elsirion

m1sterc001guy commented 4 days ago

Ah I see so, if I understand correctly, dbtx2 starts before dbtx1 commits, so it still sees the notes as available, tries to delete them, and fails since they're already deleted.

This is possible with Snapshot isolation, since SI only guarantees that there will be a consistent read at the time of transaction creation. SI is mainly useful in systems that don't have long running transactions that will conflict with each other (its also why we use optimistic transactions, since they avoid locking at the db layer).

Serializable would prevent this, it would essentially block the second transaction from making progress at all until dbtx1 is committed, since the first transaction would acquire a lock over the page where the note is held.

Last time I looked at this I think we had determined SI would be a sufficient isolation level, but maybe with this note consolidation we should reconsider this? Are we doing these consolidations in parallel with other spends?

dpc commented 4 days ago

Calling set_snapshot will not affect the version of Data returned by get methods.

https://docs.rs/rocksdb/latest/rocksdb/struct.TransactionOptions.html#method.set_snapshot

This looks sus to me.

@m1sterc001guy

dpc commented 4 days ago

We've found the issue with @m1sterc001guy ! :partying_face: :tada: :taco:

dpc commented 4 days ago

@m1sterc001guy @elsirion https://github.com/fedimint/fedimint/pull/5279