IntersectMBO / cardano-node

The core component that is used to participate in a Cardano decentralised blockchain.
https://cardano.org
Apache License 2.0
3.06k stars 720 forks source link

[Windows] [vasil-testnet-v1] the node stops responding to `query tip` and the socket is not found anymore #3906

Open dorin100 opened 2 years ago

dorin100 commented 2 years ago

Internal/External Internal if an IOHK staff member.

Area node syncing on windows

Summary When using tag vasil-testnet-v1, only on Windows, in 3 out of 4 cases, the node stops responding to query tip and the socket is not found anymore.

Steps to reproduce Steps to reproduce the behavior:

  1. Start node on windows using tag vasil-testnet-v1
  2. query the tip of the node each 10s until the node is fully synced (syncProgress = 100)
    • https://github.com/input-output-hk/cardano-node-tests/runs/6544279877?check_suite_focus=true
    • waiting for the node to sync actual_era : byron - actual_epoch: 0 - actual_block: 309 - actual_slot : 335 - syncProgress: 0 actual_era : shelley - actual_epoch: 282 - actual_block: 101157 - actual_slot : 2022499 - syncProgress: 3 actual_era : shelley - actual_epoch: 583 - actual_block: 206081 - actual_slot : 4185025 - syncProgress: 7 actual_era : shelley - actual_epoch: 897 - actual_block: 316433 - actual_slot : 6447080 - syncProgress: 11 actual_era : shelley - actual_epoch: 1225 - actual_block: 423988 - actual_slot : 8806800 - syncProgress: 15 actual_era : allegra - actual_epoch: 1539 - actual_block: 532758 - actual_slot : 11071320 - syncProgress: 19 actual_era : allegra - actual_epoch: 1907 - actual_block: 639587 - actual_slot : 13720235 - syncProgress: 24 actual_era : mary - actual_epoch: 2206 - actual_block: 746268 - actual_slot : 15871297 - syncProgress: 28 actual_era : mary - actual_epoch: 2556 - actual_block: 852273 - actual_slot : 18391115 - syncProgress: 33 actual_era : mary - actual_epoch: 2858 - actual_block: 957501 - actual_slot : 20564587 - syncProgress: 37 actual_era : mary - actual_epoch: 3155 - actual_block: 1063253 - actual_slot : 22707147 - syncProgress: 40 === Waiting 60s before retrying to get the tip again - 0 !!!ERROR: command cardano-cli.exe query tip --testnet-magic 3 return with error (code 1): b'MuxError (MuxIOException withIODataPtr (readHandle): resource vanished (The pipe has been ended.)) "(readHandle errored)"\r\n\r\ncardano-cli.exe: MuxIOException withIODataPtr (readHandle): resource vanished (The pipe has been ended.) "(readHandle errored)"' === Waiting 60s before retrying to get the tip again - 1 !!!ERROR: command cardano-cli.exe query tip --testnet-magic 3 return with error (code 1): b'\r\ncardano-cli.exe: CreateFile "\\\\.\\pipe\\cardano-node": does not exist (The system cannot find the file specified.)' === Waiting 60s before retrying to get the tip again - 2 !!!ERROR: command cardano-cli.exe query tip --testnet-magic 3 return with error (code 1): b'\r\ncardano-cli.exe: CreateFile "\\\\.\\pipe\\cardano-node": does not exist (The system cannot find the file specified.)'

Expected behavior The node should successfully sync and query tip should not return errors.

System info (please complete the following information):

tbrahmbhatt commented 2 years ago

Hi, I can see this is assigned to @coot in the JIRA ticket, can someone please assign to him on GH too - I dont have the permissions

coot commented 2 years ago

@dorin100 does the node stops? Could make sure the following traces are turn on:

if they are where can I find a the log file?

dorin100 commented 2 years ago

@coot the node does not seem to stop.

The GH jobs that ran these tests can be found here (both are using node tag vasil-testnet-v1):

The logs for each job can be found at the above links in the Artifacts section.

All the jobs are run as default with:

TraceLocalHandshake = false
TraceErrorPolicy = true
TraceLocalErrorPolicy = true

Here all windows jobs are ok but we are using tag 1.34.1 (last released one) - https://github.com/input-output-hk/cardano-node-tests/actions/runs/2367105131

FYI: the below flags were not set for any of those sync tests (but only the windows ones are failing)

"TestEnableDevelopmentNetworkProtocols": true,
"TestEnableDevelopmentHardForkEras": true,
coot commented 2 years ago

~@dorin100 when I look at the recent logs node_logs_shelley_qa_windows-latest from this run. I can only see failing Plutus scripts~

I looked at the wrong logs.

coot commented 2 years ago

@dorin100 my Windows node synced, and I wasn't able to reproduce the issue: every 10s a cardano-cli was able to query the tip of the node. Could you please enable the following tracers and configure them with the given minimal severities:

Tracer Severity
TraceLocalMux Info
TraceLocalHandshake Info
TraceLocalErrorPolicy Debug
TraceErrorPolicy Debug
dorin100 commented 2 years ago

@coot - the Window job failed again but this time I used the above tracers. The results can be found here - https://github.com/input-output-hk/cardano-node-tests/actions/runs/2495501357 (look for the [node_logs_staging_windows-latest](https://github.com/input-output-hk/cardano-node-tests/suites/6926637872/artifacts/269779719))

dorin100 commented 2 years ago

this was reproducible in 1 out of 2 runs also with 1.35.0-rc3 --> job: https://github.com/input-output-hk/cardano-node-tests/actions/runs/2501998632

coot commented 2 years ago

just for completeness sake: by reading the logs I didn't found anything wrong with the node. A few seconds before cardano-node shutdown the it accepted the last local connection, while cardano-cli was failing to connect very 60s. We don't have time stamps when cardano-cli was trying to create these connections, but it seems as it's trying to do that after cardano-node terminated.

coot commented 2 years ago

We discovered:

DiffusionErrored (ExceptionInLinkedThread "ThreadId 34" <stderr>: hPutChar: resource vanished (Broken pipe)

which indicates that there's something wrong with the logging system.

dorin100 commented 2 years ago

The issue is not reproducing when we run the node with "TurnOnLogging": false

PS: the above error might indicate that the io error was propagated to the top level and the node shutdown (and it should not shut down).

The above error can be found in the logs here:

deepfire commented 2 years ago

So, this is legacy tracing system.

We can fix this, by migrating Daedalus to the new tracing system -- which I believe is ready to accomodate Daedalus.