atsign-foundation / at_client_sdk

The Dart implementation of atSDK used for implementing Atsign's technology into other software
https://pub.dev/publishers/atsign.org/packages
BSD 3-Clause "New" or "Revised" License
1.46k stars 31 forks source link

Unable to perform initial sync in busy network (read slow) conditions. #490

Closed cconstab closed 2 years ago

cconstab commented 2 years ago

Describe the bug If network between at_client and the secondary is busy/congested then bad things can happen.

To Reproduce Steps to reproduce the behavior:

  1. First run an app (I used sshnp)
  2. Then put load on the network (download a game !)
  3. And then delete ~/.sshnp
  4. Then try and send a sshnp command
  5. Sometimes the client errors an errors then crashes out as below or worse still hangs
  6. Once initial sync has been done things seem to behave OK even if congested network.

Expected behavior Either a clear error message and retries or protocol should handle thin pipes maybe detect the lower bandwidth and chunk/wait accordingly.

This is VERY important to IoT hence P2 in my mind

Error trace

cconstab@cally:~/Colin-snippets/ssh_control$ rm -rf ~/.sshnp/
cconstab@cally:~/Colin-snippets/ssh_control$ dart run bin/ssh_trigger.dart -f @lostblue64 -t @blueamateurbinding  --host cally.lan  -l 2001 -d iot_device -v 
initializing storage
INFO|2022-04-21 21:13:58.527265|HiveBase|commit_log_f8eaae02d2eec6a7eac8878c3720a084087492f38dec6497e84374a5649663b0 initialized successfully 

getHiveSecretFromFile no file found
INFO|2022-04-21 21:13:58.615963|HiveBase|f8eaae02d2eec6a7eac8878c3720a084087492f38dec6497e84374a5649663b0 initialized successfully 

INFO|2022-04-21 21:14:00.481447|Monitor|monitor started for @lostblue64 with last notification time: null 

INFO|2022-04-21 21:14:02.125748|AtLookup|auth success 

INFO|2022-04-21 21:14:02.193404| sshnp |Waiting for initial sync 

INFO|2022-04-21 21:14:12.593776|AtLookup|auth success 

INFO|2022-04-21 21:14:12.750882|SyncService|Returning the serverCommitId 8551 

INFO|2022-04-21 21:14:12.895854|SyncService|Returning the serverCommitId 8551 

SEVERE|2022-04-21 21:14:33.238536|AtLookup|Exception in sending to server, Exception: No response after 10000 millis from remote secondary 

SEVERE|2022-04-21 21:14:33.239649|AtLookup|Error in remote verb execution Exception: No response after 10000 millis from remote secondary 

SEVERE|2022-04-21 21:14:33.242616|SyncService|Exception in sync ba0f881b-42f6-407f-b05a-68fbe583c4c5. Reason AT0023: Timeout waiting for response 

SEVERE|2022-04-21 21:14:37.486017|AtLookup|Exception in sending to server, ErrorCode: AT0014 - Exception: Unexpected response found 

SEVERE|2022-04-21 21:14:37.487527|AtLookup|Error in remote verb execution ErrorCode: AT0014 - Exception: Unexpected response found 

SEVERE|2022-04-21 21:14:37.489129|SyncUtil|Exception occurred in processing stats verb AT0014 - Unknown AtClient exception 

INFO|2022-04-21 21:14:37.566212|SyncService|Returning the serverCommitId 8551 

INFO|2022-04-21 21:14:37.712848|SyncService|Returning the serverCommitId 8551 

SEVERE|2022-04-21 21:14:48.844866|AtLookup|Exception in sending to server, Exception: No response after 10000 millis from remote secondary 

SEVERE|2022-04-21 21:14:48.845551|AtLookup|Error in remote verb execution Exception: No response after 10000 millis from remote secondary 

SEVERE|2022-04-21 21:14:48.845940|SyncService|Exception in sync ba0f881b-42f6-407f-b05a-68fbe583c4c5. Reason AT0023: Timeout waiting for response 

SEVERE|2022-04-21 21:14:51.203853|AtLookup|Exception in sending to server, ErrorCode: AT0014 - Exception: Unexpected response found 

SEVERE|2022-04-21 21:14:51.205286|AtLookup|Error in remote verb execution ErrorCode: AT0014 - Exception: Unexpected response found 

SEVERE|2022-04-21 21:14:51.206067|SyncUtil|Exception occurred in processing stats verb AT0014 - Unknown AtClient exception 

SEVERE|2022-04-21 21:15:02.305869|AtLookup|Exception in sending to server, Exception: No response after 10000 millis from remote secondary 

SEVERE|2022-04-21 21:15:02.307325|AtLookup|Error in remote verb execution Exception: No response after 10000 millis from remote secondary 

SEVERE|2022-04-21 21:15:02.307865|SyncUtil|Exception occurred in processing stats verb AT0023 - Timeout waiting for response 

SEVERE|2022-04-21 21:15:03.879774|AtLookup|Exception in sending to server, ErrorCode: AT0014 - Exception: Unexpected response found 

SEVERE|2022-04-21 21:15:03.880650|AtLookup|Error in remote verb execution ErrorCode: AT0014 - Exception: Unexpected response found 

SEVERE|2022-04-21 21:15:03.881022|SyncUtil|Exception occurred in processing stats verb AT0014 - Unknown AtClient exception 

SEVERE|2022-04-21 21:15:05.091887|JsonDecode|Failed to decode jsonString : [{"id":"3","name":"lastCommitID","value":"8551"}]
@lostblue64@[{"id":"3","name":"lastCommitID","value":"8551"}] Error : FormatException: Unexpected character (at line 2, character 1)
@lostblue64@[{"id":"3","name":"lastCommitID","value":"8551"}]
^

INFO|2022-04-21 21:15:10.089837|SyncService|Returning the serverCommitId 8551 

INFO|2022-04-21 21:15:10.093898|SyncService|Returning the serverCommitId 8551 

outbound finish handler called
Unhandled exception:
SocketException: Connection reset by peer (OS Error: Connection reset by peer, errno = 104), address = aad7c760-29f9-5792-920f-9e1a298b6e59.swarm0001.atsign.zone, port = 56124

And a hung process (I had to hit cntrl C)

LOTS OF KEY ETC LOGGED cut off to save your eyes.... - colin
FsVkonmTPlzR9OAsfRXJ1zEN8Kg==","metadata":{"createdAt":"2021-03-09 00:44:41.695","updatedAt":"2021-03-09 00:44:41.695"},"commitId":55,"operation":"+"}] Error : FormatException: Unexpected character (at character 1)
data:[{"atKey":"public:signing_publickey@lostblue64","value":"MIIBIjANBgkqh...
^

SEVERE|2022-04-21 21:24:03.288905|JsonDecode|Failed to decode jsonString : [{"id":"3","name":"lastCommitID","value":"8552"}]
@lostblue64@[{"id":"3","name":"lastCommitID","value":"8552"}] Error : FormatException: Unexpected character (at line 2, character 1)
@lostblue64@[{"id":"3","name":"lastCommitID","value":"8552"}]
^

INFO|2022-04-21 21:24:05.069455|SyncService|Returning the serverCommitId 8552 

INFO|2022-04-21 21:24:05.191757|SyncService|Returning the serverCommitId 8552 

outbound finish handler called
^CINFO|2022-04-21 21:35:19.328088| sshnp |Tidying up files 

Then after I remove the network load things work fine

cconstab@cally:~/Colin-snippets/ssh_control$ rm -rf ~/.sshnp/
cconstab@cally:~/Colin-snippets/ssh_control$ dart run bin/ssh_trigger.dart -f @lostblue64 -t @blueamateurbinding  --host cally.lan  -l 2001 -d iot_device -v 
initializing storage
INFO|2022-04-21 21:35:56.981570|HiveBase|commit_log_f8eaae02d2eec6a7eac8878c3720a084087492f38dec6497e84374a5649663b0 initialized successfully 

getHiveSecretFromFile no file found
INFO|2022-04-21 21:35:57.050448|HiveBase|f8eaae02d2eec6a7eac8878c3720a084087492f38dec6497e84374a5649663b0 initialized successfully 

INFO|2022-04-21 21:35:58.725292|Monitor|monitor started for @lostblue64 with last notification time: null 

INFO|2022-04-21 21:36:00.155807|AtLookup|auth success 

INFO|2022-04-21 21:36:00.221971| sshnp |Waiting for initial sync 

INFO|2022-04-21 21:36:06.524759|AtLookup|auth success 

INFO|2022-04-21 21:36:06.659035|SyncService|Returning the serverCommitId 8553 

INFO|2022-04-21 21:36:06.802858|SyncService|Returning the serverCommitId 8553 

INFO|2022-04-21 21:36:13.287375|SyncService|Inside syncComplete  SyncRequestSource.app  : Closure: (dynamic) => void 

INFO|2022-04-21 21:36:13.288274|SyncService|Sending result to onDone callback 

INFO|2022-04-21 21:36:13.289451| sshnp |syncResult.syncStatus: SyncStatus.success 

INFO|2022-04-21 21:36:13.291381| sshnp |syncResult.lastSyncedOn 2022-04-21 21:36:13.287203Z 

INFO|2022-04-21 21:36:21.856411| sshnp |SUCCESS:id: 599b8898-4798-4d4c-b354-83c11412599e status: NotificationStatusEnum.delivered 

INFO|2022-04-21 21:36:26.336031| sshnp |SUCCESS:id: b53e928f-a4ab-4fdb-a4c8-3cfa8992650f status: NotificationStatusEnum.delivered 2001 22 cconstab cally.lan  

INFO|2022-04-21 21:36:26.338336| sshnp |Tidying up files 

ssh -p 2001 cconstab@localhost
cconstab@cally:~/Colin-snippets/ssh_control$ 

Were you using an @‎application when the bug was found?

Additional context Add any other context about the problem here.

cconstab commented 2 years ago

To create a bad network locally in the past i have used

https://m0n0.ch/wall/index.php

I will see if I can provide a cheatsheet to setting that up as a local VM for whoever wants to look at this tricky issue

sitaram-kalluri commented 2 years ago

Made progress on this issue:

  1. Under slow network conditions, intermittently the _syncInProgress remains in true state even after the AtTimeoutException is thrown by the server. This leads to a new sync process not syncing keys (since the _syncInProgress remained in true, it assumes that previous sync is still running).

Attaching the error snippet

OutboundMessageListener._read (outbound_message_listener.dart:66)
OutboundMessageListener._read.<anonymous closure> (outbound_message_listener.dart:82)
_rootRunUnary (zone.dart:1434)
<asynchronous gap>
AtLookupImpl._process (at_lookup_impl.dart:457)
<asynchronous gap>
AtLookupImpl._stats (at_lookup_impl.dart:320)
<asynchronous gap>
AtLookupImpl.executeVerb (at_lookup_impl.dart:246)
<asynchronous gap>
RemoteSecondary.executeVerb (remote_secondary.dart:39)
<asynchronous gap>
SyncUtil.getLatestServerCommitId (sync_util.dart:88)
<asynchronous gap>
SyncServiceImpl._getServerCommitId (sync_service_impl.dart:544)
<asynchronous gap>
SyncServiceImpl._isInSync (sync_service_impl.dart:525)
<asynchronous gap>
SyncServiceImpl._processSyncRequests (sync_service_impl.dart:164)
<asynchronous gap>
SyncServiceImpl._scheduleSyncRun.<anonymous closure> (sync_service_impl.dart:80)
<asynchronous gap>
_ScheduledTask._run.<anonymous closure> (cron.dart:1)
<asynchronous gap>
sitaram-kalluri commented 2 years ago

Spent 5 SP in PR-40

sitaram-kalluri commented 2 years ago

Enhanced the outbound message listener to handle partial responses and responses sent in multiple packets under slow network conditions. PR-185. Burned 3 SP for this.

sitaram-kalluri commented 2 years ago

Total 8SP burned in PR-40

sitaram-kalluri commented 2 years ago

The work is getting tracked in the following Git issue: https://github.com/atsign-foundation/apps/issues/552