atsign-foundation / at_client_sdk

The Dart implementation of atSDK used for implementing Atsign's technology into other software
https://pub.dev/publishers/atsign.org/packages
BSD 3-Clause "New" or "Revised" License
1.47k stars 32 forks source link

sync_multiple_client functional test is too flaky and fails often #1291

Closed murali-shris closed 2 months ago

murali-shris commented 3 months ago

Describe the bug

https://github.com/atsign-foundation/at_client_sdk/actions/runs/8644815412/attempts/1 Most of the time, this test fails during first run and passes on re-run In this sample run, the test failed twice and passed on 3rd run

Update: After Investigation, it was found that the flakiness was due to the test being unable to complete the initial sync. The test is attempting to complete the initial sync repeatedly until the test times out.

Steps to reproduce

Expected behavior

Additional context

The server-side behaviour was analyzed to identify the cause of sync failure. It was found that the scenario causing this issue is: 'public encryption being updated by both the isolates in the test within milliseconds of each other'. Two updates of the same key happening concurrently. Let me explain this through an example:

e.g. there are two updates of the public enc key by two different clients (call them update1 and update2). keystote.put() is called twice within milliseconds of each other, and both updates run parallelly. Assume 'update1 has created internal key 10' in commitLog/commitLogCacheMap and 'update2 has internal key 11'. Now in some cases internal key 11 is being committed first and when update2 reaches the stage of updating the commitLogCacheMap; the entry with commitId 11 is already in the cache map and update2 is overwriting that commitEntry with another commitEntry that has commitId 10.

Now, when the latest commitEntry for public encryption key should be 11 but commitLogCache only has an entry with commitId 10. And in this case the client would repeatedly try to sync commitEntry 11, but the cache map would not have that. This caused the client to get stuck.

Possible solution

srieteja commented 2 months ago

The cause for this flakiness has been identified in the CommitLogKeyStore, which has been explained in the description.Will be coming up with a fix to the server-side bug in PR86. Hence, carrying it forward.

gkc commented 2 months ago

Thanks @srieteja