Closed murali-shris closed 2 months ago
The cause for this flakiness has been identified in the CommitLogKeyStore, which has been explained in the description.Will be coming up with a fix to the server-side bug in PR86. Hence, carrying it forward.
Thanks @srieteja
Describe the bug
https://github.com/atsign-foundation/at_client_sdk/actions/runs/8644815412/attempts/1 Most of the time, this test fails during first run and passes on re-run In this sample run, the test failed twice and passed on 3rd run
Update: After Investigation, it was found that the flakiness was due to the test being unable to complete the initial sync. The test is attempting to complete the initial sync repeatedly until the test times out.
Steps to reproduce
Expected behavior
Additional context
The server-side behaviour was analyzed to identify the cause of sync failure. It was found that the scenario causing this issue is: 'public encryption being updated by both the isolates in the test within milliseconds of each other'. Two updates of the same key happening concurrently. Let me explain this through an example:
e.g. there are two updates of the public enc key by two different clients (call them update1 and update2). keystote.put() is called twice within milliseconds of each other, and both updates run parallelly. Assume 'update1 has created internal key 10' in commitLog/commitLogCacheMap and 'update2 has internal key 11'. Now in some cases internal key 11 is being committed first and when update2 reaches the stage of updating the commitLogCacheMap; the entry with commitId 11 is already in the cache map and update2 is overwriting that commitEntry with another commitEntry that has commitId 10.
Now, when the latest commitEntry for public encryption key should be 11 but commitLogCache only has an entry with commitId 10. And in this case the client would repeatedly try to sync commitEntry 11, but the cache map would not have that. This caused the client to get stuck.
Possible solution