enso-org / enso

Enso Analytics is a self-service data prep and analysis platform designed for data teams.
https://ensoanalytics.com
Apache License 2.0

Race in application of pending edits leads to stability issues #8770

Closed. hubertp closed this issue 9 months ago

hubertp commented 10 months ago

Individual edit requests are submitted by the GUI client and served by ApplyEditHandler, which preserves their order.

The actual application of pending edits is delayed until EditFileCmd is executed (as submitted by the collaborative buffer). There we enqueue the edits and trigger a compilation. The compilation will eventually attempt to get all pending edits, apply them and then compile.

We were seeing numerous stability issues, like #8174, #7978 and plenty of others that have not been reported, which we were never able to reproduce and fix properly.

After adding some debugging statements I'm fairly sure that what we are seeing is a race condition in the execution of EditFileCmds. While they are submitted to the executor in the same order as the client requests, nothing really guarantees that the executor will run them in that same order. Most of the time that will be the case, otherwise one would not be able to use the IDE at all, but occasionally one of them will execute out of order, leading to immediate synchronization issues. The old IDE would reopen the file on such a failure, but a) it stopped doing that and b) it's a hack that leads to data loss.

The solution is to ensure that all EditFileCmd commands are executed sequentially and in the order of submission. The latter is already ensured with the appropriate use of locks.
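
To illustrate the intended guarantee, here is a minimal Java sketch. It is not the engine's actual code: `SequentialEdits` and `runInOrder` are hypothetical names, and the numbered tasks only stand in for EditFileCmd commands. It assumes the simplest possible setup, a dedicated single-threaded executor, whose one worker takes tasks from its queue in FIFO order, so execution order matches submission order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only; these are not classes from the Enso engine.
final class SequentialEdits {

  /** Submits `count` numbered stand-in "edit commands" and returns the order in which they ran. */
  static List<Integer> runInOrder(int count) throws InterruptedException {
    // A single worker thread drains its queue in FIFO order, so commands
    // run one at a time and in the order they were submitted.
    ExecutorService executor = Executors.newSingleThreadExecutor();
    // Only the single worker thread writes to this list while commands run.
    List<Integer> applied = new ArrayList<>();
    for (int i = 0; i < count; i++) {
      final int editId = i;
      executor.execute(() -> applied.add(editId));
    }
    executor.shutdown();
    // Waiting for termination also makes the worker's writes visible here.
    if (!executor.awaitTermination(10, TimeUnit.SECONDS)) {
      throw new IllegalStateException("commands did not finish in time");
    }
    return applied;
  }

  public static void main(String[] args) throws InterruptedException {
    System.out.println(runInOrder(5)); // prints [0, 1, 2, 3, 4]
  }
}
```

With a multi-threaded pool in place of the single-threaded executor, the same loop gives no such ordering guarantee, which is the kind of race described above.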

enso-bot[bot] commented 10 months ago

Hubert Plociniczak reports a new STANDUP for yesterday (2024-01-15):

Progress: Trying to come up with a small reproducible case for the Truffle assertion failure that manifested itself in #8595. The equality check seems to depend on the order of specializations, which makes that hard. The wrong definition for Nothing with Warnings will be addressed separately, as discussed with Pavel. After looking into the reported logs I think I finally came up with a scenario that leads to the ongoing stability issues. Will need to specialize some commands to execute sequentially and in order of submission (currently not guaranteed for edits). It should be finished by 2024-01-17.

Next Day: Next day I will be working on the #8770 task. Continue investigating the issue.

hubertp commented 10 months ago

Rather than adding yet another pool for executing yet another kind of synchronous command, I think we should generalize the approach and allow any tasks grouped under the same key to be executed sequentially.
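
A rough Java sketch of what such a keyed approach could look like follows. Everything in it is an assumption made for illustration: `KeyedSequentialExecutor`, its `submit` method and the `"Main.enso"` key are hypothetical names, not the engine's API. Tasks that share a key run one at a time in submission order, while tasks under different keys can still run in parallel on the shared pool.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch; not a class from the Enso engine.
final class KeyedSequentialExecutor {
  private final Executor pool;
  // A key is present in this map exactly while a drainer for it is active.
  private final Map<String, Queue<Runnable>> queues = new HashMap<>();

  KeyedSequentialExecutor(Executor pool) {
    this.pool = pool;
  }

  /** Queues `task` behind all earlier tasks submitted under the same key. */
  synchronized void submit(String key, Runnable task) {
    Queue<Runnable> queue = queues.get(key);
    if (queue != null) {
      queue.add(task); // an active drainer will pick it up
      return;
    }
    queue = new ArrayDeque<>();
    queue.add(task);
    queues.put(key, queue);
    pool.execute(() -> drain(key)); // start a single drainer for this key
  }

  private void drain(String key) {
    while (true) {
      Runnable next;
      synchronized (this) {
        next = queues.get(key).poll();
        if (next == null) {
          queues.remove(key); // nothing left; a later submit starts a new drainer
          return;
        }
      }
      try {
        next.run(); // run outside the lock so other keys are not blocked
      } catch (RuntimeException e) {
        System.err.println("task failed: " + e); // keep draining later tasks
      }
    }
  }

  public static void main(String[] args) throws InterruptedException {
    ExecutorService pool = Executors.newFixedThreadPool(4);
    KeyedSequentialExecutor executor = new KeyedSequentialExecutor(pool);
    List<Integer> applied = new ArrayList<>();
    CountDownLatch done = new CountDownLatch(5);
    for (int i = 0; i < 5; i++) {
      final int editId = i;
      executor.submit("Main.enso", () -> {
        applied.add(editId);
        done.countDown();
      });
    }
    done.await();
    pool.shutdown();
    System.out.println(applied); // prints [0, 1, 2, 3, 4]
  }
}
```

The design choice here is that no thread is dedicated to a key: a drainer occupies a pool thread only while that key has queued work, so the number of keys does not dictate the number of threads.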

enso-bot[bot] commented 10 months ago

Hubert Plociniczak reports a new STANDUP for yesterday (2024-01-16):

Progress: Reviewing #8712. Trying to come up with a generic solution for executing (some) tasks sequentially for #8770. It should be finished by 2024-01-17.

Next Day: Next day I will be working on the #8770 task. Continue investigating the issue.

enso-bot[bot] commented 10 months ago

Hubert Plociniczak reports a new STANDUP for yesterday (2024-01-17):

Progress: Provided a temporary workaround that fixes most of the stability issues. Still trying to figure out a more generic solution. Stumbled upon some bugs (#8793, #8792) while testing and verifying logs submitted by users. Having a hard time coming up with a unit test case. It should be finished by 2024-01-17.

Next Day: Next day I will be working on the #8770 task. Investigating a generic solution + test case.

enso-bot[bot] commented 10 months ago

Hubert Plociniczak reports a new 🔴 DELAY for yesterday (2024-01-18):

Summary: There is a 2-day delay in the implementation of the "Race in application of pending edits leads to stability issues" (#8770) task. It will cause a 0-day delay in the delivery of this weekly plan.

Delay Cause: Unexpected time off

enso-bot[bot] commented 10 months ago

Hubert Plociniczak reports a new STANDUP for yesterday (2024-01-18):

Progress: Continued working on a test case. It should be finished by 2024-01-19.

Next Day: Next day I will be working on the #8770 task. Still investigating.

enso-bot[bot] commented 10 months ago

Hubert Plociniczak reports a new STANDUP for the provided date (2024-01-19):

Progress: Struggled with JPMS to allow for mocking elements of our infrastructure. With help from Pavel we were able to get it working. Looked into the random timeouts reported for engine tests (#8806); reducing the resources used by ZIO will need a proper investigation. It should be finished by 2024-01-19.

Next Day: Next day I will be working on the #8770 task. Address PR review. Pick up next item.