Support backend workflow version control for async multi-user scenario

Texera / texera

Collaborative Machine-Learning-Centric Data Analytics Using Workflows

https://texera.github.io

Apache License 2.0

Support backend workflow version control for async multi-user scenario #1241

Closed Rkyzzy closed 2 years ago

Rkyzzy commented 3 years ago

Currently the undo/redo service works only on the frontend, and its history is cached only in the browser, so closing or refreshing the browser discards the previously saved versions of the workflow. The backend uses the PersistWorkflow endpoint to update the workflow. For both auto-save and user-initiated saves, it only replaces the stored workflow, so previously saved versions are lost and cannot be restored. We want the backend to store multiple previous versions of a workflow and to provide some form of version control for merging these workflow histories in an async multi-user scenario. The TODO is to first explore the easier case, which is the system without the user system: one user, sequential edits. Then research should be conducted on similar cases of the harder async multi-user scenario; implementations like Google Docs can be investigated for this purpose. Progress will be posted under this issue.

versioned autosave workflow

Autosaving a workflow while maintaining the previous versions implies the following setting:

The change is frequent (cache intermediate actions)
The difference is relatively small (compute the difference)
Persist the difference (write the difference)
List versions (list the number of versions)
Retrieve a version (apply patch)

MLflow tracking is an example application that uses Git to version a model. Since our workflow changes frequently and the underlying representation is not code, this doesn't directly apply to us.
Automatiko and Temporal.io do workflow version control, but their setting is different: they assume that new versions are infrequent, and they focus on how to apply a new version to an ongoing execution of the workflow.

The next step is to follow the 5 steps above to implement the version control, similar to Google's autosave and Chrome's autosave. Since our workflow is internally represented as a JSON object, tools to compute the difference and apply patches are listed below.
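As a rough illustration of the compute, persist, list and retrieve steps above, here is a minimal Python sketch. It is not the Texera implementation: `diff`, `apply_patch`, `VersionStore` and the `["set", ...]` / `["del"]` / `["sub", ...]` delta encoding are all hypothetical names chosen for this example, lists are treated as atomic values, and an in-memory list of JSON strings stands in for the database.

```python
import copy
import json


def diff(old, new):
    """Return a forward delta that turns dict `old` into dict `new`.

    Nested dicts are compared recursively. Any other value (including
    lists) is treated as atomic and stored whole when it changes.
    """
    delta = {}
    for key in old.keys() | new.keys():
        if key not in new:
            delta[key] = ["del"]
        elif key not in old:
            delta[key] = ["set", copy.deepcopy(new[key])]
        elif isinstance(old[key], dict) and isinstance(new[key], dict):
            child = diff(old[key], new[key])
            if child:
                delta[key] = ["sub", child]
        elif old[key] != new[key]:
            delta[key] = ["set", copy.deepcopy(new[key])]
    return delta


def apply_patch(doc, delta):
    """Return a new dict with `delta` applied to `doc`."""
    result = copy.deepcopy(doc)
    for key, op in delta.items():
        if op[0] == "del":
            result.pop(key, None)
        elif op[0] == "set":
            result[key] = copy.deepcopy(op[1])
        else:  # "sub": recurse into the nested dict
            result[key] = apply_patch(result[key], op[1])
    return result


class VersionStore:
    """Keeps a base workflow plus one persisted delta per saved version."""

    def __init__(self, base):
        self.base = copy.deepcopy(base)
        self.latest = copy.deepcopy(base)
        self.deltas = []  # JSON strings; stand-in for rows in a database

    def commit(self, workflow):
        """Compute and persist the difference; return the version number."""
        delta = diff(self.latest, workflow)
        if delta:
            self.deltas.append(json.dumps(delta))
            self.latest = copy.deepcopy(workflow)
        return len(self.deltas)

    def list_versions(self):
        """Version 0 is the base; version i is the base plus i deltas."""
        return list(range(len(self.deltas) + 1))

    def checkout(self, version):
        """Rebuild a version by applying the first `version` deltas."""
        doc = copy.deepcopy(self.base)
        for raw in self.deltas[:version]:
            doc = apply_patch(doc, json.loads(raw))
        return doc
```

The first step (caching intermediate actions) is not shown; the sketch assumes `commit` is called once per autosave after the frontend has batched its edits.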

Rkyzzy commented 3 years ago

https://www.figma.com/blog/how-figmas-multiplayer-technology-works/ https://www.figma.com/blog/behind-the-feature-autosave/ Figma's CTO and engineering team posted these two blog posts, which state the problem clearly, and their app's collaborative nature is very similar to Texera's. I read both articles. They explain the problem well, with videos and illustrations, but I don't yet understand their final solution; it is a little vague. I need to revisit them to understand it and find out whether it is suitable for our system.

Rkyzzy commented 3 years ago

https://en.wikipedia.org/wiki/Operational_transformation#Critique_of_OT https://en.wikipedia.org/wiki/Conflict-free_replicated_data_type#G-Counter_(Grow-only_Counter) There currently seem to be two possible solutions for this scenario, covered by these two Wikipedia links. I have read part of them and am still investigating; there are many useful paper links under them.

Operational transformation (OT) is a technology for supporting a range of collaboration functionalities in advanced collaborative software systems. OT was originally invented for consistency maintenance and concurrency control in collaborative editing of plain text documents. Its capabilities have been extended and its applications expanded to include group undo, locking, conflict resolution, operation notification and compression, group-awareness, HTML/XML and tree-structured document editing, collaborative office productivity tools, application-sharing, and collaborative computer-aided media design tools. In 2009 OT was adopted as a core technique behind the collaboration features in Apache Wave and Google Docs.
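To make the idea concrete, here is a minimal Python sketch of the core OT step for two concurrent text insertions. The `Insert` class, the `site` tie-breaker and the `transform` function are illustrative names for this example, not taken from Google Wave or any OT library. Real OT systems also handle deletions, more than two sites, and server-side ordering.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Insert:
    pos: int    # index in the document where the text is inserted
    text: str
    site: int   # hypothetical site id, used only to break position ties


def apply(doc: str, op: Insert) -> str:
    """Apply an insertion to a plain-text document."""
    return doc[:op.pos] + op.text + doc[op.pos:]


def transform(op: Insert, against: Insert) -> Insert:
    """Adjust `op` so it can be applied after `against` was applied.

    If `against` inserted text at or before the position of `op`,
    `op` must shift right by the length of that text. Ties on the
    same position are broken by site id so both sites agree.
    """
    if op.pos < against.pos or (op.pos == against.pos and op.site < against.site):
        return op
    return Insert(op.pos + len(against.text), op.text, op.site)
```

With the document `"abc"` and two concurrent operations `Insert(1, "X", 1)` and `Insert(1, "Y", 2)`, each site applies its own operation first and then the transformed remote one, and both end up with the same text.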

A conflict-free replicated data type (CRDT), used in distributed systems, is a data structure that can be replicated across multiple computers in a network, where the replicas can be updated independently and concurrently without coordination between them, and where it is always mathematically possible to resolve inconsistencies that might come up.
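The G-Counter from the Wikipedia link above is the simplest CRDT and shows the merge property in a few lines. This is a minimal Python sketch; the class and method names are my own, and a workflow would need a much richer CRDT than a counter.

```python
class GCounter:
    """Grow-only counter: each replica only increments its own slot."""

    def __init__(self, replica_id: str):
        self.replica_id = replica_id
        self.counts = {}  # replica id -> increments seen from that replica

    def increment(self, amount: int = 1) -> None:
        if amount < 0:
            raise ValueError("a G-Counter can only grow")
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + amount

    def value(self) -> int:
        """The counter value is the sum over all replica slots."""
        return sum(self.counts.values())

    def merge(self, other: "GCounter") -> None:
        """Take the per-replica maximum.

        The maximum is commutative, associative and idempotent, so
        replicas converge regardless of merge order or repetition.
        """
        for rid, n in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), n)
```

Two replicas that increment independently and then merge in either order, even repeatedly, report the same value.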

Rkyzzy commented 3 years ago

https://www.youtube.com/watch?v=3ykZYKCK7AM I watched this short talk by a Google Wave engineer. It seems that in a client-server model, solving our issue requires operational transformation on the server, so I need to investigate OT and its related algorithms further.

Rkyzzy commented 3 years ago

TODO: Watch these two Google Tech Talks:

1. Issues and Experiences in Designing Real-time Collaborative Editing Systems https://www.youtube.com/watch?v=84zqbXUQIHc
2. Differential Synchronization https://www.youtube.com/watch?v=S2Hp_1jqpY8

Rkyzzy commented 3 years ago

TODO: Read Google's paper about Differential Synchronization: https://research.google.com/pubs/archive/35605.pdf

Rkyzzy commented 3 years ago

https://softwareengineering.stackexchange.com/questions/202815/how-to-save-during-real-time-collaboration A very useful Stack Exchange question regarding this issue. It suggests two ways of implementing this:

Use Google's Realtime API, which has OT implemented inside.
General solution:
1) Let the server be the multiplexer; it always has the most up-to-date view of the document.
2) For conflict resolution, find a third-party algorithm/module, since doing this alone is tough. If a third-party algorithm can't be used, simply prompt the user.
3) When a new user joins, give them the most recent document and automatically start streaming the commands to them. The server has the most recent view and thus can dish it out automatically.
4) Back up to the database at certain intervals. Decide how often you want to back up (every 5 minutes or maybe every 50 changes). This allows you to maintain the backup you desire.
Drawbacks:
1) Throughput of the server could bottleneck performance.
2) Too many people reading/writing could overload the server.
3) People may become out of sync if a message is lost, so you may want to make sure you synchronize at regular points. This means sending out the whole message again, which can be costly, but otherwise people might not have the same document and not know it.
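The general solution above can be sketched in Python as follows. This is only an in-process illustration under loud assumptions: `Client`, `MultiplexServer` and their methods are hypothetical names, a command is just a key/value assignment, conflicts are resolved by arrival order at the server (not by a third-party algorithm or by prompting the user), and a list stands in for the database.

```python
import copy


class Client:
    """Hypothetical client holding a local copy of the document."""

    def __init__(self, name):
        self.name = name
        self.document = {}

    def receive_snapshot(self, document):
        # Replace the local copy with the full document from the server.
        self.document = copy.deepcopy(document)

    def receive_command(self, key, value):
        # Apply one streamed command to the local copy.
        self.document[key] = copy.deepcopy(value)


class MultiplexServer:
    """Holds the authoritative document and streams commands to clients."""

    def __init__(self, backup_every=50):
        self.document = {}
        self.clients = []
        self.backups = []  # stand-in for snapshots written to a database
        self.backup_every = backup_every
        self._pending = 0  # changes applied since the last backup

    def join(self, client):
        # A new user first gets the most recent document, then the stream.
        client.receive_snapshot(self.document)
        self.clients.append(client)

    def submit(self, key, value):
        # Apply the command in arrival order and stream it to every client.
        self.document[key] = copy.deepcopy(value)
        for client in self.clients:
            client.receive_command(key, value)
        self._pending += 1
        if self._pending >= self.backup_every:
            self.backups.append(copy.deepcopy(self.document))
            self._pending = 0

    def resync(self):
        # Costly full resend, for clients that may have lost a message.
        for client in self.clients:
            client.receive_snapshot(self.document)
```

Backing up by change count rather than by time keeps the sketch deterministic; a timer-based interval would work the same way at the `submit` boundary.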

Rkyzzy commented 3 years ago

TODO: Implement the single-user saving strategy and the restore operation.

Rkyzzy commented 3 years ago

Candidate libraries for this issue will be posted below and updated as they are evaluated.

Rkyzzy commented 3 years ago

Library1 GSON (UPDATING) Link to library: https://github.com/google/gson, documentation

Provider: Google

Stars: 19.9k

Maintenance status: Actively updated and maintained

Brief introduction: It converts Java objects to JSON and back.

How it helps our task: It supports versioning through its @Since annotation, which can be applied to classes, fields and, in a future release, methods.

Advantage:

Disadvantage: Its functionality is not clear (e.g. we don't know whether it can handle nested JSON objects).

Demo usage and Expected output: can be found here demo usage, tutorial ......

Rkyzzy commented 3 years ago

Library2 Json-Version-Control (UPDATING) Link to library: https://github.com/datoMarjanidze/json-version-control

Provider: datoMarjanidze

Stars: 3

Maintenance status: No updates in the last 3 years

Brief introduction: It's a small npm library for JSON version control. It provides a solution to two tasks: 1) storing updated information; 2) catching differences (property creation, deletion and value modification).

How it helps our task: It provides two key functions for the job. 1) createHistoryObject, which takes three parameters (versionNumber (Number), predecessorObject (Object), currentObject (Object)). 2) restoreHistoryObject, which takes three parameters (historyObjects (Array), currentObject (Object), options (Object)). Details can be checked in its usage documentation.

Operation Complexity: To be analyzed.

Advantage: It fits our need well: it does diff-based versioning and computes the deep difference of (nested) JSON objects, which most libraries cannot do. For restoring (checkout), its spec satisfies our user need.

Disadvantage: Whether we can trust the library is doubtful. It has few stars on GitHub and has not been maintained or updated for three years. It needs to be verified.

==================================================================================

Testing: Finished a basic trial using it on our workflow. The data is a real workflow from a production environment. The tested changes include moving operators' positions, adding/deleting operators, changing operator properties, and changing the links between operators. I hardcoded 8 demo workflows with these gradual changes. (Same test as for library 3.)

Result: It correctly performs the operations we want: commit a version (using its createHistoryObject to perform the diff and store it), list all versions, and check out a specific version (using its restoreHistoryObject).

Drawback: 1) It has the same nested array problem as library 3. Even worse, for any modification to our workflow, the operatorPositions and breakpoints parts are always stored in the change, even when I did not modify them. I haven't figured out why; it may be an issue in the library's implementation. 2) Its restoreHistoryObject only provides a "merge-until-reach-the-oldest-version" feature. To get a specific version, we need to manually select the portion of the changes to merge.

Comments: In general, it can do the work, but it has the same problem as library 3 and worse. Combined with the repo's low activity (3 stars and no updates), I suggest we do not use this as our versioning library.

......

Rkyzzy commented 3 years ago

Library3 node-rus-diff (UPDATING)

Link to library: https://github.com/mirek/node-rus-diff

Provider:mirek

Stars: 116

Maintaining status: Last update on Oct. 2020

Brief introduction: (R)emove-(U)pdate-(S)et JSON diff library can be used standalone to compute difference between two JSON objects.

How it helps our task: It provides a tool for comparing and diffing JSON objects, which is a key step in storing diffs. (Its functionality is like a subset of the library above: the diff part.) The library's documentation has examples of the diff and apply operations.

Operation Complexity: To be analyzed.

Advantage: It is a well-wrapped library for JSON diff, which is the key step for our task, and it has been used by many people.

Disadvantage: It has three remaining issues, as noted by the developer: 1) It will not dive into nested arrays. 2) Whether an array is compared as an ordered or unordered set is hard to specify. 3) The code is written in dated CoffeeScript, which is less desirable than a TypeScript implementation.

==============================================================================

Testing: Finished a basic trial using it on our workflow. The data is a real workflow from a production environment. The tested changes include moving operators' positions, adding/deleting operators, changing operator properties, and changing the links between operators. I hardcoded 8 demo workflows with these gradual changes.

Result: It correctly performs the operations we want: commit a version (perform the diff and store it), list all versions, and check out a specific version (apply changes to the latest version step by step, backwards). Correctness is checked with assert.deepStrictEqual.

Some drawbacks I found after testing: 1. For a nested array in our workflow, for example the "operators" part, it stores the entire array as the diff. It treats the array as a whole, so even if only a tiny part of the array changes, it still stores the whole array, which is not optimal. 2. As mentioned in Disadvantage 2), it is unclear whether it should compare an array as an ordered or an unordered set. I'm not sure whether the order of the nested arrays in Texera's workflow matters.

...

Rkyzzy commented 3 years ago

Library4 jsondiffpatch

Link to library: https://github.com/benjamine/jsondiffpatch

Provider: benjamine

Stars: 3.8k

Maintenance status: Actively maintained

Brief introduction: A JavaScript library that performs diff/patch/unpatch operations on JSON.

How it helps our task: Like the libraries above, it can diff JSON to produce the delta of a workflow change to save, and it can patch and unpatch to restore or check out a certain version.

==============================================================================

Testing: Finished a trial using it on our workflow. The data is a real workflow from a production environment. The tested changes include moving operators' positions, adding/deleting operators, changing operator properties, and changing the links between operators. I hardcoded 8 demo workflows with these gradual changes.

Result: For correctness, it performs well. It diffs two workflows in a reasonable way, and it has both patch() and unpatch() to apply a delta forward or backward. For performance, its way of storing changes outperforms the libraries above (json-version-control and node-rus-diff) because it handles nested arrays well and does not store unnecessary information. For example, in the same case of changing a workflow operator's property, it stores only the changed operator instead of the whole list of operators. The format of its changes is also clear to me. It also provides some utility functions, such as deep cloning a JSON object.
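To show what a reversible delta buys us, here is a minimal Python sketch of diff, patch and unpatch. It is not jsondiffpatch and does not use its delta format: the tuple encoding (`"mod"`, `"add"`, `"del"`, `"dict"`, `"list"`) and the function names are assumptions for this example, and lists are compared index by index only when they have the same length, which is a simplification of what a real library does.

```python
import copy


def diff(old, new):
    """Return a reversible delta between two JSON-like values, or None if equal."""
    if old == new:
        return None
    if isinstance(old, dict) and isinstance(new, dict):
        changes = {}
        for key in sorted(old.keys() | new.keys()):
            if key not in old:
                changes[key] = ("add", copy.deepcopy(new[key]))
            elif key not in new:
                changes[key] = ("del", copy.deepcopy(old[key]))
            else:
                child = diff(old[key], new[key])
                if child is not None:
                    changes[key] = child
        return ("dict", changes)
    if isinstance(old, list) and isinstance(new, list) and len(old) == len(new):
        # Same-length lists: record only the elements that changed.
        changes = {}
        for i, (a, b) in enumerate(zip(old, new)):
            child = diff(a, b)
            if child is not None:
                changes[i] = child
        return ("list", changes)
    # Leaves, type changes, and lists of different length: keep both sides.
    return ("mod", copy.deepcopy(old), copy.deepcopy(new))


def patch(value, delta, reverse=False):
    """Apply `delta` forward, or backward when `reverse` is True."""
    kind = delta[0]
    if kind == "mod":
        return copy.deepcopy(delta[1] if reverse else delta[2])
    if kind == "list":
        result = list(value)
        for i, child in delta[1].items():
            result[i] = patch(result[i], child, reverse)
        return result
    result = dict(value)  # kind == "dict"
    for key, child in delta[1].items():
        if child[0] in ("add", "del"):
            # An "add" applied backward removes the key, and vice versa.
            present_after = (child[0] == "add") != reverse
            if present_after:
                result[key] = copy.deepcopy(child[1])
            else:
                result.pop(key, None)
        else:
            result[key] = patch(result[key], child, reverse)
    return result


def unpatch(value, delta):
    """Undo `delta`, turning the newer value back into the older one."""
    return patch(value, delta, reverse=True)
```

Because every change keeps both the old and the new value, one stored delta is enough to move between adjacent versions in either direction, so a checkout can start from whichever stored snapshot is closest.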

The format of changes it stores: This is the delta produced after modifying the 'limit' property of the limit operator from 2 to 3. [image: 1629613978(1)]

For more trials, it has a live demo online that you can try.

Short comment about this library: In my opinion this library is worth trying. It fulfills our requirements in both correctness and performance, it is supported, and it is released under the MIT license.

Rkyzzy commented 3 years ago

Recent work on comparing different storage strategies, the basic implementation design, and library evaluation with performance tests: versioning_design_and_testing.pptx

Rkyzzy commented 3 years ago

Some frontend changes for the versioning part:

Add a button in the navigation bar to show/hide the version history table (the user can also click the metadata to show/hide it)
Populate the full version log history into a table to the right of the canvas

zuozhiw commented 2 years ago

Discussion 1/13: closed, as we have implemented multiple versions for the single-user case.