demonstration of the advantage

arBmind commented 7 years ago

I just viewed your lightning talk at Meeting C++ 2016 on Youtube. This sounded very familiar. We experimented with persistent data structures a couple of years ago. Things went well until, we tried to demonstrate the advantages in a real use case. We simply could not find any use case, where persistent data structures were able to outperform a more simple implementation. This does not mean they don't exist.

I'm curious if you have found a practical example.

In theory there should be some benefits, but it would help to convince me/us with a practical application.

Single threaded applications are much faster with std::vector or std. associative containers with tweaked allocators. Even for multithreaded applications it is most often faster to duplicate the vector content. Copy is quite fast and all changes will be thread local. The other issue is synchronization with persistent data structures, you easily end up with a version for each thread. Most often this is not what was intended. So most of our applications needed a single threaded repository that serializes all changes and manages integrity.

Sorry for being a bit pessimistic here. I hope you can prove me wrong.

arximboldi commented 7 years ago

My practical experience with immutable data structures draws a lot from ClojureScript and React (via Reagent[1]). In this language, immutable data structures are the default. Note that this is a single-threaded system, but fast comparison (via hash-caching and identity implying equality) make React really shine. This enables writing the UI as a simple pure function and have the infrastructure optimize it, deciding which parts need to be changed. It also enables having the undo history simply be a collection of the previous states. I concede that no framework like React exists for C++ but I envision that it might exist in the future. In the past I wrote a library atria::funken working towards that direction [0]

You say that:

Even for multithreaded applications it is most often faster to duplicate the vector content

Duplicating the vector content is fast enough if your vectors are small enough or you do the copies not so often or without deadlines. But when we are talking about the data model of some large enough application, this is simply too much. Specially, if you have to do it for every frame that is rendering (you want to keep this updates way below 16ms). One of the applications that I am most interested in are Digital Audio Workstations, like Ableton Live [2] (disclaimer, I used to work for Ableton). There, a document often has has hundreds of "tracks" containing hundreds of thousands of "clips" that can contain hundreds of audio samples and hundreds of thousands of notes and automation points. Imagining copying all this data whenever the user performs an action! As I suggest in the talk, with immutable data structures, it is feasible to not implement an undo history via a command pattern, but just keep the previous versions of the document around. Sean Parent is a well-known advocate for this approach and he claims it is the basis of undo in Photoshop [3]

Another interesting use-case is parallelization. RRB-Trees [4] allow logarithmic concatenation and slicing. There is a whole paralelization framework built in Scala around them [5]. This data structure enables parallelizing transformations that change the number of elements in the collection, like filtering.

The other issue is synchronization with persistent data structures, you easily end up with a version for each thread. Most often this is not what was intended. [...] So most of our applications needed a single threaded repository that serializes all changes and manages integrity.

For cases where other threads just "consume" the data (like the views as UI, audio engine, storage examples in the talk), this is what intended. But I understand that sometimes one may want to perform multiple changes concurrently, then merge the changes back to a single source of truth (from what I understand, this was your use-case). I agree that this is an interesting problem. I do not thing that mutable data structures help in solving it though. I plan to add operations to efficiently diff data structures, generating a patch value that can be applied to a data structure. This can help solve this problem, and also be a foundation for building collaborative systems (e.g. Google Docs like applications).

I concede that the documentation does not really elaborate in this direction. Understand that this is work in progress. Once the core functionality is covered I would like to add tutorial-like examples on how to, e.g. build an interactive app using these. I hope that this response provided at least some insight on why building this library is a worthy effort.

[0] https://github.com/Ableton/atria [1] https://github.com/reagent-project/reagent [2] http://ableton.com/ [3] https://www.youtube.com/watch?v=bIhUE5uUFOA [4] https://infoscience.epfl.ch/record/169879/files/RMTrees.pdf [5] https://infoscience.epfl.ch/record/150220/files/pc.pdf

arBmind commented 7 years ago

@arximboldi Thank you for your long and good answer.

My intention is to trigger the creation of one or more good examples. This should help to convince me and others to invest more time into this topic. If you do give me a ping and I will try to challenge your design. No matter how this might turn out, we will learn a lot.

I know it from my own experience: Most of the efforts go to the implementation, some into tests but far too little into convincing examples.

arximboldi commented 7 years ago

Yeah, I will leave this open until I have some satisfactory examples. I'll ping you here then so you can provide your feedback. Thanks!

arximboldi commented 7 years ago

Hi @arBmind!

Some things have happened since you opened this issue:

@kostspielig and I wrote this text-editor called Ewig using the library, that is meant as didactic material to show the architectural patterns that this library enables: https://github.com/arximboldi/ewig
I talked at CppNow about the library going a bit deeper into the architectural issues and discussing the design and implementation of Ewig: https://www.youtube.com/watch?v=ZsryQp0UAC8

I will add links to those in the README. Woudl you consider this addresses your issue?

In the mid-term I would like to turn Ewig into a tutorial and add it directly to the documentation, but I am unsure as to when will I have time to do this (I am looking for sponsors so I can buy more time to dedicate full-time into this wink wink 😉 )

arBmind commented 7 years ago

Hi @arximboldi !

I already watched your video. Very good talk again. The text editor is really impressive, especially the part where the content is bigger than your RAM.

So my takeaway is, that you can build applications that are very easy to reason about (basically stupid code) that still works by the magic of persistent containers.

On the other hand, I still believe if you would put more time and effort into optimizing the architecture, you could get better results even without using persistent data structures. There are text editors that allow you to open and edit files that are much larger than your RAM.

Examples:

https://www.emeditor.com - They claim files up to 248GB
http://www.pilotedit.com - 400GB see http://www.pilotedit.com/images/OpenInQuickMode.PNG

Sorry, that I am so hard to convince. We were there and tried really hard to prove that persistent data structures have a place. I still think they have a niche, but it seems to be very small.

Keep up the good work!

arximboldi commented 7 years ago

Thanks for the feedback!

I agree that persistent data structures are not a silver bullet--indeed recently I found myself trying to convince an overexicted potential user to not use them 😉

To me there is no doubt that, technically, one can always achieve better performance by removing abstraction (one could argue the same about virtual memory, TCP, compilers, etc.). But removing abstraction soon hits the law of diminishing returns, where the effort to build something becomes bigger and bigger but the potential gains become smaller and smaller. Even worse, new system-wide optimizations become impossible, because the issue at hand is spread across the whole codebase, as opposed to abstracted in a single unit.

A good example of performance via abstraction is the success of React.js: by writing UI's at a higher level of abstraction, low level UI state management becomes an implementation detail, and whole system optimizations can be implemented by improving the diffing algorithm or changing the evaluation model (see the new React Fiber). Interestingly, immutable data structures work particularly well with React because they support "reasoning about change", enabling more aggressive optimizations.

Note that the selling point of immutable data-structures is not best performance, but good performance combined with simplicity, in particular in the domains of concurrency and interactivity. There are lots of systems developed in Clojure and Scala for this reason, so I do not believe this is a "small niche" 😄

With regard to the text editors: thanks for the links, those editors are very interesting! I will see if I can try them before my next version of the talk (I am not a Windows user). But note I don't claim that Ewig is the best at editing huge files: the notable thing is that but that even though it just took a couple of weekends to write it works quite decently [1]! The devil is in the trade-offs 😄

Anyways, this was an estimulating discussion, thanks a lot! I think now we are getting to splitting hairs about the surface-size of the use-case, so I am going to close the issue. Still, feel free to reach to me again whenever you find new challenging examples.

[1] I did test Gedit, Emacs, Vim, Sublime, VS Code, Atom and Clion, and most are unable, some just slow, to do big copy-paste on big files ;-) But of course, Ewig has less features, so it is not really that comparable.

arximboldi / immer

demonstration of the advantage #2