arximboldi / immer

Postmodern immutable and persistent data structures for C++ — value semantics at scale
https://sinusoid.es/immer
Boost Software License 1.0
2.48k stars 177 forks source link

Status of diff & patch #58

Closed bobkocisko closed 5 years ago

bobkocisko commented 5 years ago

Juanpe,

I'm currently considering options to simplify the application state management in a large app. So far the whole app has been coded imperatively and it gets somewhat tricky because one requirement we have is to track and send updates externally of only the state that has changed. It's becoming somewhat unmanageable as more new global features are being added...I guess we're just reaching the point of diminishing returns with the current approach. So I started looking into functional programming techniques which led me to immutable data structures and I came across your library. This is exciting stuff. One thing that is holding me back though from testing immer is the lack of diff / patch support. Since high performance is also a requirement but since we need to be very aware of updating only what has changed, support for efficient diffing would be essential. What is the status of this? And what is the level of difficulty to implement it?

arximboldi commented 5 years ago

Hi!

Awesome that you are looking into using Immer for your project!

The diff/patch I haven't started working on, and I doubt I will have time to work on that until next year, unless I get some generous sponsor :)

On the other hand, I wonder whether diff/patch is really needed for your application. You can do ad-hoc diff/patch by comparing the nodes in your data-model tree (Immer already supports efficient == and !=) but of course it depends on the shape of your data and the size of typical documents. If you give me some more details I can give you better hints on what the best approach might be to integrate Immer.

Cheers!

JP

bobkocisko notifications@github.com writes:

Juanpe,

I'm currently considering options to simplify the application state management in a large app. So far the whole app has been coded imperatively and it gets somewhat tricky because one requirement we have is to track and send updates externally of only the state that has changed. It's becoming somewhat unmanageable as more new global features are being added...I guess we're just reaching the point of diminishing returns with the current approach. So I started looking into functional programming techniques which led me to immutable data structures and I came across your library. This is exciting stuff. One thing that is holding me back though from testing immer is the lack of diff / patch support. Since high performance is also a requirement but since we need to be very aware of updating only what has changed, support for efficient diffing would be essential. What is the status of this? And what is the level of difficulty to implement it?

-- You are receiving this because you are subscribed to this thread. Reply to this email directly or view it on GitHub: https://github.com/arximboldi/immer/issues/58

-- ∿∿∿ https://sinusoid.al ∿∿∿ https://sinusoid.es

bobkocisko commented 5 years ago

JP,

Thanks! The app is basically MVC but the model also represents external device state which must be kept in sync. We need to support up to 250 external devices, each of which has about 10-300 settings. When a command is fired from the UI it needs to map to about 1-50 individual settings changes as well as some app state changes and many of these are asynchronous. Each setting has to be successfully sent and acknowledged by the external device(s). We must accurately track each device's current settings in memory because if we try to set something but it was already at that value, we get no acknowledgment from the device and have to assume something is wrong. And eventually we would like to have this all work transactionally so that if any of the settings requests fail that the others are rolled back to return to consistency. Additionally, each command needs to be able to be run in parallel with others; we cannot force them to run sequentially since they can be delayed with the network round-trips and multiple external APIs that we must update at the same time.

So far everything is working pretty well with some annoying caveats (annoying to a software architect, not so much the users). But new requirements are coming all the time and making the software harder and harder to reason with. I think that functional programming techniques like map/reduce/filter and immutable data structures would dramatically simplify the code and lessen brain fatigue :). But I don't want to lose performance in the meantime. The app is extremely performant right now and we want to keep it that way. Besides using immer for storing the state of the app, are there any optimized libraries for implementing map/reduce/filter in a performant way in C++, similar to what GHC does for Haskell (https://stackoverflow.com/a/35038374/4307047)?

One reason why I'm asking about performant diffing / patching is because I could see that as a way to answer some of these problems in a more encapsulated way and to help with merging the state of asynchronous commands into the rest of the application state. Another reason is that we have hundreds of controls in our UI and we have specific algorithms and signal/slot mechanisms to make sure that only the minimum number of OpenGL draw calls are run to update only the control(s) that have changed rather than re-drawing the whole screen. Our app runs on limited hardware so the user experience would be pretty bad and the processor would be running much hotter if we had to redraw everything all the time. I think that perhaps transitioning into a map/reduce architecture and hooking the UI updating signal/slot into the end of that pipeline might be an answer here but I'm only speculating.

Also, do you have any recommendations for maintaining pointers between nodes in an immutable document, which stay up-to-date across changes? The current design of our app does this everywhere. But every time I try to think about applying that to an immutable structure, my brain goes into an infinite loop. :) I suspect the only answer would be to store the identity or path from the document root and then re-traverse that path after every update to find the latest changes. Am I missing an easier/faster way?

Bob

arximboldi commented 5 years ago

Hi!

Greetings from my layover flight on my way from CppCon... Sorry for the delayed reply, I needed to give it some thinking and I was busy travelling and coferencing :-)

Lots of interesting points there. Let me try answer one by one:

Thanks! The app is basically MVC but the model also represents external device state which must be kept in sync. We need to support up to 250 external devices, each of which has about 10-300 settings.

Are the settings dynamic (stored in a hash-map or alike), or static (just structs)?

Either way, that should be easily manageable with an immutable approach.

When a command is fired from the UI it needs to map to about 1-50 individual settings changes as well as some app state changes and many of these are asynchronous. Each setting has to be successfully sent and acknowledged by the external device(s). We must accurately track each device's current settings in memory because if we try to set something but it was already at that value, we get no acknowledgment from the device and have to assume something is wrong. And eventually we would like to have this all work transactionally so that if any of the settings requests fail that the others are rolled back to return to consistency. Additionally, each command needs to be able to be run in parallel with others; we cannot force them to run sequentially since they can be delayed with the network round-trips and multiple external APIs that we must update at the same time.

As you guessed, this sounds like a very good use-case for immutable data-structures, since launching asynchronous operations or keeping old values around is trivial and extremelly performant (no locking or copying data needed for either case). This means that concurrency and transactions come basically "for free". There is a performance penalty when updating the data-structure, but it is still reasonable, and probably compensated by the improvements in the other parts.

So far everything is working pretty well with some annoying caveats (annoying to a software architect, not so much the users). But new requirements are coming all the time and making the software harder and harder to reason with. I think that functional programming techniques like map/reduce/filter and immutable data structures would dramatically simplify the code and lessen brain fatigue :). But I don't want to lose performance in the meantime. The app is extremely performant right now and we want to keep it that way.

I agree!

Besides using immer for storing the state of the app, are there any optimized libraries for implementing map/reduce/filter in a performant way in C++, similar to what GHC does for Haskell (https://stackoverflow.com/a/35038374/4307047)?

It depends on what you want to achieve with map/filter etc. Ranges-v3 would be something recommendable in this scenario. You may as well take a look at the work I did on "transducers" a few years ago: https://www.youtube.com/watch?v=vohGJjGxtJQ. In either case, it feels to me that finding the right data-model (based around immutable data) is a more important challenge for you right now, since it is the most architecturally-relevant change. Having good map/filter like API's is more of a nice to have thing. Also note that a lot of things are just different in Haskell due to lazy evaluation (and not necessarily faster!) which fundamentally changes the way to think about recursion.

One reason why I'm asking about performant diffing / patching is because I could see that as a way to answer some of these problems in a more encapsulated way and to help with merging the state of asynchronous commands into the rest of the application state.

The merging you can already do by carefully using the API provided by Immer to make sure updates are done in the most efficient way (e.g. avoid unnecessary allocations etc.) If you are going to need immer::map or immer::set, note that "transients" are not yet supported (but they are in the road-map). This is an optimization that would allow you to gain some extra cycles, depending on how your data is structured. If this eventually becomes critical for you, please consider sponsoring that work!

Another reason is that we have hundreds of controls in our UI and we have specific algorithms and signal/slot mechanisms to make sure that only the minimum number of OpenGL draw calls are run to update only the control(s) that have changed rather than re-drawing the whole screen.

This is the part where diffing could be useful, to reconstruct the set of changes from an mixed set of updates. There are some ways you can do this without native immer support, and depending on your data, it might be enough. Another technique you can used to is to provide "hints" through your event paths, that preserve information about the nature or scope of changes that occurred, such that UI updates can be done more efficiently.

Our app runs on limited hardware so the user experience would be pretty bad and the processor would be running much hotter if we had to redraw everything all the time. I think that perhaps transitioning into a map/reduce architecture and hooking the UI updating signal/slot into the end of that pipeline might be an answer here but I'm only speculating.

It is hard for me to provide you very specific performance guesses without knowing the details of the application or the hardware. With simple immutable data the data is often flatter and more compact than with observable object trees (QObject and equivalent) allowing much faster processing. Plus you can remove a lot of signal/slot firing, and almost all mutex locking, highly improving latency and time wasted in coordination. In the end, the only solution is really to prototype and meassure.

Also, do you have any recommendations for maintaining pointers between nodes in an immutable document, which stay up-to-date across changes? The current design of our app does this everywhere. But every time I try to think about applying that to an immutable structure, my brain goes into an infinite loop. :) I suspect the only answer would be to store the identity or path from the document root and then re-traverse that path after every update to find the latest changes. Am I missing an easier/faster way?

As you guessed, you can't do that. That is the whole point of immutable data! If you reference something else, you need to either:

  1. Store that something else, as a value, fully in the thing that references it. Note that if you do this right the data would be stored only once, just referenced from multiple places. But since the data is still logically just a tree, when you update a property of the referenced thing, you would need to update it everywhere.

  2. Give the thing you want an identity, just as you sketched.

I suspect you need the second. I would argue that depending on how you propagate the changes, this shall not be a problem at all. One tool you can use to easily observe the changes from the outside (from mutable interfaces) is using "lenses". I did an implementation some time ago in a library called Atria (http://github.com/ableton/atria) where lenses helped bind an underlying value-based data-model to a QObject based model API that was used to build an UI using QML.

Thinking about performance in this new world is hard, because the trade-offs are reversed from that of a mutable world. Specially if your system is highly concurrent, I would suspect you would actually see performance gains, even though some operations (certain classes of updates) become slower in a single-threaded scenario. You also have to get used to modelling things and programming in a different way, and there are certain pitfalls that you have to avoid (like in every programming model!). Also, the ability to safely reason locally and separation between logic and data that this promotes is a great tool when optimicing system, for one can be more playful with representation and implement "global optimizations".

If you can afford it, try to make a prototype to prove yourself that this is the right direction. I work as consultant and contractor so I may be able to help with that.

Cheers!

JP

Bob

-- You are receiving this because you commented. Reply to this email directly or view it on GitHub: https://github.com/arximboldi/immer/issues/58#issuecomment-423550223

-- ∿∿∿ https://sinusoid.al ∿∿∿ https://sinusoid.es

bobkocisko commented 5 years ago

JP,

Wow! Thanks for such a complete reply. Your suggestions were very helpful and I've had a chance now to dive into ranges-v3, atria/transducers and immer, with some forays back to react/redux, over to elm, and then learning a bit of haskell and clojure too. My brain is tired but very happy. :)

Are the settings dynamic (stored in a hash-map or alike), or static

(just structs)?

Either way, that should be easily manageable with an immutable approach.

I designed a state tree for the app with immutability in mind and it immediately became clear that the app has evolved since its inception to a point where its state is really quite static, so structs and immer::arrays seem to be the answer rather than immer::maps, at least for a first pass. Also although we have to support thousands of external device settings, the UI-related settings are really the only ones that need to be stored in immutable persistent state. Those revelations have opened up some really neat ideas of performance improvements especially related to serialization and reasoning about differences (thanks to boost.hana). Also, we now finally have a use case for C++'s pointer-to-member operator and variadic templates to make accessing and updating specific paths of the (immutable) state tree re-usable and fast. Very cool!

Another reason is that we have hundreds of controls in our UI and we have specific algorithms and signal/slot mechanisms to make sure that only the minimum number of OpenGL draw calls are run to update only the control(s) that have changed rather than re-drawing the whole screen.

This is the part where diffing could be useful, to reconstruct the set of changes from an mixed set of updates. There are some ways you can do this without native immer support, and depending on your data, it might be enough. Another technique you can used to is to provide "hints" through your event paths, that preserve information about the nature or scope of changes that occurred, such that UI updates can be done more efficiently.

I realized that with boost.hana and given the static nature of the state tree, all state tree paths can be represented by a vector. If the type is a struct, then the int is the boost.hana tuple offset index of the member to access. If the type is immer::array, then the int is an offset into the array. So for serialization, we can iterate both trees and for each difference generate this vector, send it across the wire and apply the same update on the other side. The largest individual array in the tree is 27 elements, so iterating over that shouldn't really be a performance issue. And because immer::array efficiently overlaods == we can avoid going into any array if it hasn't changed. Simple and beautiful.

One tool you can use to easily observe the changes from the outside (from mutable interfaces) is using "lenses". I did an implementation some time ago in a library called Atria (http://github.com/ableton/atria) where lenses helped bind an underlying value-based data-model to a QObject based model API that was used to build an UI using QML.

I searched for "lens" in atria but didn't find anything. Has the name changed? How is this different from view_adaptors in ranges-v3?

Thinking about performance in this new world is hard, because the trade-offs are reversed from that of a mutable world. Specially if your system is highly concurrent, I would suspect you would actually see performance gains, even though some operations (certain classes of updates) become slower in a single-threaded scenario. You also have to get used to modelling things and programming in a different way, and there are certain pitfalls that you have to avoid (like in every programming model!). Also, the ability to safely reason locally and separation between logic and data that this promotes is a great tool when optimicing system, for one can be more playful with representation and implement "global optimizations".

I agree. A very welcome paradigm shift. Thinking in FP after thinking for years in OOP feels like finding an island after trying to tread water in a monsoon. :) For instance, one of the recent challenges in our codebase spanned three classes, each containing about two dozen lines of code, and was too complex to bother fully implementing. I just mocked up an FP solution for the same problem using ranges-v3 in about a dozen dead-simple lines of code in one function. And it is a full implementation which covers every case. Amazing.

Cheers!

Thanks again!

Bob

arximboldi commented 5 years ago

Hi Bob!

Wow! Thanks for such a complete reply. Your suggestions were very helpful and I've had a chance now to dive into ranges-v3, atria/transducers and immer, with some forays back to react/redux, over to elm, and then learning a bit of haskell and clojure too. My brain is tired but very happy. :)

You're welcome. And wow, you are learning very fast! It is a fun journey though :)

I designed a state tree for the app with immutability in mind and it immediately became clear that the app has evolved since its inception to a point where its state is really quite static, so structs and immer::arrays seem to be the answer rather than immer::maps, at least for a first pass. Also although we have to support thousands of external device settings, the UI-related settings are really the only ones that need to be stored in immutable persistent state. Those revelations have opened up some really neat ideas of performance improvements especially related to serialization and reasoning about differences (thanks to boost.hana). Also, we now finally have a use case for C++'s pointer-to-member operator and variadic templates to make accessing and updating specific paths of the (immutable) state tree re-usable and fast. Very cool!

Yeah! That sounds to me like the right direction. The flatter are more static the data is, the better. So structs and arrays sound great if you can get away with it (and more often than one would think, you can!).

One note: basically immer::array and immer::vector are very similar, and which one to use depends on the ammount of data and the pattern of updates. Since the API is the same, it makes sense to start with arrays and later move to vectors or flex vectors depending on the use-case.

I realized that with boost.hana and given the static nature of the state tree, all state tree paths can be represented by a vector. If the type is a struct, then the int is the boost.hana tuple offset index of the member to access. If the type is immer::array, then the int is an offset into the array.

Exactly! And since everything is very flat and static, it sounds like descending such tree would be very fast. Also one note: I think boost hana also supports addressing struct members via constexpr strings. Depending on your use-case, it might make versioning the data easier.

So for serialization, we can iterate both trees and for each difference generate this vector, send it across the wire and apply the same update on the other side. The largest individual array in the tree is 27 elements, so iterating over that shouldn't really be a performance issue. And because immer::array efficiently overlaods == we can avoid going into any array if it hasn't changed. Simple and beautiful.

Completely! Sounds like you are really grasping how this paradigm works, and how the duality between state snapshots and diffing works.

I searched for "lens" in atria but didn't find anything. Has the name changed? How is this different from view_adaptors in ranges-v3?

It is a module called funken. The examples (unit tests) are kind of minimal, so it might be a bit hard to grasp how to use it in practice. Anyways, maybe you won't really need it.

Thinking about performance in this new world is hard, because the trade-offs are reversed from that of a mutable world. Specially if your system is highly concurrent, I would suspect you would actually see performance gains, even though some operations (certain classes of updates) become slower in a single-threaded scenario. You also have to get used to modelling things and programming in a different way, and there are certain pitfalls that you have to avoid (like in every programming model!). Also, the ability to safely reason locally and separation between logic and data that this promotes is a great tool when optimicing system, for one can be more playful with representation and implement "global optimizations".

I agree. A very welcome paradigm shift. Thinking in FP after thinking for years in OOP feels like finding an island after trying to tread water in a monsoon. :) For instance, one of the recent challenges in our codebase spanned three classes, each containing about two dozen lines of code, and was too complex to bother fully implementing. I just mocked up an FP solution for the same problem using ranges-v3 in about a dozen dead-simple lines of code in one function. And it is a full implementation which covers every case. Amazing.

Awesome :)

I agree it is a very important tool. In the end, C++ is a multi-paradigm language. I believe that FP comes really handy when dealing with architecture and application logic stuff, but one can still use the most imperative style in low-level code in self-contained capsules for performance. It takes time to get a good intution to when to do that. In the end, we are all learning example by example, including myself.

I am indeed curious about how your project goes. Let me know eventually how it goes, whether it is a success, but also if you find challenges or flaws in the approach.

Cheers!

JP

-- ∿∿∿ https://sinusoid.al ∿∿∿ https://sinusoid.es

bobkocisko commented 5 years ago

Thanks JP. Yeah it's a wild new world, and fun to be on the cutting edge doing it in C++. :) I appreciate your labors to explore this frontier!

Also one note: I think boost hana also supports addressing struct members via constexpr strings. Depending on your use-case, it might make versioning the data easier.

Could you elaborate how this would help with versioning? And is there a way to use those strings at runtime? I'm still green with understanding TMP-to-runtime communication and I haven't actually started using boost.hana yet.

arximboldi commented 5 years ago

Could you elaborate how this would help with versioning? And is there a way to use those strings at runtime? I'm still green with understanding TMP-to-runtime communication and I haven't actually started using boost.hana yet.

It helps with versioning if you are persisting these data-structures and cursors into the data-structure, such that the cursors become more resilient if you reorder the elements of the structs. And yes, I think you can use those strings at runtime too, just iterate over the members of the struct (at compile time or run-time), which might come super-handy to do the diffing in a generic way: https://www.boost.org/doc/libs/1_61_0/libs/hana/doc/html/group__group-Struct.html

bobkocisko commented 5 years ago

Ok that makes sense. Very cool.

On Mon, Oct 8, 2018 at 3:11 PM arximboldi notifications@github.com wrote:

Could you elaborate how this would help with versioning? And is there a way to use those strings at runtime? I'm still green with understanding TMP-to-runtime communication and I haven't actually started using boost.hana yet.

It helps with versioning if you are persisting these data-structures and cursors into the data-structure, such that the cursors become more resilient if you reorder the elements of the structs. And yes, I think you can use those strings at runtime too, just iterate over the members of the struct (at compile time or run-time), which might come super-handy to do the diffing in a generic way: https://www.boost.org/doc/libs/1_61_0/libs/hana/doc/html/group__group-Struct.html

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/arximboldi/immer/issues/58#issuecomment-427947493, or mute the thread https://github.com/notifications/unsubscribe-auth/ACoMHj--_4brzoj-AlZIhDo34ROgqiuHks5ui6M8gaJpZM4Wy8lk .