chrrasmussen / Idris2-Erlang

Erlang code generator for Idris 2

Improve IORef/Buffer implementations #4

Open chrrasmussen opened 4 years ago

chrrasmussen commented 4 years ago

Data.IORef, Data.IOArray (built on top of Data.IORef) and Data.Buffer are all abstractions that provide mutability. Implementing them in Erlang has turned out to be harder than expected (for me, at least).

I recommend not using the Data.IORef, Data.IOArray and Data.Buffer modules in your code if you plan to generate Erlang, as these primitives currently leak memory.

| Solutions considered | Problems encountered |
| --- | --- |
| ETS | Leaks memory because the run-time does not know when to remove the values. |
| Custom gen_server | Leaks memory because the run-time does not know when to remove the values. |
| Process dictionary | Leaks memory because the run-time does not know when to remove the values. |
| Atomics | Didn't find a way to resize the Atomic after it is created. |
| NIF | Read more below. |
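
The leak shared by the keyed stores above (ETS, gen_server, process dictionary) can be sketched in a few lines. This is only an illustration in Rust using the standard library, not the code generator's implementation; `Registry`, `new_ref` and `entries_after_discarding` are hypothetical names. The point is that the handle is just a key, so discarding it tells the store nothing.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a keyed store such as ETS or the process dictionary.
struct Registry {
    next_key: u64,
    values: HashMap<u64, String>,
}

impl Registry {
    fn new() -> Self {
        Registry { next_key: 0, values: HashMap::new() }
    }

    /// Stores the initial value under a fresh key; the key acts as the IORef handle.
    fn new_ref(&mut self, initial: &str) -> u64 {
        let key = self.next_key;
        self.next_key += 1;
        self.values.insert(key, initial.to_string());
        key
    }

    fn read(&self, key: u64) -> Option<&String> {
        self.values.get(&key)
    }

    fn write(&mut self, key: u64, value: &str) {
        self.values.insert(key, value.to_string());
    }

    fn live_entries(&self) -> usize {
        self.values.len()
    }
}

/// Creates `n` refs whose keys are discarded immediately and reports
/// how many entries the registry still holds afterwards.
fn entries_after_discarding(n: usize) -> usize {
    let mut registry = Registry::new();
    for i in 0..n {
        let _key = registry.new_ref(&format!("value {}", i));
        // `_key` goes out of scope here, but nothing removes the stored entry.
    }
    registry.live_entries()
}
```

Calling `entries_after_discarding(n)` returns `n`: every entry outlives its handle, which is the behaviour described in the table.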

NIF

I implemented both the IORef and Buffer types as a NIF: https://github.com/chrrasmussen/mutable_storage [Rust]

The Buffer implementation works as expected, and the IORef implementation mostly works. The blocker is that IORef can't store Erlang references, or rather, if it does, the Erlang reference gets serialised and the value pointed to by the reference is potentially garbage collected. When reading back the Erlang reference it might not point to anything, which would lead to a run-time error if accessed.

Using NIF also makes the project harder to distribute.

Possible solutions

It might be possible to solve the issue in the Idris 2 compiler (or the code generator):

  1. By applying some form of reference counting in the generated code. This would cause some extra run-time overhead.
  2. By using linear types in some way. Currently, information about linear types is not exposed to the code generators.

Or a long shot:

  1. The Erlang VM could provide a mutable box type (Same functionality as IORef)
    • Similar to Atomics, except it can store any Erlang term and keep other Erlang references alive.
    • Also similar to ETS, except the values would be garbage collected.
    • With a mutable box type, it should be possible to implement all of the primitives above (IORef, IOArray and Buffer)
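
To make the wished-for semantics concrete, here is a rough sketch of such a mutable box using Rust's reference-counted `Rc<RefCell<_>>`. It is not a BEAM feature or a proposal for its API; `Term`, `MutBox` and `inner_liveness` are hypothetical names. The sketch shows the two properties listed above: a box stored inside another box stays alive, and it is freed once nothing refers to it.

```rust
use std::cell::RefCell;
use std::rc::{Rc, Weak};

/// Hypothetical term type: plain data or a handle to another box.
enum Term {
    Text(String),
    Ref(MutBox),
}

/// A mutable box; cloning the handle shares the same cell.
#[derive(Clone)]
struct MutBox(Rc<RefCell<Term>>);

impl MutBox {
    fn new(term: Term) -> Self {
        MutBox(Rc::new(RefCell::new(term)))
    }

    /// Replaces the contents; the previous contents are dropped.
    fn write(&self, term: Term) {
        *self.0.borrow_mut() = term;
    }

    /// Returns a copy of the text, or None if the box holds a reference.
    fn read_text(&self) -> Option<String> {
        match &*self.0.borrow() {
            Term::Text(s) => Some(s.clone()),
            Term::Ref(_) => None,
        }
    }

    /// Non-owning probe used only to observe whether the box is still alive.
    fn downgrade(&self) -> Weak<RefCell<Term>> {
        Rc::downgrade(&self.0)
    }
}

/// Returns (inner alive while stored in outer, inner alive after outer is overwritten).
fn inner_liveness() -> (bool, bool) {
    let inner = MutBox::new(Term::Text("payload".to_string()));
    let probe = inner.downgrade();
    let outer = MutBox::new(Term::Ref(inner)); // outer now holds the only handle to inner
    let alive_while_stored = probe.upgrade().is_some();
    outer.write(Term::Text("replaced".to_string())); // drops the handle to inner
    let alive_after_overwrite = probe.upgrade().is_some();
    (alive_while_stored, alive_after_overwrite)
}
```

`inner_liveness()` returns `(true, false)`. Reference counting is used here only because it is what the Rust standard library offers; a VM-provided box would presumably rely on the VM's own garbage collector instead.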

Current implementation

The current implementations of Data.IORef and Data.Buffer use the process dictionary (ETS would probably work as well), which means they leak memory.

Qqwy commented 3 years ago

One 'solution' could be to implement IORef as a NIF that stores the value entirely inside the Rust struct (using e.g. serde_rustler to convert BEAM types to Rust structs). The IORef as a whole can then safely be handed to the BEAM runtime, which can GC it when necessary and, at that point, free the contents of the IORef as well.

The main disadvantage compared to storing Erlang references is obviously that we need to copy datatypes when putting them in the IORef, rather than being able to rely on Erlang's built-in reference counting. However, it should not leak memory, as the IORefs themselves are now GC'd correctly.
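
The copy-in/copy-out trade-off can be sketched as follows. This is plain Rust with the standard library only, not Rustler or serde_rustler code; `CopyingRef` and `stored_copy_is_independent` are hypothetical names, and a byte vector stands in for a converted BEAM term.

```rust
use std::sync::{Arc, Mutex};

/// Hypothetical cell that owns a private copy of whatever is written to it.
#[derive(Clone)]
struct CopyingRef {
    inner: Arc<Mutex<Vec<u8>>>,
}

impl CopyingRef {
    fn new(initial: &[u8]) -> Self {
        CopyingRef { inner: Arc::new(Mutex::new(initial.to_vec())) }
    }

    /// Replaces the stored bytes with a copy of `value`.
    fn write(&self, value: &[u8]) {
        *self.inner.lock().unwrap() = value.to_vec();
    }

    /// Returns a fresh copy of the stored bytes.
    fn read(&self) -> Vec<u8> {
        self.inner.lock().unwrap().clone()
    }
}

/// Shows that the cell is unaffected by later changes to the caller's data.
fn stored_copy_is_independent() -> bool {
    let mut original = b"hello".to_vec();
    let cell = CopyingRef::new(&original);
    original[0] = b'j'; // mutates the caller's copy only
    cell.read() == b"hello".to_vec()
}
```

Every `write` and `read` pays for a full copy, which is the cost mentioned above; in exchange, the stored bytes are freed as soon as the last clone of the handle is dropped.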

Qqwy commented 3 years ago

Interestingly, Data.Buffer could probably be implemented very easily using the process dictionary, ETS or one of the other options above, since, as far as I can see, Idris2 currently requires buffers to be freed manually using the freeBuffer primitive.

Qqwy commented 3 years ago

Another question: Why is it necessary to be able to resize the internals of an IORef? If this is not actually necessary you might be able to use :atomics after all.

chrrasmussen commented 3 years ago

Thanks for all your suggestions! 😃 I need to look more into them. For now, I will try to answer your questions with my current understanding.

Data.IORef

> The blocker is that IORef can't store Erlang references, or rather, if it does, the Erlang reference gets serialised and the value pointed to by the reference is potentially garbage collected. When reading back the Erlang reference it might not point to anything, which would lead to a run-time error if accessed.

To illustrate this problem in terms of Erlang, I am using atomics in this example, but it applies to any reference. The following example works as one would expect (returning the value 0). Run it in the Erlang REPL:

Ref = atomics:new(1, []).
atomics:get(Ref, 1).

The problem is that once the reference has been serialised, no live references to the atomics array remain, so it gets garbage collected.

SerialisedRef = term_to_binary(atomics:new(1, [])).
atomics:get(binary_to_term(SerialisedRef), 1).

Results in an error:

** exception error: bad argument
     in function  atomics:get/2
        called as atomics:get(#Ref<0.534261318.1692008460.78940>,1)

Resizing IORef

The way Data.IORef works is that the value it contains can be changed at any later point. The size of the value that is stored in the IORef might vary wildly. There is no upper bound on how much data can be stored in the IORef, which means the underlying storage needs to be resized if it is too small.

import Data.IORef

main : IO ()
main = do
  ref <- newIORef "small string"

  smallStr <- readIORef ref
  putStrLn smallStr

  writeIORef ref "very long string. very long string"
  longStr <- readIORef ref
  putStrLn longStr

Prints:

small string
very long string. very long string

Both strings are written to the same IORef reference. I considered creating a new, larger atomics array when the value no longer fits, but that would also produce a new Erlang reference, so anything still holding the old reference would not see the update.
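
The size problem can be sketched with a fixed-capacity cell. This is only a loose analogy in Rust, not how atomics are implemented; `FixedCell`, its 16-byte capacity and `write_small_then_long` are hypothetical choices made for the illustration.

```rust
/// Hypothetical fixed-capacity cell, loosely analogous to storage whose size is set at creation.
struct FixedCell {
    bytes: [u8; 16],
    len: usize,
}

impl FixedCell {
    fn new() -> Self {
        FixedCell { bytes: [0; 16], len: 0 }
    }

    /// Stores `value` if it fits; otherwise returns the number of bytes it would need.
    fn write(&mut self, value: &str) -> Result<(), usize> {
        let raw = value.as_bytes();
        if raw.len() > self.bytes.len() {
            return Err(raw.len());
        }
        self.bytes[..raw.len()].copy_from_slice(raw);
        self.len = raw.len();
        Ok(())
    }

    /// Returns the currently stored text.
    fn read(&self) -> String {
        String::from_utf8_lossy(&self.bytes[..self.len]).into_owned()
    }
}

/// Replays the two writes from the Idris example against the fixed cell and
/// reports whether each one succeeded.
fn write_small_then_long() -> (bool, bool) {
    let mut cell = FixedCell::new();
    let small_ok = cell.write("small string").is_ok();
    let long_ok = cell.write("very long string. very long string").is_ok();
    (small_ok, long_ok)
}
```

The first write fits and the second does not, so the cell would have to be replaced by a larger one, which is exactly the step that changes the reference.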

Data.Buffer

A quick search of the Idris 2 source code indicates that the freeBuffer function is not used. If I remember correctly, the Buffer implementation was changed to use a C implementation at one point (instead of the Scheme implementation), but it was later changed back to use the Scheme implementation. I think freeBuffer is a remnant from that change.

Qqwy commented 3 years ago

> There is no upper bound on how much data can be stored in the IORef, which means it needs to be resized if it is too small.

I see. For some reason I thought that it would always itself contain a reference, but in hindsight that does not make any sense. That is indeed a clear reason why :atomics would not work.

Also thank you for more information about Data.Buffer.

I expect that the easiest way forward would then be to implement it as a NIF. After looking thoroughly at the documentation of Erlang's NIFs, I found out that it is possible to pass a destructor function when creating a custom resource type. Looking deeper inside Rustler, it seems that we can create our own cross-process reference-counted box for arbitrary data by using Rustler's existing ResourceArc. It probably already does what we need:

As far as I can see, ResourceArc will not serialize what is to be stored inside, but will instead increment the reference count of the thing it contains on construction and decrement it on destruction. Its own destruction is of course triggered iff all of the references to the ResourceArc are themselves GC'd by the Erlang VM.

chrrasmussen commented 3 years ago

Thanks again! If it is possible to avoid serializing the data from Erlang, that might work! My current implementation of IORef is built on top of Buffer, which means the data from Erlang is serialized.

I added a small test file that reproduces the error (in branch ioref-test).

My Rust skills are not really up to par. If you would like to give it a try, I would be very happy 😊

I had to jump through some hoops in order for the :atomics reference to be deallocated. You can run it using: mix run ioref_test.exs

hansihe commented 3 years ago

Author of Rustler here, just dropping in to give some context/clarifications.

What NIF resources do is allow you to opaquely store data inside a handle that is managed and garbage collected by the Erlang VM. They pretty much only wrap a pointer, and do not allow storing terms inside by themselves.

When using ResourceArc (our wrapper for resources) in Rustler, you can simply implement Drop for your inner type, and do whatever you need to do when the type inside the ResourceArc is dropped. Drop is a standard Rust trait; we just call the Drop implementation when the BEAM calls the destructor.
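
The Drop mechanism can be sketched with the standard library alone. Here `std::sync::Arc` stands in for ResourceArc, which is an assumption made for illustration: no Rustler API is used, and `Payload` and `drop_counts` are hypothetical names. The sketch shows that the destructor runs once, when the last handle goes away.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Hypothetical payload that records how often its destructor has run.
struct Payload {
    drops: Arc<AtomicUsize>,
}

impl Drop for Payload {
    fn drop(&mut self) {
        // Cleanup hook; a real resource would release whatever it owns here.
        self.drops.fetch_add(1, Ordering::SeqCst);
    }
}

/// Returns the drop count while one handle is still alive,
/// and again after the last handle has been dropped.
fn drop_counts() -> (usize, usize) {
    let drops = Arc::new(AtomicUsize::new(0));
    let first = Arc::new(Payload { drops: Arc::clone(&drops) });
    let second = Arc::clone(&first); // second handle to the same payload
    drop(first);
    let while_shared = drops.load(Ordering::SeqCst);
    drop(second); // last handle gone, so Payload::drop runs now
    let after_last = drops.load(Ordering::SeqCst);
    (while_shared, after_last)
}
```

`drop_counts()` returns `(0, 1)`. With a real resource, the role of the final `drop` would be played by the BEAM garbage collecting the last reference to the handle.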

Interestingly, there is actually a way to own and store terms in native data structures in NIFs, and that's owned environments. However, the caveat here is that this requires copying terms into and out of that owned environment whenever you want to pass it to/from the process the NIF runs in. There is also no way to deallocate individual terms from the owned env without clearing the whole thing.
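
The two caveats (copying in and out, and all-or-nothing deallocation) can be sketched with a simple arena. This is an analogy in plain Rust, not the owned-environment API; `TermArena` and `terms_after_overwrites` are hypothetical names, and strings stand in for terms.

```rust
/// Hypothetical arena standing in for an owned environment.
struct TermArena {
    terms: Vec<String>,
}

impl TermArena {
    fn new() -> Self {
        TermArena { terms: Vec::new() }
    }

    /// Copies a term into the arena and returns its slot index.
    fn copy_in(&mut self, term: &str) -> usize {
        self.terms.push(term.to_string());
        self.terms.len() - 1
    }

    /// Copies a term back out of the arena.
    fn copy_out(&self, slot: usize) -> Option<String> {
        self.terms.get(slot).cloned()
    }

    /// The only way to release memory: every term is dropped at once.
    fn clear(&mut self) {
        self.terms.clear();
    }

    fn len(&self) -> usize {
        self.terms.len()
    }
}

/// Simulates `n` overwrites of one logical cell. Each overwrite copies in a new
/// term, and the superseded copies stay in the arena until it is cleared.
fn terms_after_overwrites(n: usize) -> (usize, Option<String>) {
    let mut arena = TermArena::new();
    let mut current = arena.copy_in("initial");
    for i in 0..n {
        current = arena.copy_in(&format!("update {}", i));
    }
    (arena.len(), arena.copy_out(current))
}
```

After `n` overwrites the arena holds `n + 1` terms even though only the latest one is reachable, so a cell built this way would need to clear and repopulate the arena from time to time.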

If you simply need to serialize a term as a binary, the most performant and simple way would be to use Term::to_binary, which will encode the term in the ETF format. This supports all term types, but you would still have a problem with references getting GCd.

chrrasmussen commented 3 years ago

Thank you for the insights, @hansihe! ❤️

Also thanks for making Rustler! It was very easy to get it going, even for someone who was completely new to Rust.

Qqwy commented 3 years ago

By the way, IORef and Buffer are very low-level primitives which Idris2 uses for increased efficiency. However, I think that it might very well be possible that a couple of datatypes which Idris2 builds on top of IORefs/Buffers would actually be more efficient in Erlang when implemented in their more 'natural' way, since Erlang itself already performs a lot of efficiency-optimizations for e.g. small binaries vs large binaries, small maps vs large maps, iolists, etc.

chrrasmussen commented 3 years ago

@Qqwy That's true.

I think the biggest reason for supporting IORef and Buffer is that they might be used in some Idris 2 libraries. Looking at the modules in the Idris 2 libraries (prelude, base, contrib) I found the following usages:

Another reason is that IORef and Buffer are used in the Idris 2 compiler. The Idris 2 compiler is already running on the BEAM, but it would be even nicer if the Erlang version came close to the performance of the Chez Scheme version and did not leak memory. With that said, there might be other ways to achieve this, for example by rewriting the parts of the compiler that use IORef and Buffer. Rewriting just these parts might not be sufficient though.

In general, I would say that IORef and Buffer are not needed. When writing Idris 2 code that is intended for Erlang there are also other options, like using ETS, GenServer etc.