`eqsat` issue roundup - Githubissues

I've been using impure-containers in my eqsat project, and I've come up with some issues that you may be interested in. I'm going to comment on this issue with each one, to avoid spamming you. Hackage is sorely lacking a good consistent package for mutable containers, and impure-containers is the best thing I've seen so far :-), so I'd like to see some improvements (and will likely write up PRs after I graduate on May 10).

cc: @chessai

I've been writing some mutable containers that you may want to add to impure-containers:

EqSat.Internal.MHashMap (master, permalink) is a wrapper over Data.HashMap.Mutable.Basic that adds forM_, insertWith, length, member, null, freeze, and thaw.
EqSat.Internal.MHashSet (master, permalink) is a newtype for MHashMap k () with a similar API to EqSat.Internal.MHashMap.
EqSat.Internal.MStack (master, permalink) is a mutable stack that is contiguous in memory (as opposed to most stack implementations on Hackage, which tend to be something like newtype MStack s a = MStack (MutVar s [a]). This is pretty close to your Data.ArrayList.Generic module, except it has more functions (e.g.: pop, thaw), better documentation, and it reallocates when the size of the underlying vector is double the number of actual elements in the stack (it checks each time you pop). This module is pretty close to being ready for inclusion in impure-containers. The only issue that isn't performance-related I can think of is that newWithCapacity is kind of useless since there is no way to pop without compacting.
EqSat.Internal.MBitmap (master, permalink) is a monomorphic wrapper over the MutableBitmap type defined in foundation. It needs a little documentation / cleanup work, but it has some nice property tests using hedgehog.

Since not many people depend on impure-containers yet, perhaps it would be worth revisiting the module structure. I think it's nicer to have a less-nested module hierarchy, where the names at the leaf nodes are always the same as the type that module exports (so the module exporting MHashMap is called Foo.Bar.MHashMap). I think it would be nice to adopt a convention where the mutable version of a structure is called MFoo and the immutable version is called IFoo (MutFoo/ImmFoo would also be fine), though I can understand if you think that's not worth pursuing. I'm thinking that the renamed module hierarchy could look like this (Foo ↦ Bar means that that the datatype currently defined as Foo should be renamed to Bar):

ImpureContainers.MArrayListB: ArrayList specialized for Data.Vector.MVector and renamed to MArrayListB.
ImpureContainers.MArrayListU: ArrayList specialized for Data.Vector.Unboxed.MVector and renamed to MArrayListU.
ImpureContainers.MArrayListP: ArrayList specialized for Data.Vector.Primitive.MVector and renamed to MArrayListP.
ImpureContainers.MArrayListS: ArrayList specialized for Data.Vector.Storable.MVector and renamed to MArrayListS.
ImpureContainers.MArrayListG: ArrayList ↦ MArrayListG.
ImpureContainers.IGraph: Graph ↦ IGraph, SomeGraph ↦ SomeIGraph.
ImpureContainers.IGraph.IVertices: Vertices ↦ IVertices, vertices*
ImpureContainers.MGraph: MGraph, insert*
ImpureContainers.MGraph.MBVertices: MVertices ↦ MBVertices, verticesReplicate, verticesWrite, verticesRead.
ImpureContainers.MGraph.MUVertices: MUVertices, verticesUReplicate, verticesUWrite, verticesURead.
ImpureContainers.GraphTypes: reexports ImpureContainers.GraphTypes.*.*, IGraph, IVertices, MGraph, MBVertices, MUVertices, defines type MGraphIO = MGraph RealWorld and type MGraphST s = MGraph s.
ImpureContainers.GraphTypes.Vertex: Vertex, vertexInt.
ImpureContainers.GraphTypes.Size: Size, sizeInt.
ImpureContainers.MHashMap: MHashMap and its functions.
ImpureContainers.MHeapC: RawHeap ↦ MHeapC and its functions.
ImpureContainers.MHeapD: Heap ↦ MHeapD and its functions.
ImpureContainers.UnsafeMaybe: UnsafeMaybe and its functions.
ImpureContainers.MMaybeArray: MutableMaybeArray ↦ MMaybeArray and its functions.
ImpureContainers.BoolByte: BoolByte.
ImpureContainers.MMaybeVar: MutMaybeVar ↦ MMaybeVar.
ImpureContainers.MPrimArray: MutablePrimArray ↦ MPrimArray and its functions.
ImpureContainers.ITrie: Trie ↦ ITrie and its functions.
ImpureContainers.MTrie: MTrie and its functions.

I'd also like to adopt a convention (that you already seem to be adopting in some modules) where the user is expected to import these modules qualified, so we don't have to suffix the name of the type onto the function (e.g.: newPrimArray just becomes new and the user is expected to use it as MPrimArray.new). If that sounds inconvenient to you, I'd also be willing to have a top-level ImpureContainers module that reexports all of the types and wrapper functions (so ImpureContainers.MPrimArray would export MPrimArray, new, etc. whereas ImpureContainers would export MPrimArray, newMPrimArray, etc.).

The MTrie type seems bad to me because it causes problems with the garbage collector (too many MutVars). One idea for a solution to this that I had is a Haskell implementation of malloc/free for a MutableByteArray#, so that you can keep the trie in one contiguous blob of memory for cache coherence (and, if the trie is not too big, you could use Word32 offsets instead of pointers for compactness, though the lack of a Word32# means this may slow down some operations). You could also maybe use compact regions, though I suspect that it would be slower than a custom allocator.

First of all, some broader context around this library. As you can see by the commit history, I've not done much with this in the last two years. The only decent thing that this library offers is the Graph-related stuff (mostly because unlike every other graph library on hackage, it allows you to keep track of the fact that a vertex belongs to a graph). The heap described as "model C" is pretty cool too, and fortunately, the highly-useful PrimArray got merged into primitive recently (not released to hackage yet). I consider everything else trash (or nearly trash). As you point out, MTrie has disastrously bad performance. I used it once to build a trie that I needed to put several million entries into, and the garbage collector just ran 100% of the time. So, really bad.

But there's something else flawed with this whole approach to mutable data structures: it makes GHC generate suboptimal code. Let's look an abbreviated, specialized version of your MStack type and the interface surrounding it:

data MStack s a
  = UnsafeMkMStack
       !(MutVar s Int) -- length
       !(MutVar s Int) -- vector capacity
       !(MutVar s (MutableArray s a))
new :: ST s (MStack s a)
push :: MStack s a -> a -> ST s ()
pop :: MStack s a -> b -> (a -> b) -> ST s b

Cool. What's bad is that we have to put everything behind a MutVar. If someone wants to use a stack like this in a tight loop, they might do this:

myFunc :: MStack s Int -> ST s ()
myFunc !s = go 1000 where
  go n = if n > 0
    then do
      foo <- pop s 42 id
      ... -- do something with foo, possibly pushing more onto the stack 
      go (n - 1)
    else return ()

Every call to push/pop causes readMutVar or writeMutVar to get called, which means that stuff gets written to memory. It's unfortunate because, normally, in a tight loop like this, GHC would end up generating a nice tight loop that just puts the arguments in registers. In fact, we can make it do this by making the interface less safe:

data MStack s a
  = UnsafeMkMStack
       !Int -- length
       !Int -- vector capacity
       !(MutableArray s a)
new :: ST s (MStack s a)
push :: MStack s a -> a -> ST s (MStack s a)
pop :: MStack s a -> b -> (a -> b) -> ST s (MStack s a, b)

Of course, now push and pop have an unenforceable invariant: The argument MStack must not be reused. Now, myFunc would become:

myFunc :: MStack s Int -> ST s (MStack s Int)
myFunc = go 1000 where
  go n !s1 = if n > 0
    then do
      (s2,foo) <- pop s 42 id
      ... -- do something with foo, possibly pushing more onto the stack 
      go (n - 1) s3
    else return ()

And myFunc now has the whole "this in linear in the MStack argument" thing going on as well. Again, unenforced by the type system, and users are bound to mess up. But, the generated code is much better. By this point, you probably see where I'm going. The interface I really want depends on the successful implementation of the Linear Haskell proposal:

data Stack a = UnsafeStack !Int !Int 
       !Int -- length
       !Int -- vector capacity
       !(MArray a)
new :: (Stack a ⊸ Unrestricted b) -> b
push :: Stack a ⊸ a -> Stack a
pop :: Stack a ⊸ b -> (a -> b) -> (Stack s a, Unrestricted b)

It has the Linear Haskell's tell-tale CPS jankery, but it's both fast and safe. And it doesn't have to concern itself with the PrimMonad weirdness. This is the interface I actually desire for all the mutable data structure I use in practice.

Aside from the fact that I want to destroy this package and build a linear-containers package once Linear Haskell lands, I agree with all of your suggested changes. The name changes sound good, and the additions to the API are good too. I don't mind breaking changes to the API since I'm pretty sure no one uses this package but us. I'm going to give you a commit bit to this repo, so feel free to do any of the things you described.

One more point I'll add about the whole linear-vs-ST thing. The linear implementation certainly will win if you have a single stack that you're just passing through the whole program (especially if it goes through a bunch of tail recursive functions like in the example above). Where it gets more interesting (and I have absolutely no real data on this, only thoughts) is where you have stacks that are a part of other data structures, and the other data structures are used in such a way that GHC isn't able to optimize away all the data constructor allocations like it could in the example I gave above. Now, modifications to a Stack will sometimes cost something at runtime, since an allocation actually happens for the data constructor. But, your ST implementation never allocates (actually, since you use MutVar#, which boxes its argument, it allocates for the I# data constructor every time you do anything, but if you used something like PrimRef from my deprecated prim-array package, it wouldn't). I wonder if there is any code for which the occasional allocations of the linear variant outweigh the bad CPU register utilization of the ST variant.

And of course, no discussion of this issue would be complete without mentioning the relevance of the frighteningly longstanding Mutable Constructor Fields Proposal.

Ah, yes, I was thinking that MADTs would improve things. I was frustrated by the boxed-ness of MutVar#, is there a non-deprecated replacement for PrimRef, or would I just need to hack it together with a MutableByteArray#?

You should probably just copy the whole PrimRef module into impure-containers. There's nothing really wrong with it. It's just that a package named prim-array is sort of goofy now that the type is in primitive instead. And, of course, impure-containers rolls its own copy of PrimArray as well since I never got around to making it depend on prim-array.

andrewthad / impure-containers

`eqsat` issue roundup #7