Gabriella439 / foldl

Composable, streaming, and efficient left folds
BSD 3-Clause "New" or "Revised" License
159 stars 51 forks source link

Direction of `foldl` library #10

Closed Gabriella439 closed 10 years ago

Gabriella439 commented 10 years ago

This is partially an update and partially a solicitation for opinions on certain design choices for this library.

So I've designed pipes libraries to not directlyl depend on foldl so that I can be much freer about adding dependencies to foldl. The reason why is that these other issues have indicated that foldl may require many other dependencies including, but not limited to:

I wanted to see if people would be fine with the above dependencies or if they had other dependencies that they thought might be worth adding. I'd like to develop this into a more fully featured library.

Also, I've decided to freeze the core type (except perhaps to unify FoldM and Fold into a single type as @michaelt suggested), so there won't be any new changes on that front. I'm pretty happy with its current functionality.

Shimuuar commented 10 years ago

I think there're a lot of thing to be explored. So it's way too early to say that design is frozen. Here is most important in my opinion:

  1. Many fold accumulators admit efficient merge (element count, sum, max/min, mean etc.). For example if we have sums of two different samples we can get sum of their union by adding two sums. It's not clear for me how to integrate it to foldl library.
  2. Some algorithms require several passes over data. For example numerically stable algorithms for variance and higher central moments require estimate of mean as input.
  3. Transformations of input stream. As soon as we start working with data which more complicated than bunch of numbers mapping and filtering becoms indispensable.
  4. Common interface for data samples. There's wide range of different data types over which we want to fold. It would be quite nice to unify them under same API. And Foldable wont work because some data types are monomorphic (text,bytestring) and some not polymorphic enough (unboxed vectors etc.)

With regard to profunctors I'm not sure that instance is useful enough to warrant a dependency. And if we add monads to fold it becomes impossible to define comonad instances. We lose extend but not duplicate. So no comonad instance for FoldM.

If we to add dependency on text and bytestring we should add vector as well

michaelt commented 10 years ago

The question of dependencies was somewhat impeding my own thinking what it might be useful to do. A lot seemed to turn on whether or how you were going to introduce the Fold/M type into the pipes universe. It is clear you want to keep pipes itself low on dependencies. The 'lenses' branch of pipes-parse that was linked on the list imported folds when I looked, but it has been taken out again, if I follow.

It is slightly annoying, but maybe one should concede that there will be a couple of packages; the trouble I had thinking about this in the past was how to divide them.

One of them might be something like foldl-instances with fold functions for things like bytestring, text, vector, maybe Producer and FreeT (Producer ..), and various base types. instances is the wrong word for those of course, but this might be a place for things like a profunctor instance, if that seems reasonable.

Another foldl-extras -ish idea would be to include more interesting and perhaps dangerous Folds like the (pointlessly hyperspecialized) ones in the statistics package -- an optimized average, the diverse sorts of mean, histogram folds etc. It's plain Shimuuar has seen the connection (and has actual experience with the material.) Perhaps such a package would be the place for text/bytestring/vector fold functions?

I do think that the pre-packaged folds are what makes this library distinctive and potentially widely used. The present library has a rather austere collection of folds but these are enough to impart the fundamental idea. One pedagogically important point is missing -- namely the obvious one that you apply the same Folds to diverse sequential types. It might make sense to add fold functions for a few other base types, and maybe also the general Foldable application, with remarks on its mixed merits? (I cant remember where it had trouble but it did)

I havent thought about it but there will obviously be fold functions for things like conduit and so forth. But I guess micro-packages like foldl-conduit or conduit-foldl are what one does. So perhaps also foldl-pipes? (Maybe one should brazenly add a pipes-dependency for foldl since pipes itself has so few and then add Producer and FreeT Producer folds?) Maybe I will make a couple such little packages to see what they look like. Note that there will need to be a specialized Char/Word8 fold functions for Text and ByteString producers.

Shimuuar commented 10 years ago

Actually I consider foldl as possible foundation for calculation of mean/variance etc in statistics package. Current versions could only work with vectors which isn't satisfactory.

Gabriella439 commented 10 years ago

Yes, one of the things on my TODO list is to create numerically stable statistics like those found in the statistics package

Shimuuar commented 10 years ago

I've created experiment branch in my fork to play with possible designs. It seems that rabbit hole is quite deep.

michaelt commented 10 years ago

By the way, I'm not sure a comonad instance for FoldM is out of the question, not that I'm sure what use it is. Is there an obvious objection to this type? :

  data FoldM i m o = forall s . FoldM !s (s -> o) (s -> i -> m s) 

It has an obvious extract; I'm not sure the natural duplicate or extend follows all the laws. It seems to me that this type is somewhat recommended by a couple of simple analogies. If we start with the functor

  data Request i o s = Request o (i -> s)

i.e., (o, i -> s), then we can redescribe Fold i o as

  data Fold i o = forall s . Fold s (s -> Request i o s)

which is the 'co-church' encoding corresponding to the 'normal' recursive type:

  data Fold i o = Fold {runFold :: Request i o (Fold i o)}

How do we insert a monadic aspect into a standard recursive type like that? Most respectably, I guess:

  data FoldM i m o = FoldM {runFoldM :: m (Request i o (FoldM i m o)}     --  call this YYYYYYYYY

and less respectably:

  data FoldM i m o = FoldM {runFoldM :: Request i o (m (FoldM i m o))}      --  XXXXXXXXX

I am trying to arrange an obvious similarity between FoldM, thus explained, and something like Server a b m Void or Client a b m Void (or maybe I mean forall r . Server a b m r?)

Anyway, the non-recursive encodings corresponding to those two choices are

 data FoldM i m o = forall s . FoldM s (s -> m (Request i o s))

and

 data FoldM i m o = forall s . FoldM s (s -> Request i o (m s))

respectively. Working back now toward more familiar types, these in turn are equivalent to

 data FoldM i m o = forall s . FoldM s (s -> m (o, i -> s))     --- WWWWWWW

and

 data FoldM i m o = forall s . FoldM s (s -> (o, i -> m s))

respectively. The latter of these is then basically

 data FoldM i m o = forall s . FoldM s (s -> o) (s -> i -> m s) 

Reasons favoring something like WWWWW over this last one will be like reasons for preferring the one I marked YYYYYY over the one marked XXXXX. It looks like there may be delicate points like those we find the current Proxy definition, which avoids monadicizing everything as it used to and as it would if it were defined with FreeT on some base functor.

Or anyway, I was wondering about this.

Shimuuar commented 10 years ago

By the way, I'm not sure a comonad instance for FoldM is out of the question, not that I'm sure what use it is. Is there an obvious objection to this type? :

data FoldM i m o = forall s . FoldM !s (s -> o) (s -> i -> m s)

There is. It doesn't work nicely with stateful accumulators (e.g histograms). One need monadic action to convert mutable vector to immutable one.

On the other hand it's possible to have something comonad-like with extractM :: FoldM m a b → m b. I think it will obey comonadic laws (maybe those will need small adjustment)

pthariensflame commented 10 years ago

@Shimuuar Perhaps a comonad in the category of Kleisli arrows over m?

Shimuuar commented 10 years ago

No idea. I'm not well versed in generalized abstract nonsence (-:

michaelt commented 10 years ago

I see; I don't know how these histogram folds work. I take it the problem with a FoldM m a (m b) is that the applicative combination with a FoldM m a c is unpleasant? My main thought had to do with finding the 'right type' in this business, and making things easy for the compiler to optimize by de-monadicizing (?) as much as possible. Admittedly the s -> m o function isn't particularly problematic. The analysis above occurred to me trying to figure out https://github.com/Gabriel439/Haskell-Foldl-Library/pull/3 ; one solution is a Skip-constructor-like solution. But such things are likely to complicate the types too much.

Gabriella439 commented 10 years ago

So, the comonad instance is pretty low on my list of priorities. I can wait until somebody actually needs it.

Shimuuar commented 10 years ago

@michaelt On second thought it would be more problem of style. Consider following code:

data FoldM i m o = forall s . FoldM !s (s -> o) (s -> i -> m s)

someFold = IO (FoldM Int IO (IO Int))
someFold = do
  ref <- newIORef 0
  return $ FoldM ref readIORef (\_ n -> modifyIORef ref (+n) >> return ref)

Here we're forced to wrap return type into monad. It's more of incovenience but I don't think comonad instance worth it.

I think this data type does form a comonad as you porposed. Problem arise with cokleisli arrows. They have type: FoldM a m b → b but now we have only single definition which polymorphic in mextract. Any attempt to fold something into FoldM will yield m (FoldM a m b)

Gabriella439 commented 10 years ago

Yeah, I agree that the Comonad instance is probably not worth the trouble for now.

I also wanted to mention that I just setup the pipes-foldl library on Github. You can find it here:

https://github.com/Gabriel439/Haskell-Pipes-Foldl-Library

Here's some example code for how you would create a wc-like program to count lines (using the lenses branches of pipes-parse and pipes-bytestring):

import Control.Applicative (pure)
import Lens.Family ((^.))
import Control.Foldl (sum)
import Pipes.Foldl (fold, folds)
import Pipes.ByteString (stdin, lines)
import Prelude hiding (lines, sum)

main = do
    numLines <- fold sum (folds (pure 1) (stdin ^. lines))
    print numLines

Note that if you added a Num instance for Fold then you could replace the pure 1 with just 1. Also, note that you can use Pipes.Prelude.sum instead of fold sum, but I just wanted to show off the use of the library.

michaelt commented 10 years ago

@Shimuuar, my remark above wasn't really aimed at a comonad instance but at possible simplification, perhaps in aid of assimilating Fold and FoldM. It was ill-considered though, since I wasn't taking account of the use of FoldM with mutable and ref types. Your argument that these are among the best uses of FoldM is completely convincing.

michaelt commented 10 years ago

@Gabriel439 I made a similar package and a few others, in order to think out how dependencies were supposed to work. Here are a couple of the others, for what they're worth https://github.com/michaelt/foldl-conduit https://github.com/michaelt/foldl-instances The second suggests a type class like the one Shimuuar envisaged somewhere in the discussion. The conduit one I had only started thinking about; it is just holding the place of something outside the system, for purposes of thinking about dependencies.

Shimuuar commented 10 years ago

Fold vs FoldM

Well it's always possible to convert Fold toFoldM:

toFoldM :: Monad m ⇒ Fold a b → FoldM m a b
toFoldM (Fold f x0 out) = FoldM (\x a → return (f x a)) x0 (return . out)

It means that we can easily replace Fold with FoldM. The only question is whether we lose anything except possibly performance. I think no. Fold have no interesting structure. It's special case of FoldM where monad doesn't do anything of interest. If we add monoidal structure to the accumulator (see #11 for details) and I'm not sure whether it's possible to generalize monoidal fold to use monads.

Data sample representation

Recently I found rather nice way to encode any kind of data sample:

newtype Sample a = Sample (∀ r. Fold a r → r)

It encapsulates both data and folding function. It also form monoid so it's possible to concatentate any container whatever data representation it uses.

instance Monoid (Sample a) where
  mempty = Sample extract
  mappend (Sample f) (Sample g) = Sample (f =<= g) -- Cokleisli arrow

Now it's possible to define type class for arbitrary data sample:

class IsSample s where
  type Elem s :: *
  toSample :: s → Sample (Elem s)

So far I ignored monadic folds but it's quite simple to accomodate them too. Since we require code to work with any monad we can specialize to Identity and work with pure folds just as well.

newtype Sample a = Sample (∀ r m. Monad m => Fold m a r → r)
Gabriella439 commented 10 years ago

So I just added purely and impurely which mean we don't actually need pipes-foldl any longer. The basic idea is that they take any function written to accept a step function, initial accumulator, and extraction function, and convert it to a function that accepts Folds or FoldMs. This means that both pipes and conduit can add foldl-compatible functions without incurring a foldl dependency. As a bonus, it downgrades gracefully in the scenario where the user does not want to use foldl since the un-upgraded functions are still usable on their own.

This also inadvertently solved another problem. The main reason I had two separate types was that I thought there was no convenient way to unpack a FoldM as if it were a pure Fold, but now if we merge the two types into a single type I can just rewrite the purely function to automate all the Identity wrapping and unwrapping so that it is transparent for the user.

I rememember @michaelt mentioned previously that he had profiled Fold implemented in terms of FoldM and said that it ran the same and did not leak space. Do you still have those tests? If it really is equally efficient then I'd be happy merging the two types at this point, and probably adding the following two type synonyms:

type Fold = FoldM Identity

type Fold'  a b = forall m . (Monad m) => FoldM m a b
Gabriella439 commented 10 years ago

I'm writing a separate comment to address some points brought up by @Shimuuar in his first comment that I didn't have time to address until now.

Many fold accumulators admit efficient merge (element count, sum, max/min, mean etc.). For example if we have sums of two different samples we can get sum of their union by adding two sums. It's not clear for me how to integrate it to foldl library.

Can you elaborate on this? Wouldn't liftA2 (+) be good enough for this purpose or did you have something else in mind?

Some algorithms require several passes over data. For example numerically stable algorithms for variance and higher central moments require estimate of mean as input.

Just have those folds take the mean estimate as the parameter and let the user decide how to derive that estimate (either by folding the data in a separate pass themselves or by obtaining the estimate through some other means).

Transformations of input stream. As soon as we start working with data which more complicated than bunch of numbers mapping and filtering becomes indispensable.

Alright. I still have an issue for this open because I will probably add this.

Common interface for data samples. There's wide range of different data types over which we want to fold. It would be quite nice to unify them under same API. And Foldable wont work because some data types are monomorphic (text,bytestring) and some not polymorphic enough (unboxed vectors etc.)

I've already addressed this by adding monomorphic folds for text and bytestring. I'd prefer to just add monomorphic type-specific folds rather than unify them under the same API. Even if I were to unify them under the same API the problem of unifying the Text and ByteString interfaces is outside the scope of the foldl library.

With regard to profunctors I'm not sure that instance is useful enough to warrant a dependency. And if we add monads to fold it becomes impossible to define comonad instances. We lose extend but not duplicate. So no comonad instance for FoldM.

Right now the balance I'm trying to strike dependency-wise for foldl is to only depend on packages in the Haskell platform, so that means that I will wait to add a profunctors dependency until Edward gets profunctors into the platform.

If we to add dependency on text and bytestring we should add vector as well

Well, what I think I will do is just upgrade fold and foldM to use a Foldable constraint even if there is a slight performance penalty. It feels like the right thing to do in the long run and it's a backwards-compatible change. That would cover boxed vectors. Then I can add a foldVector function for unboxed vectors. I don't mind the dependency on vector at this point.

Gabriella439 commented 10 years ago

Regarding Sample, I think it doesn't buy you anything that the Comonad instance does not already give you, since the Comonad operators are strictly more general. I would prefer to wait on comonad to get into the Haskell platform rather than add a Sample type.

Gabriella439 commented 10 years ago

Never mind, I will provide separate Foldable functions because I just remembered how slow the Foldable versions are.

Shimuuar commented 10 years ago

On 12/29/2013 12:02 AM, Gabriel Gonzalez wrote:

Can you elaborate on this? Wouldn't |liftA2 (+)| be good enough for this purpose or did you have something else in mind?

Something else. If you counted number of elements in two lists: xs and ys then you can calculate nubmer of elements in xs++ys by summing two answers. Same goes for sum, maximum, etc. I hope it's clear enough

It's actually quite powerful thing. It allows to process different chunks of data independently whether to gain parallelism or because parts of data are of interest themselves. For example one performed N experiments and interested not only in aggregate statistics but whether experimets give compatible results. He could calculate answers for each experiment and then merge results without traversing data second time.

It however makes thing much more complicated.

Just have those folds take the mean estimate as the parameter and let the user decide how to derive that estimate (either by folding the data in a separate pass themselves or by obtaining the estimate through some other means).

If we're speaking about statistical library it's completely unacceptable from usability PoV. Such simple and basic operation must be simple one-liner.

Shimuuar commented 10 years ago

On 12/29/2013 12:04 AM, Gabriel Gonzalez wrote:

Regarding |Sample|, I think it doesn't buy you anything that the |Comonad| instance does not already give you, since the |Comonad| operators are strictly more general. I would prefer to wait on |comonad| to get into the Haskell platform rather than add a |Sample| type.

I disagree it allows to write function which will work with different types of containers. It becomes posible to write function which calculate mean for list/vector/histogram

Monomorphic folds (for bytestring/text/etc.) allows to apply different folds to single type of data. This type class allows opposite thing take single fold and use with different types of data containers.

Gabriella439 commented 10 years ago

Something else. If you counted number of elements in two lists: xs and ys then you can calculate nubmer of elements in xs++ys by summing two answers. Same goes for sum, maximum, etc. I hope it's clear enough

I understand now. This is only possible if we change the internal implementation of Fold back to the original formulation that I had in the first draft of this library:

data Fold a b = forall x . (Monoid x) => Fold (a -> x) (x -> b)

If the accumulator is a Monoid then it's trivial to do what you are asking. However, there were three reasons that I went with the non-Monoid version at the time:

If we're speaking about statistical library it's completely unacceptable from usability PoV. Such simple and basic operation must be simple one-liner.

Can you give me an idea of what API you had in mind?

I disagree it allows to write function which will work with different types of containers. It becomes posible to write function which calculate mean for list/vector/histogram

What I was asking is what the purpose of the newtype was. If you supplied a Comonad instance for Fold then you could use the CoKleisli operators instead of the Monoid operators.

Monomorphic folds (for bytestring/text/etc.) allows to apply different folds to single type of data. This type class allows opposite thing take single fold and use with different types of data containers.

I'm very reluctant to define new type classes unless there is a very compelling reason because type classes require a lot of buy-in from the rest of the Haskell ecosystem. The other problem is that you are asking for what is essentially the ListLike type class and the solution to that problem is out of scope for the foldl library.

Shimuuar commented 10 years ago

On 12/29/2013 04:54 AM, Gabriel Gonzalez wrote:

I understand now. This is only possible if we change the internal implementation of |Fold| back to the original formulation that I had in the first draft of this library:

data Fold a b = forall x . (Monoid x) => Fold (a -> x) (x -> b)

If the accumulator is a |Monoid| then it's trivial to do what you are asking. However, there were three reasons that I went with the non-|Monoid| version at the time:

  • I couldn't figure out how to convert left folds (i.e. the current representation) to the |Monoid| representation without leaking space.
  • The current representation has better constant factors
  • You can't (easily) parametrize type class instances, and there were some folds that I wanted to parametrize by a weight argument. An example is an exponentially weighted moving average where you want to parametrize the moving average by the weight.

What about storing monoid dictionaries and current state explicitly:

data Fold a b = ∀x. Fold (a → x) (x→x→x,x) x (x → b)

Since we are storing current state there's no space leak and it's possible to parametrize explicit dictionaries.

If we're speaking about statistical library it's completely
unacceptable from usability PoV. Such simple and basic operation
must be simple one-liner.

Can you give me an idea of what API you had in mind?

Best idea I came up so far is something along the lines:

newtype MFold a b = MFold (Sample a → b)

It's possilbe to chain such folds. It's actually a reader monad.

chain :: MFold x a → (a → MFold x b) → MFold x b chain = (>>=)

Sample arose during experimentation with such N-pass folds. I tried to hide fold state and make fold non-resumable.

But most severe problem is not N-pass folds. It's entirely possible to use ad-hoc solutions. Problem is that single fold accumulator can be used to estimate several statistic at once. For example online variance algorithm could be used to estimate number of elements, mean, max. likelyhood/unbiased variance/std. deviation or any combination of them.

Gabriella439 commented 10 years ago

So one problem I just realized last night is that if you try to make folds parallelizable you can't make them effectful any longer since you have to preserve the ordering of effects.

Shimuuar commented 10 years ago

I also found it difficult to formulate meaningful law for effectful parallelizable folds. There's another concern. I think it's not reasonable to require user to define merge function if all he want some aad hoc fold. I think only way is to create several fold types.

Gabriella439 commented 10 years ago

Well, I would like to keep foldl as simple as possible, though, mainly because Edward already covered the opposite extreme in his folds library. I wanted something that is very approachable for beginners who don't know how to write complex folds that are efficient and don't leak space.

PierreR commented 10 years ago

Can't foldl be built on top of folds to provide a simpler API ? Is there any hope to see a single ´fold´ lib ?

To end users, the duplication of libs with common goals as exemplified by mtl vs transformers is quite confusing.

At least at first and second glance ;-)

That said, folds does not look like an easy one to digest.

Gabriella439 commented 10 years ago

So the first problem is that folds has a ton of dependencies and I would like foldl to get into the Haskell platform eventually.

However, the nice thing about foldl is that it has two functions which greatly encourage dependency-free interop with other libraries. Check out the latest addition of purely and impurely, which allows other libraries to reuse foldl without actually depending on it. I use this to mix foldl with pipes libraries without incurring a foldl dependency.

The other thing is that foldl is designed to be as conceptually simple as possible because it is supposed to be a teaching tool. folds is very beginner-unfriendly, and building foldl on top of folds basically makes its implementation inscrutable and useless as a teaching tool.

PierreR commented 10 years ago

I do agree about the teaching aspect. Reading your source is quite approchable in general.

Are you planning to add a profunctor or even a comonad instance as suggested by Tom and Edward ?

Gabriella439 commented 10 years ago

So my plan is to fake profunctor for now by adding mapped and then add it once it gets into the platform. Same thing for comonad, I will wait for it to get into the platform.

Gabriella439 commented 10 years ago

I will go ahead and close this issue. What I concluded from this discussion is that:

Anything more complicated than that should probably go into a derived library. The exceptions are functions like filtered and other input transformations. I'm still open to the idea of including them but I need to spend more time reviewing this.