Radical changes proposal

NickSeagull commented 7 years ago

I've been researching different possible implementations of a dataframe, and it looks like a Vector (Vector a) is considerably slower than other representations. Given that Haskell compiles to machine code, it would be ideal to be as fast as a C++ implementation, or at least somewhere close.

Haskell Data is immutable by default. This is one of the reasons it is such a nice language to program in; it removes a whole class of errors that can occur when multiple threads are modifying a data structure. It also allows the compiler to apply more optimisations.

However the disadvantage is that every time some data is “modified”, a new object (or at least part of it) must be allocated on the heap, and the old object must be garbage collected.

( Source )

It would be great if we could migrate to MVector, or to a representation based on MutableArrayArray, to make this much faster and to take advantage of libraries like vector-algorithms to make sorting easier. A stateful representation of our dataframe also makes much more sense to me, especially for people coming from R, Python and even F#.
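As a rough sketch of what sorting with vector-algorithms could look like: the snippet below assumes a column is stored as an unboxed vector of Double, which is an assumption for illustration and not how this library represents data today. The name `sortColumn` is hypothetical. It uses `modify` from the vector package, which runs a destructive operation on a copy of the vector (or in place when that is safe) and returns an immutable result.

```haskell
import qualified Data.Vector.Algorithms.Intro as Intro
import qualified Data.Vector.Unboxed as U

-- Hypothetical helper: sort one column of doubles.
-- 'U.modify' hands a mutable copy of the vector to the in-place
-- introsort from vector-algorithms and freezes the result afterwards.
sortColumn :: U.Vector Double -> U.Vector Double
sortColumn = U.modify Intro.sort
```

In GHCi, `sortColumn (U.fromList [3, 1, 2])` should give back the sorted column while the input vector stays untouched, so callers never see the mutation.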

PS: I would really love to rename the RFrame type to something like DataFrame or DataTable or Table or ..., since right now it gives the impression that we are trying to clone R in some way 😄

Pinging @ejconlon

ejconlon commented 7 years ago

I agree, it is not optimized for anything right now. I am not tied to any naming or representation, things were just set up to figure out the API. Should we start benchmarks first?

NickSeagull commented 7 years ago

Yeah, definitely, we are figuring out API design in dataHaskell :)

ejconlon commented 7 years ago

You are also welcome to fork into the datahaskell GitHub org, and I can give you Hackage permissions if you help me figure that out :) . It'll be a busy week for me and I don't want to be a blocker.

NickSeagull commented 7 years ago

Great, AFAIK there's an admin interface on every package page where you can also add maintainers. 😁

Shimuuar commented 6 years ago

I'm not sure switching to MVector will achieve anything.

1. It has the same representation as an immutable vector, except that it uses mutable arrays.
2. It forces you to work in the IO/ST monad. That makes it difficult or impossible to interoperate with existing Haskell libraries, since they expect immutable data.
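To make the second point concrete, here is a minimal sketch, again assuming an unboxed vector of Double as the column type (an illustrative assumption, not this library's representation). The helper `addInPlace` is hypothetical. All mutation has to happen inside ST, and the data must be thawed on the way in and frozen on the way out before any code expecting an immutable vector can use it.

```haskell
import Control.Monad (forM_)
import Control.Monad.ST (runST)
import qualified Data.Vector.Unboxed as U
import qualified Data.Vector.Unboxed.Mutable as UM

-- Hypothetical helper: add a constant to every element of a column.
addInPlace :: Double -> U.Vector Double -> U.Vector Double
addInPlace k v = runST $ do
  mv <- U.thaw v                          -- copy into a mutable vector
  forM_ [0 .. UM.length mv - 1] $ \i ->
    UM.modify mv (+ k) i                  -- destructive update inside ST
  U.unsafeFreeze mv                       -- fine here: mv is not used again
```

If the dataframe itself held an MVector, the thaw and freeze steps would move to every boundary with a library that expects an immutable vector, which is the interoperability cost described above.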

ejconlon commented 6 years ago

Arrow might be a more worthwhile target: https://arrow.apache.org/

dmvianna commented 6 years ago

Two cents:

1. I love the API. This is very tidy and not making a wholesale copy of pandas or R functionality. This looks like Haskell.
2. I like RFrames as a name. These are Record frames, right? Why copy a less descriptive name from a less defined implementation?
3. I don’t want to rush your work, especially because being careful already led you to a good API. But the selling point for Haskell as a platform for data science (at least for me) is being able to do data exploration and munging on a stream of data. I would be more interested in figuring out the right API for that before starting to optimise for small data that fits in memory.

I’m a beginner Haskeller. Yesterday I had a productive day on the #haskell-beginners (functional programming) Slack and figured out that Frames had enough of my wishlist to let me do some of my work. But I think this project looks tidier, and I would be willing to test it with my real-world data. I work for the state government and have plenty of ugly data to play with.

ejconlon / analyze

Radical changes proposal #3