ejconlon / analyze

making data science easy and safe with data frames in haskell
BSD 3-Clause "New" or "Revised" License

Radical changes proposal #3

Open NickSeagull opened 7 years ago

NickSeagull commented 7 years ago

I've been researching different possible implementations of a data frame, and it looks like a Vector (Vector a) is considerably slower than other representations. Given that Haskell compiles to machine code, it would be ideal to be as fast as a C++ implementation, or at least somewhere close.
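For illustration, one alternative to a nested Vector (Vector a) is a column store in which each column is a single typed vector, with unboxed vectors for numeric data so that elements sit contiguously in memory. The sketch below is only an assumption about what such a layout could look like: `Column`, `Frame`, `mkFrame` and `sumColumn` are hypothetical names, not part of analyze, and any speed difference would still need to be confirmed with benchmarks.

```haskell
import qualified Data.Map.Strict as M
import qualified Data.Vector as V
import qualified Data.Vector.Unboxed as U

-- | One typed column (hypothetical type, not part of analyze).
-- Numeric columns use unboxed vectors; strings stay in a boxed vector.
data Column
  = DoubleCol (U.Vector Double)
  | IntCol (U.Vector Int)
  | StringCol (V.Vector String)
  deriving (Eq, Show)

-- | A frame is a map from column name to column (hypothetical type).
newtype Frame = Frame (M.Map String Column)
  deriving (Eq, Show)

-- | Number of rows stored in a column.
columnLength :: Column -> Int
columnLength (DoubleCol v) = U.length v
columnLength (IntCol v)    = U.length v
columnLength (StringCol v) = V.length v

-- | Build a frame, rejecting input whose columns differ in length.
mkFrame :: [(String, Column)] -> Maybe Frame
mkFrame cols
  | allSame (map (columnLength . snd) cols) = Just (Frame (M.fromList cols))
  | otherwise = Nothing
  where
    allSame []       = True
    allSame (n : ns) = all (== n) ns

-- | Sum a numeric column by name; Nothing if it is missing or not numeric.
sumColumn :: String -> Frame -> Maybe Double
sumColumn name (Frame cols) = case M.lookup name cols of
  Just (DoubleCol v) -> Just (U.sum v)
  Just (IntCol v)    -> Just (fromIntegral (U.sum v))
  _                  -> Nothing
```

With this layout a per-column operation such as `sumColumn` touches one flat array instead of following a pointer per row.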

Haskell Data is immutable by default. This is one of the reasons it is such a nice language to program in; it removes a whole class of errors that can occur when multiple threads are modifying a data structure. It also allows the compiler to apply more optimisations.

However the disadvantage is that every time some data is “modified”, a new object (or at least part of it) must be allocated on the heap, and the old object must be garbage collected.

( Source )

It would be great if we could migrate to MVector, or to a representation based on MutableArrayArray, to make this much faster, and take advantage of libraries like vector-algorithms to make sorting easier. Also, a stateful representation of our dataframe makes much more sense to me, especially for people coming from R, Python and even F#.
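As a rough sketch of the sorting idea, assuming columns are stored as unboxed vectors: vector-algorithms sorts a mutable vector in place, and `modify` from the vector package lets that in-place sort run behind a pure function. The names `sortColumn`, `sortPermutation` and `reorder` are made up for this example and are not part of analyze.

```haskell
import Data.Ord (comparing)
import qualified Data.Vector.Algorithms.Intro as Intro
import qualified Data.Vector.Unboxed as U

-- | Sort one column. The introsort runs in place on a mutable copy
-- (or on the vector itself when the library can prove that is safe).
sortColumn :: U.Vector Double -> U.Vector Double
sortColumn = U.modify Intro.sort

-- | Row indices in the order that sorts the key column ascending.
-- Introsort is not stable, so the order of equal keys is unspecified.
sortPermutation :: U.Vector Double -> U.Vector Int
sortPermutation keys =
  U.modify (Intro.sortBy (comparing (keys U.!))) (U.enumFromN 0 (U.length keys))

-- | Apply a row permutation to any other column of the same length.
reorder :: U.Unbox a => U.Vector Int -> U.Vector a -> U.Vector a
reorder perm col = U.backpermute col perm
```

Sorting a whole frame by one column would then mean computing `sortPermutation` once on the key column and calling `reorder` on every column.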

PS: I would really love to rename the RFrame type to something like DataFrame or DataTable or Table or ...; right now it gives the impression that we are trying to clone R in some way 😄

Pinging @ejconlon

ejconlon commented 7 years ago

I agree, it is not optimized for anything right now. I am not tied to any naming or representation, things were just set up to figure out the API. Should we start benchmarks first?

NickSeagull commented 7 years ago

Yeah, definitely, we are figuring out API design in dataHaskell :)

ejconlon commented 7 years ago

You are also welcome to fork into datahaskell github org, and I can give hackage perms if you help me figure that out :) . It'll be a busy week for me and I don't want to be a blocker.

NickSeagull commented 7 years ago

Great, AFAIK there's an admin interface in every package page in which you can also add maintainers. 😁

Shimuuar commented 6 years ago

I'm not sure switching to MVector will achieve anything.

  1. It has the same representation as an immutable vector, except that it uses mutable arrays
  2. It forces you to work in the IO/ST monad, which makes it difficult or impossible to interoperate with existing Haskell libraries, since they expect immutable data
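
One possible middle ground, sketched here under the assumption that columns are unboxed vectors: keep the public API immutable and confine mutation to a local ST computation. `cumulativeSum` is a hypothetical example function, not part of analyze; it uses `modify` from Data.Vector.Unboxed, which works on a copy unless the library can safely update in place.

```haskell
import Control.Monad (forM_)
import Control.Monad.ST (ST)
import qualified Data.Vector.Unboxed as U
import qualified Data.Vector.Unboxed.Mutable as MU

-- | Running total of a column. Callers only ever see immutable vectors;
-- the mutable vector never escapes the ST action.
cumulativeSum :: U.Vector Double -> U.Vector Double
cumulativeSum = U.modify step
  where
    step :: MU.MVector s Double -> ST s ()
    step mv =
      -- Add the previous running total to each element, left to right.
      forM_ [1 .. MU.length mv - 1] $ \i -> do
        prev <- MU.read mv (i - 1)
        MU.modify mv (+ prev) i
```

Because the result is an ordinary immutable vector, it can still be passed to existing libraries that expect immutable data.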

ejconlon commented 6 years ago

Arrow might be a more worthwhile target: https://arrow.apache.org/

dmvianna commented 6 years ago

Two cents:

I’m a beginner Haskeller. Yesterday I went through a productive day at #haskell-beginners (functional programming) Slack and figured out Frames had enough of my wishlist to let me do some of my work. But I think this project looks tidier, and I would be willing to test it with my real world data. I work for the state government and have plenty of ugly data to play with.