Open NickSeagull opened 7 years ago
I agree, it is not optimized for anything right now. I am not tied to any naming or representation, things were just set up to figure out the API. Should we start benchmarks first?
Yeah, definitely, we are figuring out API design in dataHaskell :)
You are also welcome to fork into the datahaskell GitHub org, and I can give Hackage perms if you help me figure that out :). It'll be a busy week for me and I don't want to be a blocker.
Great, AFAIK there's an admin interface in every package page in which you can also add maintainers. 😁
I'm not sure switching to MVector will achieve anything.
Arrow might be a more worthwhile target: https://arrow.apache.org/
Two cents:
1. I love the API. It is very tidy and does not make a wholesale copy of pandas or R functionality. This looks like Haskell.
2. I like RFrames as a name. These are Record frames, right? Why copy a less descriptive name from a less defined implementation?
3. I don’t want to rush your work, especially because being careful already led you to a good API. But the selling point for Haskell as a platform for data science (at least for me) is being able to do data exploration and munging on a stream of data. I would be more interested in figuring out the right API for that before starting to optimise for small data that fits in memory.
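To make the streaming point concrete, here is a minimal sketch of a single-pass summary that never holds the whole data set in memory. It uses only base; the lazy list stands in for a real stream source, and the names Summary, step, summarize and mean are hypothetical, not part of this project's API.

```haskell
import Data.List (foldl')

-- Hypothetical running summary of a numeric column.
-- Strict fields keep the accumulator from building up thunks.
data Summary = Summary
  { count :: !Int
  , total :: !Double
  } deriving (Eq, Show)

emptySummary :: Summary
emptySummary = Summary 0 0

-- Fold one observation into the summary.
step :: Summary -> Double -> Summary
step (Summary n s) x = Summary (n + 1) (s + x)

-- One strict left fold over the input; each element can be
-- garbage-collected as soon as it has been consumed.
summarize :: [Double] -> Summary
summarize = foldl' step emptySummary

-- The mean is undefined for an empty stream, hence Maybe.
mean :: Summary -> Maybe Double
mean (Summary 0 _) = Nothing
mean (Summary n s) = Just (s / fromIntegral n)
```

Because step only ever sees one value at a time, the same function could presumably be plugged into a fold from a streaming library instead of a list.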
I’m a beginner Haskeller. Yesterday I went through a productive day at #haskell-beginners (functional programming) Slack and figured out Frames had enough of my wishlist to let me do some of my work. But I think this project looks tidier, and I would be willing to test it with my real world data. I work for the state government and have plenty of ugly data to play with.
I've been researching different possible implementations of a dataframe, and it looks like a Vector (Vector a) is considerably slower than other representations. Given that Haskell is compiled to machine code, it would be ideal to be as fast as a C++ implementation, or at least somewhere close. (Source)
It would be great if we could migrate to MVector or a representation such as MutableArrayArray to make this much faster, and take advantage of libraries like vector-algorithms to make sorting easier. Also, having a stateful representation of our dataframe makes much more sense to me, for people coming from R, Python and even F#.
PS: I would really love to rename the RFrame type to something like DataFrame or DataTable or Table or ...; right now it gives the impression that we are trying to clone R in some way 😄
Pinging @ejconlon
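As a rough illustration of the MVector plus vector-algorithms idea, here is a minimal sketch that sorts a single column in place. It assumes the vector and vector-algorithms packages are installed; the Column type and the function names are hypothetical and say nothing about how RFrame is actually laid out.

```haskell
import Control.Monad.ST (runST)
import qualified Data.Vector.Algorithms.Intro as Intro
import qualified Data.Vector.Unboxed as VU

-- Hypothetical column representation: one unboxed vector per column,
-- rather than a boxed Vector (Vector a).
type Column = VU.Vector Double

-- VU.modify runs a mutable action on a copy of the input (or on the
-- input itself when that is safe) and returns the result as an
-- immutable vector.
sortColumn :: Column -> Column
sortColumn = VU.modify Intro.sort

-- The same thing spelled out: copy to an MVector, sort in place, freeze.
-- unsafeFreeze is fine here because mv is not used after freezing.
sortColumnST :: Column -> Column
sortColumnST col = runST $ do
  mv <- VU.thaw col
  Intro.sort mv
  VU.unsafeFreeze mv
```

Both versions keep a pure interface while doing the sort on a mutable buffer, so a stateful internal representation would not necessarily have to leak into the public API.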