Open NickSeagull opened 7 years ago
@ocramz @glutamate what do you guys think about storing columns the following way:
data Column
= I !(Vector Int)
| R !(Vector Double)
| B !(Vector Bool)
| ...
| Untyped !(Vector Text)
data Frame = Frame
{ _frameColumns :: !(Vector Column)
, ...
}
If you use Vector, I think that precludes any chunking or streaming-like solution for very large data sets. I'd see this type of structure as Frames.Strict, implying a Frames.Lazy that deals with chunking. Obviously the Text and ByteString APIs are inspiring the idea.
Perhaps the container (Vector) could be polymorphic?
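To make the polymorphic-container idea concrete, here is a minimal sketch, not taken from any existing package. The type parameter `f`, the helpers `columnLength` and `columnSum`, and the use of String in place of Text (to stay within base) are all assumptions for illustration.

```haskell
-- Hypothetical sketch: a column parameterised over its container f.
-- String stands in for Text so the example only needs base.
data Column f
  = I (f Int)
  | R (f Double)
  | B (f Bool)
  | Untyped (f String)

-- Written once for any Foldable container (lists, boxed Vector, ...).
columnLength :: Foldable f => Column f -> Int
columnLength (I xs)       = length xs
columnLength (R xs)       = length xs
columnLength (B xs)       = length xs
columnLength (Untyped xs) = length xs

-- Sum of a numeric column; Nothing for non-numeric columns.
columnSum :: Foldable f => Column f -> Maybe Double
columnSum (I xs) = Just (fromIntegral (sum xs))
columnSum (R xs) = Just (sum xs)
columnSum _      = Nothing
```

Operations that only fold over the data can be shared this way. Something like sorting would probably still need a container-specific implementation or an extra type class.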
That is a good idea @tonyday567, the thing is that one should then reimplement all of the operations based on the container, right?
I was thinking of using vector-algorithms and statistics for a first version, as they both operate on Vector and would make a first version faster to build.
On the other hand, if we were to use a streaming package like Conduit, or Streaming, how would one for example sort a dataframe without loading it all?
I started some experiment here: https://github.com/glutamate/analyze/blob/playground/src/Analyze/New.hs
> sort a dataframe without loading it all?
If your input doesn't fit into memory then you won't be able to do an in-memory sort. Map-reduce methods are what you do in practice - there's usually a sort between the map and the reduce.
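As a rough illustration of the "sort between the map and the reduce" shape, here is a small in-memory sketch. The name `mapReduce` and its signature are hypothetical. A real map-reduce system would distribute the work and do the sort out of core, which this sketch does not attempt.

```haskell
import Data.Function (on)
import Data.List (groupBy, sortOn)

-- Map every record to a (key, value) pair, sort by key so equal keys
-- become adjacent, then reduce each run of equal keys.
mapReduce :: Ord k => (a -> (k, v)) -> (k -> [v] -> r) -> [a] -> [r]
mapReduce mapper reducer =
    concatMap reduceGroup . groupBy ((==) `on` fst) . sortOn fst . map mapper
  where
    -- groupBy never yields empty groups; the [] case keeps the match total.
    reduceGroup []                = []
    reduceGroup grp@((k, _) : _) = [reducer k (map snd grp)]
```

For example, a word count would map each word `w` to `(w, 1)` and reduce each key with `sum`.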
Looking at the @glutamate experiment, I think this will resolve naturally as what you can do with a FrameContainer and what you can do to a Frame.
Yes, definitely. But I was rather thinking about when one wants to work with small data, like a 500 MB CSV file.
Added an Arbitrary instance to the @glutamate experiment, mostly to build some intuition. It worked well: sample (arbitrary :: Gen (Column []))
I figured a Show dependency was OK. I couldn't imagine a CSV field that wasn't a Show.
https://github.com/tonyday567/analyze/blob/arb/src/Analyze/New.hs
Looks great @tonyday567. Also, about sorting: it might be feasible to implement external sorting.
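For reference, the general shape of an external sort is: split the input into runs that fit in memory, sort each run, then merge the sorted runs. The sketch below only shows that structure on plain lists. The names `chunksOf`, `merge2` and `externalSortSketch` are hypothetical, and a real implementation would write each sorted run to disk and stream the merge instead of holding everything in memory.

```haskell
import Data.List (sort)

-- Split the input into runs of at most n elements.
chunksOf :: Int -> [a] -> [[a]]
chunksOf n xs
  | n <= 0    = error "chunksOf: run size must be positive"
  | null xs   = []
  | otherwise = let (run, rest) = splitAt n xs in run : chunksOf n rest

-- Merge two already-sorted lists into one sorted list.
merge2 :: Ord a => [a] -> [a] -> [a]
merge2 [] ys = ys
merge2 xs [] = xs
merge2 (x : xs) (y : ys)
  | x <= y    = x : merge2 xs (y : ys)
  | otherwise = y : merge2 (x : xs) ys

-- Sort each run independently, then merge all runs.
-- In a real external sort each run would live in its own file.
externalSortSketch :: Ord a => Int -> [a] -> [a]
externalSortSketch runSize = foldr merge2 [] . map sort . chunksOf runSize
```

The run size is the knob that would be tuned to available memory. The merge step only ever needs the head of each run, which is what makes the approach workable for data that does not fit in RAM.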
To have it in a persistent format, so anyone can join and discuss, I'm making this issue with different ideas that have been discussed in Gitter. Copy/pasting: