DataHaskell / analyze


[DISCUSSION] - Going to 1.0? #14

Open NickSeagull opened 7 years ago

NickSeagull commented 7 years ago

To have it in a persistent format, so anyone can join and discuss, I'm making this issue with different ideas that have been discussed in Gitter. Copy/pasting:

what is this monstrosity: _frameData :: !(Vector (Vector Text))
agree on tidying module
@NickSeagull is Series for e.g. time series?
Nikita Tchayka @NickSeagull 12:10
@glutamate on Series yes, could be used as a time series, or just like a "custom indexed vector" we are taking as an example this http://bluemountaincapital.github.io/Deedle/series.html
On the monstrosity, that's how the original author implemented it, and before changing everything and prematurely optimizing I think it is better to have something functional that everyone can use, even if it's not that efficient. Why do we use Text? Because by default all values are "untyped" and then the user gets them with a type as:
myDataFrame
    & Frame.getColumn "age" (Series.as :: Int)
    & <do stuff on the column>
Tom Nielsen @glutamate 12:11
I thought originally it was Vector (Vector v) ie. you could have whatever type you wanted?
if you want to have it really untyped, i think Dynamic is better than Text
but i liked having v as the base type, it's simple and gives me Functor
so i can map as I like
Marco Zocca @ocramz 12:13
yep
Nikita Tchayka @NickSeagull 12:13
That is true, but what happens when you load stuff that does not have homogeneous types?
You have to load it as Dynamic/Text
But then what would be the signature for the getColumn function?
Marco Zocca @ocramz 12:15
rather than dynamic, a sum type that contains all the possibilities
Tom Nielsen @glutamate 12:17
ok, i need to dig into this.
@NickSeagull which module is getColumn in?
and which branch
Nikita Tchayka @NickSeagull 12:17
It is not yet defined @glutamate
I'm taking as an example http://bluemountaincapital.github.io/Deedle/tutorial.html which is a widely used DataFrame library for F#
My intention is not to copy it, don't get me wrong
But imitate the API in some aspects, as it is very nice in my opinion
Tom Nielsen @glutamate 12:18
sure. Couldn't you still do getColumn if Dynamic is underlying?
and if you stayed with Vector (Vector v) getColumn could take a function (v -> a)
Marco Zocca @ocramz 12:20
@glutamate does Dynamic provide better speed or memory use?
Tom Nielsen @glutamate 12:21
@ocramz pretty sure it would be better than Text as you at least don't have to parse
just untag
Nikita Tchayka @NickSeagull 12:21
So maybe then the loadCSV functions must somehow deduce the type?
Tom Nielsen @glutamate 12:22
if you are not going to reveal the underlying value, we can do much better though. instead of tagging individual vals, tag only the column
Nikita Tchayka @NickSeagull 12:22
So we would have a vector of columns instead of a vector of rows?
Tom Nielsen @glutamate 12:23
yes, i think so? that's how data.frame in R works
i think
not sure about loadCSV
Marco Zocca @ocramz 12:27
so this loadCSV would need to be some sort of scanner that digests a bit of data and produces a parser for it
a "parser generator"?
Nikita Tchayka @NickSeagull 12:28
hmm
but what if the user loads a CSV with 800 columns
Marco Zocca @ocramz 12:29
eh
Nikita Tchayka @NickSeagull 12:29
And does not want to write a type
Marco Zocca @ocramz 12:30
so, what are the invariants? each column carries one type hopefully?
Nikita Tchayka @NickSeagull 12:32
Yes, but that would be a heterogeneous list
But instead of using HLists one can use Bookkeeper
Which at the same time can be used somehow as a dataframe, but I don't know what the performance issues are
That's why I wanted to go with plain Text. Look at this dataset http://archive.ics.uci.edu/ml/datasets/heart+Disease. It has 75 columns, and most of the time one would discard most of them. The problem is, one cannot load it without providing types before doing so.
Marco Zocca @ocramz 12:36
but what if we said data Entry = I !Int | R !Double | B !Bool ... and required that type Row = Vector Entry
ok, no higher-kinded types
Nikita Tchayka @NickSeagull 12:38
@ocramz there could be an ... | Untyped !Text | ... alternative maybe?
Marco Zocca @ocramz 12:38
possibly
ok this is all value level and not pretty
Tom Nielsen @glutamate 12:39
@ocramz @NickSeagull my opinion is that we need one very typed data frame like Frames and one very untyped. a key use case is when, as you said, you have 75 cols and you want to iterate over them
e.g. turn everything into predictors; do dimensionality reduction etc
Nikita Tchayka @NickSeagull 12:40
Exactly
About value level @ocramz, I'm not against it. Most people work at value level in other languages, so it would be an easy switch for them, which is one of the things that we want too
Tom Nielsen @glutamate 12:41
when @avctrh wakes up he will have opinions on this too.
Nikita Tchayka @NickSeagull 12:43
I like the 'very typed' frames alternative, but the thing is that I still don't see the use cases there. In the end one could just make a decodeRows :: (Row -> a) -> Frame r c -> a, right?
Marco Zocca @ocramz 12:43
@glutamate at any rate there will be one representation as a grid of Doubles
Nikita Tchayka @NickSeagull 12:43
Maybe it's just because I use data frames mainly for data exploration
Marco Zocca @ocramz 12:44
specifically for doing regression etc.
Marco Zocca @ocramz 12:50
yep @NickSeagull there should be multiple encoder/decoder function pairs
Nikita Tchayka @NickSeagull 12:50
Multiple? Just encodeRow and decodeRows maybe?
Marco Zocca @ocramz 12:51
well one for every internal representation
I think we'll have to drop down to unpacked types for efficiency at some point
is it a Prism? http://hackage.haskell.org/package/lens-4.15.4/docs/Control-Lens-Prism.html#v:prism
Nikita Tchayka @NickSeagull 12:52
I've used Lenses in the past, but not very in depth, so it might be
The thing is that I wouldn't like to expose lenses in the API
Would scare the ** out of newcomers
data FrameOperation r c s t a b
If I saw that at night in my house's corridor 3 years ago I would definitely call the police
lol
Marco Zocca @ocramz 12:59
true dat
Nikita Tchayka @NickSeagull 13:18
Also, I want to improve our current website with guidance for everyone who comes. Not just "we are nice and we do this", but also stuff like "Come to collaborate on analyze, foo, bar, ..., there are beginner friendly issues like ..."
Any ideas?
Tom Nielsen @glutamate 13:18
@ocramz if the data representation were Vector (Vector v) you'd get your grid for free. you could even implement onlyDoubles :: Frame k Value -> Frame k Double dropping all non-Double cols
Nikita Tchayka @NickSeagull 13:19
Btw, the type of frame would change from Frame k v to Frame rowKey colKey value
So we can combine multiple Series k v to get a frame
Tom Nielsen @glutamate 13:20
ok
Marco Zocca @ocramz 13:57
@glutamate rather than dropping columns I was more thinking of computing real-valued features from say string or categorical data
Tom Nielsen @glutamate 13:57
oh yeah we should definitely do that too
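
For reference, a minimal sketch of the value-level idea from the chat above: a sum type per cell plus a getColumn that takes a decoder function. This is an illustration only, not code from analyze. Plain lists and String stand in for Vector and Text so it runs with base alone, and the names Frame, frameNames, frameRows, asInt, asDouble and safeIndex are hypothetical.

```haskell
import Data.List (elemIndex)

-- One tagged value per cell; Untyped keeps the raw text of anything unparsed.
data Entry
  = I Int
  | R Double
  | B Bool
  | Untyped String
  deriving (Show, Eq)

type Row = [Entry]

-- Hypothetical row-oriented frame: column names plus rows of tagged cells.
data Frame = Frame
  { frameNames :: [String]
  , frameRows  :: [Row]
  } deriving (Show, Eq)

-- Decoders play the role of the (v -> a) function suggested in the chat.
asInt :: Entry -> Maybe Int
asInt (I n) = Just n
asInt _     = Nothing

asDouble :: Entry -> Maybe Double
asDouble (R x) = Just x
asDouble (I n) = Just (fromIntegral n)
asDouble _     = Nothing

-- Index into a list without crashing on short rows.
safeIndex :: Int -> [a] -> Maybe a
safeIndex i xs = case drop i xs of
  (y : _) | i >= 0 -> Just y
  _                -> Nothing

-- Look a column up by name and decode every cell.
-- Nothing means the column is missing or some cell failed to decode.
getColumn :: String -> (Entry -> Maybe a) -> Frame -> Maybe [a]
getColumn name decode (Frame names rows) = do
  ix <- elemIndex name names
  traverse (\row -> safeIndex ix row >>= decode) rows
```

With this shape the user never writes a type for the whole frame; the type is chosen per column at the call site, e.g. getColumn "age" asInt myFrame.
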
NickSeagull commented 7 years ago

@ocramz @glutamate what do you guys think about storing columns the following way:

data Column
    = I !(Vector Int)
    | R !(Vector Double)
    | B !(Vector Bool)
    | ...
    | Untyped !(Vector Text)

data Frame = Frame
    { _frameColumns :: !(Vector Column)
    , ...
    }
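
A rough sketch of how a loader could pick the tag for each column under this layout, which is the "loadCSV deduces the type" idea from the chat. It is an assumption-laden illustration, not the analyze API: lists and String replace Vector and Text to stay within base, only the four constructors written out above are covered, and inferColumn, readBool and doubles are hypothetical names.

```haskell
import Control.Applicative ((<|>))
import Data.Maybe (fromMaybe)
import Text.Read (readMaybe)

-- Column-tagged storage: one tag per column instead of one per cell.
data Column
  = I [Int]
  | R [Double]
  | B [Bool]
  | Untyped [String]
  deriving (Show, Eq)

-- Accept lowercase CSV-style booleans as well as Haskell's True/False.
readBool :: String -> Maybe Bool
readBool s = case s of
  "true"  -> Just True
  "false" -> Just False
  _       -> readMaybe s

-- Try the most specific type first and fall back to Untyped.
-- A column is typed only if every cell parses; an empty column ends up as I [].
inferColumn :: [String] -> Column
inferColumn raw =
  fromMaybe (Untyped raw) $
        (I <$> traverse readMaybe raw)
    <|> (R <$> traverse readMaybe raw)
    <|> (B <$> traverse readBool raw)

-- Typed accessor: numeric columns can be viewed as Doubles, others cannot.
doubles :: Column -> Maybe [Double]
doubles (R xs) = Just xs
doubles (I xs) = Just (map fromIntegral xs)
doubles _      = Nothing
```

A single unparseable cell (a "?" placeholder, say) sends the whole column to Untyped, so a real loader would probably need an explicit missing-value policy on top of this.
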
tonyday567 commented 7 years ago

If you use Vector, I think that precludes any chunking or streaming-like solution for very large data sets. I'd see this type of structure as Frames.Strict, implying a Frames.Lazy that deals with chunking. Obviously the Text and ByteString APIs are inspiring the idea.

Perhaps the container (Vector) could be polymorphic?
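
One way to read "polymorphic container" is to parameterise the column type over the container, so the same tags work for a strict vector, a list, or a chunked structure. The sketch below is only a guess at that shape, with String in place of Text; columnLength and mapReal are hypothetical helpers.

```haskell
-- The container f is a parameter: Column [] uses lists, and a Vector or a
-- chunked type could be plugged in the same way.
data Column f
  = I (f Int)
  | R (f Double)
  | B (f Bool)
  | Untyped (f String)

-- Works for any container that can be folded.
columnLength :: Foldable f => Column f -> Int
columnLength col = case col of
  I xs       -> length xs
  R xs       -> length xs
  B xs       -> length xs
  Untyped xs -> length xs

-- Transform a real-valued column; other columns pass through unchanged.
mapReal :: Functor f => (Double -> Double) -> Column f -> Column f
mapReal g (R xs) = R (fmap g xs)
mapReal _ other  = other
```

Operations that only need Functor or Foldable come for free this way. Anything that needs random access or sorting would still have to be written per container.
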

NickSeagull commented 7 years ago

That is a good idea @tonyday567, the thing is that one should then reimplement all of the operations based on the container, right?

For a first version I was thinking of using vector-algorithms and statistics, since both operate on Vector and would get us there faster.

NickSeagull commented 7 years ago

On the other hand, if we were to use a streaming package like Conduit, or Streaming, how would one for example sort a dataframe without loading it all?

glutamate commented 7 years ago

I started some experiment here: https://github.com/glutamate/analyze/blob/playground/src/Analyze/New.hs

tonyday567 commented 7 years ago

> sort a dataframe without loading it all?

If your input doesn't fit into memory then you won't be able to do an in-memory sort. Map-reduce methods are what you do in practice - there's usually a sort between the map and the reduce.

Looking at the @glutamate experiment, I think this will resolve naturally as what you can do with a FrameContainer and what you can do to a Frame.
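
To make the map-reduce remark concrete, here is a small in-memory sketch with the sort sitting between the map and the reduce. It is a toy over lists, not a distributed or out-of-core implementation, and mapReduce is a hypothetical name.

```haskell
import Data.Function (on)
import Data.List (sortOn)
import Data.List.NonEmpty (NonEmpty (..))
import qualified Data.List.NonEmpty as NE

mapReduce
  :: Ord k
  => (a -> [(k, v)])   -- map: emit key/value pairs for one record
  -> (k -> [v] -> r)   -- reduce: combine all values that share a key
  -> [a]
  -> [r]
mapReduce mapper reducer =
    map reduceGroup
  . NE.groupBy ((==) `on` fst)   -- equal keys are adjacent after the sort
  . sortOn fst                   -- the sort between map and reduce
  . concatMap mapper
  where
    reduceGroup ((k, v) :| rest) = reducer k (v : map snd rest)
```

In a real out-of-core setting the sort step is the part that would spill to disk; the map and reduce steps only ever see one record or one key group at a time.
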

NickSeagull commented 7 years ago

Yes, definitely. But I was rather thinking about when one wants to work with small data, like a 500 MB CSV file.

tonyday567 commented 7 years ago

Added an Arbitrary instance to the @glutamate experiment, mostly to build some intuition. It worked well: sample (arbitrary :: Gen (Column []))

I figured a Show dependency was ok - couldn't imagine a csv field that wasn't a Show.

https://github.com/tonyday567/analyze/blob/arb/src/Analyze/New.hs

NickSeagull commented 7 years ago

Looks great @tonyday567. Also, about sorting: it might be feasible to implement external sorting.
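
A sketch of what external sorting would look like in outline: cut the input into runs that fit in memory, sort each run, then merge the sorted runs. Here the runs are ordinary in-memory lists, which is an assumption made to keep the example self-contained; in a real implementation each run would be written to disk and the merge would stream from those files. All names are hypothetical.

```haskell
import Data.List (sort)

-- Split the input into runs of at most n elements and sort each run.
sortedRuns :: Ord a => Int -> [a] -> [[a]]
sortedRuns n xs
  | n <= 0    = error "sortedRuns: run size must be positive"
  | null xs   = []
  | otherwise =
      let (run, rest) = splitAt n xs
      in  sort run : sortedRuns n rest

-- Merge two sorted lists into one sorted list.
merge2 :: Ord a => [a] -> [a] -> [a]
merge2 [] ys = ys
merge2 xs [] = xs
merge2 (x : xs) (y : ys)
  | x <= y    = x : merge2 xs (y : ys)
  | otherwise = y : merge2 (x : xs) ys

-- Merge runs pairwise until a single sorted run remains.
mergeRuns :: Ord a => [[a]] -> [a]
mergeRuns []   = []
mergeRuns [r]  = r
mergeRuns runs = mergeRuns (pairUp runs)
  where
    pairUp (a : b : rest) = merge2 a b : pairUp rest
    pairUp rest           = rest

-- n is the number of elements assumed to fit in memory at once.
externalSort :: Ord a => Int -> [a] -> [a]
externalSort n = mergeRuns . sortedRuns n
```

For a dataframe, the elements would be rows compared on the sort key, and the run size would be chosen from the available memory.
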