JuliaData / TypedTables.jl

Simple, fast, column-based storage for data analysis in Julia

Documentation and examples #23

Open andyferris opened 6 years ago

andyferris commented 6 years ago

This package needs a user guide.

andyferris commented 5 years ago

There's some work-in-progress using Documenter.jl and github-pages now - you can find this with the "latest" documentation badge from the README.

c42f commented 5 years ago

Hey Andy,

I wrote the first two doc review parts to you in an email... but thought I'd continue here (for the syntax highlighting etc). I'm up to IO:

Input and output

Tables.jl

I'm not sure what information you're trying to convey in the Tables.jl section. Are you saying that Table and FlexTable integrate with it? It's not clear how this is related to IO until you get to the section on CSV.

Perhaps talking about how TypedTables relates to the Tables API should go into the Table types section?

CSV.jl

Now here's where it gets really practical. I'd suggest renaming this section something like "Reading delimited text files into a Table", and include an example with the actual data which can be pasted into the REPL, perhaps. For that you can put the string containing the delimited data inline:

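using CSV, TypedTables

# Inline delimited data, wrapped in an IOBuffer so it can be pasted into the REPL
raw_data = IOBuffer("""
name,age
Alice,25
Bob,42
Charlie,37
""")

csvfile = CSV.File(raw_data, delim=',')

table = Table(csvfile)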

BTW the following is an interesting read https://github.com/JuliaData/CSV.jl/issues/340

c42f commented 5 years ago

BTW, I'm continuing to love the simple composable design of this and related packages. At this stage it just needs good documentation to clearly show how nice it all is!

c42f commented 5 years ago

Thoughts on the next section

Basic data manipulation

Mapping rows of data

Using map

Extracting a column - perhaps too simplistic given that there's another much better way to do this using t.name? I think you could probably cut this example.

How about this example of adding a new column?

julia> map(row -> (row..., is_old=row.age > 40), t)
Table with 3 columns and 3 rows:
     name     age  is_old
   ┌─────────────────────
 1 │ Alice    25   false
 2 │ Bob      42   true
 3 │ Charlie  37   false

Generators

This is nice.

Generators and comprehensions also support filtering data and combining multiple datasets, which cover in Finding Data and Joining Data.

Preselection

You're right, I couldn't see the point of getproperty(:name) out of context.

Finding data

Example can be expressed as

t[t.age .> 40]
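For anyone following along, here is a minimal sketch of that one-liner next to a filter-based version. It assumes t is the three-row table from the map example above, built with the keyword constructor Table(name=..., age=...), and that filter on a Table returns a Table:

```julia
using TypedTables

# Assumed stand-in for the table `t` used in the examples above
t = Table(name = ["Alice", "Bob", "Charlie"], age = [25, 42, 37])

# Boolean-mask indexing on a column, as suggested above
old_by_mask = t[t.age .> 40]

# Row-wise predicate; more verbose for a single-column condition
old_by_filter = filter(row -> row.age > 40, t)

# Both forms should select the same rows
@assert old_by_mask.name == old_by_filter.name
```

The mask form reads the age column only, which is probably why it feels more idiomatic for a single-column condition.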

I guess this is a frustration I'm having with a lot of the examples in the mapping and finding sections: it's good that they're simple, but on the other hand they're kinda unrealistically simple in the sense that you wouldn't express the code that way in practice. IMO the examples should be just complicated enough to show idiomatic use. Easy to say, I know.

c42f commented 5 years ago

Next section... one thing which strikes me here is that you're not really documenting TypedTables per se? But the Andyverse of data analysis? Which makes it kind of odd that the documentation is in TypedTables.jl.

Another thing which strikes me is that the grouping and joining sections seem quite polished. I especially enjoyed the grouping, which looks like it would address some of my frustrations having used the DataFrames groupby.

Grouping data

Spelling: Groupind in the index on the left

Using the group function

Ok, now I see why your curried version of getproperty is worthwhile. Perhaps you could link between the sections where getproperty is introduced/used. Actually, the curried getproperty should arguably be in Base.
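To make "curried getproperty" concrete, here is a rough sketch in plain Base Julia. The helper name getprop is hypothetical, chosen here to avoid extending Base.getproperty, and the rows are stand-in data rather than anything from the docs:

```julia
# Hypothetical helper: a one-argument, curried form of getproperty.
# getprop(:name) returns a function that extracts the `name` property.
getprop(name::Symbol) = x -> getproperty(x, name)

# Stand-in rows as plain named tuples
rows = [(name = "Alice", age = 25), (name = "Bob", age = 42), (name = "Charlie", age = 37)]

# The curried form slots in wherever a row -> value function is expected,
# for example as the key function of a grouping operation or in map
name_column = map(getprop(:name), rows)
age_column = map(getprop(:age), rows)

@assert name_column == ["Alice", "Bob", "Charlie"]
```

Passing getprop(:name) instead of row -> row.name gives the callee a named function to work with, which is presumably what makes the package's one-argument getproperty(:name) useful for grouping.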

Lazy grouping and Groupreduce

Very nice, you've got all the things!

Joining data

I do wonder whether product might better be named crossjoin. Not because I particularly like the latter name, but mainly because product is such a generic name, and crossjoin has better symmetry with innerjoin. Though product returns data which is naturally cartesian product shaped in contrast to leftjoin...

Left-group-join

I thought the intro to this section could describe the operation itself rather than the analogy with SQL or LINQ (which I'm not very familiar with).

Acceleration indices

However, the second "magic" ingredient used by an RDBMS for performance are secondary "acceleration indices", which are pre-calculated views of the data.

I'm not sure the database people would agree; I get the impression that disk layout and caching are quite important ;-) Also, views can be built using indices, but they're not the same thing.

The user is free to write generic code to execute their query, and the presence of the acceleration index will only act to speed up [...]

I think it's worth making the point that this is also the power of indices in SQL: you can add them to speed up the execution, but they are a performance tool and the query stays the same. In the same way, in Julia the code which manipulates the arrays stays the same but things go faster. It's a great composable design.
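A toy illustration of that decoupling in plain Base Julia. This is not the AcceleratedArrays implementation; IndexedVector and query are hypothetical names used only for this sketch, and the index is a simple hash map from value to row positions:

```julia
# Hypothetical wrapper: a vector plus a precomputed value => positions index
struct IndexedVector{T} <: AbstractVector{T}
    data::Vector{T}
    index::Dict{T, Vector{Int}}
end

function IndexedVector(data::Vector{T}) where {T}
    index = Dict{T, Vector{Int}}()
    for (i, x) in pairs(data)
        push!(get!(() -> Int[], index, x), i)
    end
    return IndexedVector{T}(data, index)
end

# Minimal AbstractVector interface
Base.size(v::IndexedVector) = size(v.data)
Base.getindex(v::IndexedVector, i::Int) = v.data[i]

# Specialised method: an equality search becomes a dictionary lookup
Base.findall(pred::Base.Fix2{typeof(isequal)}, v::IndexedVector) =
    copy(get(v.index, pred.x, Int[]))

# The "query" is written once, generically, and never mentions the index
query(v) = findall(isequal(42), v)

ages = [25, 42, 37, 42]
@assert query(ages) == query(IndexedVector(ages))
```

The plain vector is scanned linearly while the wrapped one dispatches to the lookup, yet query itself is unchanged, which is the same separation of semantics and performance that an SQL index gives.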

Ok I think I've read through all the docs. Overall, great stuff! I want to start using these packages ASAP.

c42f commented 5 years ago

Random (likely useless) thought bubble — having just written that AcceleratedArrays is great because it decouples performance from query semantics — could we think of the types in Table in a similar way; extra data to improve performance, but which might be missing? Thus somehow folding FlexTable and Table together?

andyferris commented 5 years ago

Haha - that's kind of interesting, actually. Until now I've been thinking of FlexTable as the slower Table. Should we define decellerate(t::Table) = FlexTable(t)? :smile:

andyferris commented 5 years ago

And thank you very much Chris for the valuable feedback! (I now have to find the time to make some fixes).

one thing which strikes me here is that you're not really documenting TypedTables per se? But the Andyverse of data analysis? Which makes it kind of odd that the documentation is in TypedTables.jl.

Yes... well that is kind-of true. They were developed quite specifically to work together - a kind of "native" and "Julian" relational algebra interface. SplitApplyCombine deserves better documentation of its own (and I'd like to port it to Base).

Another thing which strikes me is that the grouping and joining sections seem quite polished. I especially enjoyed the grouping, which looks like it would address some of my frustrations having used the DataFrames groupby.

Thanks for the feedback! It's fair to say that SplitApplyCombine exists specifically because there is no Base.group, so yeah this is the bit which I definitely feel the most strongly about and have thought about the longest.

c42f commented 5 years ago

Should we define decellerate(t::Table) = FlexTable(t)?

Not quite what I was thinking :-) More like trying to define

const FlexTable{N} = Table{Placeholder, N, NamedTuple{<:Any, <:Tuple{Vararg{AbstractArray{<:Any,N}}}}}

where Placeholder might be Nothing or NamedTuple undecorated with column names and types. Or something. And trying to see if that can lead anywhere productive.

andyferris commented 5 years ago

Regarding writing documentation, I liked this blog: https://www.divio.com/blog/documentation/

I feel like I should better factor my tutorials, explanations and how-tos. (At least the reference material is naturally docstrings in Julia).

c42f commented 5 years ago

:100: That's a really interesting article for framing the discussion around documentation. It's very interesting that they insist that these four types of documentation are really separate and should be written separately. In my mind, I suppose there were only two types: prose which has to function as all of tutorial, howto and explanation. And technical reference (docstrings).

In the language of the article, I'd say several sections of the TypedTables documentation had too much explanation. I think I made the same mistake with the Logging documentation which probably makes it read more like a design document than a practical guide. It was so much work! And yet people are still (understandably) confused about how to use it! Ack!

andyferris commented 5 years ago

Yes, agreed. I already began a rewrite to create a much more focussed tutorial. Interestingly, this starkly highlighted a couple of the (known) missing features, so I'm looking into these as I go.

I'm not certain how to phrase the left-over design explanation without it being just a rant. Anyway; iterate, iterate, iterate...

c42f commented 5 years ago

Yeah the explanation of a design is hard to write and make useful. The abstract design arises from a bunch of concrete use cases and practical constraints... but writing those down without any organization leads to a pile. On the other hand, remove them and it feels like you're writing fluff without justification. Kind of like a rant, yes!

Maybe it would help to try to name the dimensions of the "use case space"? A given design satisfies the needs of a bunch of use cases, and so fills out some nontrivial volume in that space. At the boundaries of the volume are some particular extrema which the design only just satisfies... are these the use cases which matter and are worth discussing to keep things concrete?

On the other hand there are the design and performance spaces, which (looking at the literature) seem to be more standard concepts. But for software design the design space seems rather high dimensional, poorly defined and combinatoric rather than continuous. Probably like most real world design problems...

Oh, I'm sure some category theory will help us out. (Um. I have only the vaguest idea of what that paper is proposing.)

c42f commented 5 years ago

Oops, got the link wrong... here's the paper which talks about using category theory for Formal Design.

andyferris commented 5 years ago

Thanks. Unfortunately, I haven't the time to look over something so... dense... at the moment ;)

andyferris commented 5 years ago

Chris - there's a new "tutorial" section up now, and a basic API reference. The remainder of the docs still need refactoring. But I think I'm much happier with the tutorial - to me it now doesn't seem significantly worse than getting started guides for DataFrames.jl, Python Pandas, R's Data.Table, etc.

Having _.name syntax in Julia 1.1 instead of this package using getproperty(:name) would make it nicer (I had to add an "explanation" to the "tutorial", shudder).

c42f commented 5 years ago

I do wonder whether _.name will get into 1.1. The issue of binding tightness is really thorny. Reading back on the issue, I'm rather dissatisfied with tight binding and Stefan's counter proposal seems better but is quite complicated and lacks an implementation.

andyferris commented 5 years ago

I agree. It all seems thorny enough to sink it (or at least delay it significantly).

c42f commented 5 years ago

Well, you totally nerd sniped me with the underscores business... Now there's MagicUnderscores.jl. You're the first to see it :-P