reductio cap, postprocess group.all method with postprocess.js

nordfjord commented 9 years ago

Sorry for bombarding you with PR's I've just been having fun tonight, poking at these problems.

Firstly, I thought reductio was missing a postprocessing feature, something that allows us to edit the data coming out of the group.all function (useful for removing all 'unknown' keys for instance)

So I added postprocess.js which is called when the group already has an all method so that we may override it.

Then you have the cap.js file which contains a function to modify group.all so that it returns an array of length cap and whose last element is an aggregation of all elements with index >= cap I wrote support for sum avg and count in the cap function.

See the tests for usage.

gordonwoodhull commented 9 years ago

This looks like it could potentially cover the Fake Groups (non-)pattern in a more graceful way. It would be great to find a home for that code, whether it's in reductio or in some postprocess library.

(And yep, caps really shouldn't be part of dc.js; I think if there were a hook to deal with the Others special case better, there would be no need for that mixin.)

esjewett commented 9 years ago

Thanks for the PRs. I think adding the ability to run post-processing functions to Reductio would be great, and having cap and downsampling procedures available makes sense. I just think we need to think through where to put this in the API and the semantics of it.

Right now I have some basic thoughts on a 1.0 API for Reductio here: https://github.com/esjewett/reductio/blob/master/NOTES.md

For post-processing stuff, I think it would make the most sense to have these as methods on the returned group. So groups that go through Reductio would have the usual group.all method, but they would also have a group.post object providing access to functions like group.post.cap and group.post.downsample.

The reason I'd prefer to see these on the group rather than the reducer is that they are evaluated when called, not when the group is defined. That is, they are query-time, not load/filter-time. The semantics are therefore quite different than the reducers. For example, if you do ...

var group = dimension.group();
var reducer = reductio().sum(function(d) { return d.a; })(group);
reducer.sum(function(d) { return d.b; });
var result = group.all();

... then result contains a sum over a, not a sum over b. In other words, once a reducer is applied to a group, changing the reducer does not change the group. Further, a reducer can be applied to more than one group and the groups won't interfere with each other. However, the post-processing semantics are different, so I'd prefer not to have them on the reducer object.

Does that make sense? Would you be willing to rewrite the pull requests to put the post-processing logic onto the group object as suggested?

Thanks so much.

nordfjord commented 9 years ago

Alright, point taken, I suggest we make it like so then:

var reducer = reductio()
  .sum('foo')(group);

group.post.cap(3); // Array of length 3

This allows us too change the variables on the post processing methods at runtime while still allowing us to create multiple groups from a single reducer.

nordfjord commented 9 years ago

Or we could even make that chainable:

var reducer = reductio().sum('foo')(group);

group.post.sortBy('sum').cap(3)();

nordfjord commented 9 years ago

Alright I still haven't made it chainable, but man that would be killer.

Anyways as it stands now the API for it is:

var reducer = reductio().sum('foo')(group);

group.post.cap(3);
group.post.cap(3, 'My Others Name');

nordfjord commented 9 years ago

Right so, I added a method to reductio (global)

reductio.registerPostProcessor('before', function(prior){
  return function(before){
    return prior().filter(function(d){return +d.key < +before;});
  }
});

// reductio post now has a before method
var reducer = reductio().sum('foo')(group);
group.post().before(new Date())(); // returns all elements that are before now

Then there's the cap postprocessor, which is registered through that method and can be used as so:

var reducer = reductio().sum('foo')(group);
group.post().cap(3)(); // returns the top 3 elements of the grouping where the third is an aggregation of the rest

nordfjord commented 9 years ago

I added a sortBy postprocessor as well.

nordfjord commented 9 years ago

So with this PR you can do:

var reducer = reductio().sum('foo').count(true).avg(true)(group);

var sort_descending_and_cap = group.post().sortBy('-value.sum').cap(4);
var sort_ascending_and_cap = group.post().sortBy('value.sum').cap(4);
console.log(sort_descending_and_cap());
console.log(sort_ascending_and_cap());

The logs will be different since the postprocessing is applied seperately and simply return functions which work on the crossfilter group.all method.

esjewett commented 9 years ago

Starting to look good. I think I might want to remove the functionality in sortBy where it has a lot of intelligence built into the string it is passed and just use an accessor function plus the standard Reductio "if I see a string it is a property name" approach. I can make that change after merging the pull request.

Sorry that this might take a little while. I'm traveling and/or on vacation all week and much of next, but will try to get this merged and a new release cut eventually.

esjewett commented 9 years ago

I should also say, I like the registration approach. I'm now thinking if I can do something similar for all the reducers. Would make it quite a bit more encapsulated.

nordfjord commented 9 years ago

I added the extra intelligence on the sortBy processor since most of the time you're accessing values 'two levels' deep. So in order to allow a string to be passed it required the ability to handle strings like 'value.sum'.

It's no problem if you want it removed, just wanted to give you my perspective on why it's there.

esjewett commented 9 years ago

@nordfjord My thinking is that it should work like group.order. The accessor function is passed the contents of the group value. If it's a string instead of a function (i.e. group.post().sortBy('sum')), we interpret that as value.sum. For things that are deeper in the value object, people would just have to use accessor functions.

If there is another library somewhere that handles converting these types of strings to accessor functions in a minimally confusing sort of way, I'd consider using it.

nordfjord commented 9 years ago

Right, That's where we differ, my thinking was that you sometimes sort by key value (e.g. alphabetical table) and sometimes by value (e.g. top-N table).

Though I tend to agree with you that in it's current state the pluck_n function is confusing, and that alone is be reason to abandon it.

esjewett commented 9 years ago

Hmmm. That's true. Need to allow access to the whole group including the key to sort. Ok, let's back up.

I'll probably pull the pluck function in the sortBy into somewhere I can use it generally in Reductio for this string->accessor conversion? That is, also use it to replace the accessorify function here: https://github.com/esjewett/reductio/blob/master/src/accessors.js#L11

But can we just dump the part where we use '-' to indicate ascending or descending sort? That's semantically correct for numbers, but not for strings. Probably makes more sense to have sortBy take a second parameter that would accept an ordering like d3.ascending or d3.descending (or something else entirely), defaulting to ascending. So group.post().sortBy('value.sum', d3.ascending) is equivalent to group.post().sortBy('value.sum') and is reversed from group.post().sortBy('value.sum', d3.descending). What do you think?

nordfjord commented 9 years ago

I agree using the string to indicate sorting order is in general bad. Your proposal for an ordering function is cleaner.

In fact I think it's much cleaner from a programming perspective. I'll get on it when I'm off work.

That is sortBy would be of the form (accessor, order)

Where order is of the form (a,b) and returns -1, 1 or 0

The comparator function would then simply call this ordering function.

That should simplify things, and make the processor more powerful at the same time.

nordfjord commented 9 years ago

I'm however worried about adding a dependency on d3 just for the d3.ascending function unless you're planning on using it for more. Here is a simple ordering function for ascending:

function ascending(a, b){
return a < b ? -1 : a > b ? 1 : a >= b ? 0 : NaN;
}

which is probably similar to/the same as the d3 method.

esjewett commented 9 years ago

Yes, just using d3 as an example. I'd say let's just use a generic ascending ordering as the default. I'd just like the ordering expected to be compatible with the d3 ordering interface because it's so widely used (and makes sense).

Overall, nicely done and thanks for the pull request :-) I still need to look at the tests, but I think we're probably OK to merge after that change. I might still make some changes after the merge, but let's go for it.

nordfjord commented 9 years ago

Oh by all means look at the tests, writing tests is one area I'd like to improve in.

I'll raid the d3 codebase for their ascending function and use that as default.

I'll get on it after work.

regards,

Einar

esjewett commented 9 years ago

Cool - I just checked and d3's matches what you have above. Pretty standard, I'd imagine.

nordfjord commented 9 years ago

alright, so the '-' syntax is removed, and a d3.ascending 'like' function used as default comparator.

I don't have time for more at the moment, but let me know if you want me to assist in any regard.

esjewett commented 9 years ago

Looks good to me. Now it's just a matter of me finding time to review it and merge. Thanks so much! I think this will be a really good addition to Reductio, and hopefully helpful to people using dc.js as well as other libraries.

@gordonwoodhull Do you know of other common "fake group" patterns that we might want to support here?

gordonwoodhull commented 9 years ago

Once this is in I'll do a PR for most of the fake groups in the dc.js FAQ.

esjewett commented 9 years ago

Ok merged. Sorry for the delay on this and thanks again for the contribution. I'll probably be in here making some minor changes and testing later this coming week and then I'll do a release including the post-processor stuff.

crossfilter / reductio

reductio cap, postprocess group.all method with postprocess.js #22