add more statistical transformations

dkochmanski commented 4 years ago

In branch specify-layered-grammar a protocol for statistical transformations is defined (method statistical-transformation and a protocol class).

All statistical transformations (except of and which is a fallback) should be defined in Source/statistical-transformations.lisp. The task requires looking at ggplot2, picking the most useful transformations first along with their parameters, implementing them, and adding tests and documentation for them: the aesthetics used, the columns or rows added, or however else they modify a data frame.

It is worth thinking about a convention for naming new columns. If there are any problems with the protocol, it may be changed to mitigate them.

skempf commented 4 years ago

Should I create a new pull request with this, or do you want me to add it to the existing pull request #2?

I looked through ggplot2 and took some notes on the various transformations, and prioritized them (simple a-b-c org-mode priorities). The line between geom/layer/stat seems kind of blurry. I created a doc-statistical-transformations.org file with my notes, intending on slowly turning it into the actual documentation. I can upload it, or just keep it here until it is finished.

The more I think about this, the more I think we should develop a specification for the layered grammar. I have been mulling this over this evening (you probably saw a bunch of emails as I edited these comments, sorry). Obviously the underlying df cannot be modified, or you could lose the data; e.g. for a histogram we will have a different number of rows, based on the number of bins, and with nudge/jitter the xy is shifted, etc. It seems like the layer is the best place to store a pointer to the incoming data-frame, which is how you have it now, and the generic function statistical-transformation should return a new df.
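A minimal sketch of that idea, using hypothetical stand-in classes (toy-frame, toy-stat-nudge, toy-layer and the generic toy-transform are illustrative names, not Polyclot's actual protocol): the layer keeps a pointer to the incoming frame, and the transformation builds a fresh frame rather than touching the original.

```lisp
;; Hypothetical stand-ins for illustration; not Polyclot's actual classes.
(defclass toy-frame ()
  ((columns :initarg :columns :reader frame-columns) ; list of column names
   (rows    :initarg :rows    :reader frame-rows)))  ; list of row lists

(defclass toy-stat-nudge ()
  ((column :initarg :column :reader nudge-column)
   (offset :initarg :offset :reader nudge-offset)))

(defclass toy-layer ()
  ((data :initarg :data :reader layer-data)   ; pointer to the incoming frame
   (stat :initarg :stat :reader layer-stat)))

(defgeneric toy-transform (stat frame)
  (:documentation "Return a fresh frame; FRAME must not be modified."))

(defmethod toy-transform ((stat toy-stat-nudge) (frame toy-frame))
  (let ((index (position (nudge-column stat) (frame-columns frame)
                         :test #'string=)))
    (make-instance 'toy-frame
                   :columns (copy-list (frame-columns frame))
                   :rows (mapcar (lambda (row)
                                   ;; Copy each row before shifting it.
                                   (let ((new (copy-list row)))
                                     (incf (nth index new) (nudge-offset stat))
                                     new))
                                 (frame-rows frame)))))

(defun layer-plot-data (layer)
  "Frame the layer would draw: the stat applied to the stored input frame."
  (toy-transform (layer-stat layer) (layer-data layer)))

;; Usage: the nudged rows live in a new frame, the input keeps its values.
(defparameter *frame*
  (make-instance 'toy-frame :columns '("x" "y") :rows '((1 10) (2 20))))
(defparameter *layer*
  (make-instance 'toy-layer
                 :data *frame*
                 :stat (make-instance 'toy-stat-nudge :column "x" :offset 0.5)))
(frame-rows (layer-plot-data *layer*)) ; => ((1.5 10) (2.5 20))
(frame-rows *frame*)                   ; => ((1 10) (2 20))
```

Because the stat object holds only its parameters here, the same stat instance could in principle be reused with several frames.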

I also think that the classes have to be named with superclass-subclass. For example, both geom and stat have a bin subclass.

dkochmanski commented 4 years ago

answering the first question: please push directly to the branch associated with the existing pull request. I won't change history until we have it all done; then we will rearrange commits (when all is in place). for that reason please keep commits local to one file (i.e. do not commit changes to both a lisp and an org file, or to two different lisp files, in one go; that will make juggling commits harder)

please include org file in doc

I agree that we should have a specification. some things will come up while implementing, but the goal is to have documentation, a protocol and an implementation

note that aest may be part of stat, but also geom may have its own mappings (and other objects too I suppose), so naming the reader for all simply %aest or aesthetics or mapping (or the constructor make-aest) may be better so there is a single function to probe aest from the object

class name conflict is a problem. otoh it would be nice to not prefix them with the superclass name. I need to think about it more. one mundane solution I can think of is:

(defvar <bin> '<bin>)
(defun geom (type &rest args)
  (case type
    (<bin> (apply <geom-bin> args))
    (otherwise (apply type args))))

and other conflicting names resolved in this spirit. that way we could benefit from the fact that we construct objects with appropriate functions (either geom, stat or something else).

skempf commented 4 years ago

I see. So just a simple function like

(defun %aest (component)
  (slot-value component 'aest))

should work. I thought about making it a method that dispatched on the various components, but I can't see a reason to have more than one method; something like

(define-class <chart-component> () () 
  (:documentation 
   "All chart components inherit from this class."))

(defgeneric %aest (<chart-component>))

I don't think we would need to dispatch this function on the different components though, but we should probably have a base <chart-component> class that everything inherits from?
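As a rough illustration in plain CLOS (toy-component, toy-geom, toy-stat and component-aest are made-up names, and Polyclot's own define-class macro is deliberately not used), a single slot reader on a shared base class already works for every subclass without per-component methods:

```lisp
;; Hypothetical names for illustration only.
(defclass toy-component ()
  ((aest :initarg :aest :initform nil :reader component-aest))
  (:documentation "Base class that all toy chart components inherit from."))

(defclass toy-geom (toy-component) ())
(defclass toy-stat (toy-component) ())

;; The inherited reader serves both subclasses.
(component-aest (make-instance 'toy-geom :aest '(:x "Salary"))) ; => (:X "Salary")
(component-aest (make-instance 'toy-stat))                      ; => NIL
```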

skempf commented 4 years ago

I've been working on the bin statistical transformation to help me flesh out the overall concept a bit. I have come across a need to store some information, and I would like to store it in the <stat-bin> instance. This would mean, though, that you couldn't re-use a stat-bin, but would need a new one for each data-frame. Reading through ggplot, I get the impression that components like stat and aest are df independent. Where do you see the storage: in the geom instance, or the layer instance?

It seems to me, it either has to be the stat or the geom object. The layer object needs to be independent of the type of geom object it receives.

dkochmanski commented 4 years ago

alternatively we may take a "dwim" road and define function aest as this:

(defun aest (&rest plist)
  (if (alexandria:sequence-of-length-p plist 1)
      (slot-value (pop plist) 'aest)
      (loop for (i j) on plist by #'cddr
            collect i into aes
            collect j into var
            finally (return (<aest> :aest aes :vars var)))))

dkochmanski commented 4 years ago

what kind of information do you need to store? could you give me a few examples?

skempf commented 4 years ago

Hold that thought. For bins, I might be able to get around it, but maybe not for boxplots. I'm still prototyping the bin though.

I have a working transformation for bin. The major issues are:

1. does not deal with date / time. How did you want to handle dates and times? I didn't know if you had a preferred package, or way to do it. ggplot uses days for dates and seconds for time by default. Or do you want to ignore this for the moment?
2. limited user inputs (fixed uniform bins with left edge in bin and no padding)
3. interface to data-frame protocol is messy; it is easier to create the bins and frequencies with arrays, and I tried to hammer it into a map-function, but it is a hack. Rather than consing everything, I would like to make a make-data-frame that accepts the rows as arrays. Are you OK with that?
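For reference, the array-based binning described here (fixed uniform bins, left edge included, no padding) might look roughly like the following. This is a sketch under assumptions, not the prototype itself: toy-bin is a hypothetical name, the bins span exactly the minimum to the maximum of the data, and the maximum value is folded into the last bin.

```lisp
(defun toy-bin (samples bin-count)
  "Count SAMPLES into BIN-COUNT equal-width bins spanning [min, max].
Each bin includes its left edge; the maximum falls into the last bin.
Returns two vectors as multiple values: bin centers and frequencies."
  (let* ((lo (reduce #'min samples))
         (hi (reduce #'max samples))
         (width (/ (- hi lo) bin-count))
         (centers (make-array bin-count))
         (counts (make-array bin-count :initial-element 0)))
    (dotimes (i bin-count)
      (setf (aref centers i) (+ lo (* width (+ i 1/2)))))
    (map nil
         (lambda (v)
           ;; Guard against zero width (all samples equal) and clamp the
           ;; maximum, which would otherwise index one past the last bin.
           (let ((i (if (zerop width)
                        0
                        (min (floor (- v lo) width) (1- bin-count)))))
             (incf (aref counts i))))
         samples)
    (values centers counts)))

;; Usage: pair centers with counts to get x/y rows for a new frame.
(multiple-value-bind (centers counts) (toy-bin #(1 2 2 3 5 9) 4)
  (map 'list #'list centers counts))
;; => ((2 3) (4 1) (6 1) (8 1))
```

Working on two preallocated vectors makes the "find the bin, then incf" step a single array access, which is the convenience the row-by-row data-frame interface lacks.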

dkochmanski commented 4 years ago

1. I didn't put much thought into dates. As for a preferred library, I'm moderately versed with the local-time library, so I'd prefer that if we want to jump into dates now. I'd postpone that for later though.
2. please elaborate on what the issue is (maybe with some examples)
3. how about specializing copy-data-frame on arrays (first row being interpreted as columns)?

skempf commented 4 years ago

1. I agree; let's postpone dates for later.
2. This is more of a fact; the current prototype has limited user inputs. The prototype needs some cleanup, and I need to decide about some helper functions. Then I'll start adding user inputs. Do you want to primarily replicate the ggplot2 interface?
3. Comments in #2.

dkochmanski commented 4 years ago

not necessarily, we are inspired by ggplot but if some change makes sense let's do it. also we may have only the most important options implemented first and when we have a protocol for stats, geoms etc we will deepen the implementation.

skempf commented 4 years ago

Perhaps more important than an array-data-frame may be some methods to quickly access and set specific selections (or perhaps that is the point of the array-data-frame). I tried to refactor the bin transformation to use map-aesthetics and add-rows! starting from an empty data-frame, but this is also difficult ... probably for a similar reason that you set up a hash table on the count stat: it is much easier to grab the right location and incf the frequency.

However, I'll be the first to admit, I think more linearly, like a Fortran number cruncher, so perhaps there is some slick solution I haven't thought of yet. I'll clean up what I have and push it so you can see some of the things I am talking about.

Something else to think about is that when we add a column, or in my case a new data-frame, we need to modify the aest as well to reflect the new (or adjusted) data-frame.

(stat 'bin (aest :x "Salary") :count 29)

results in a new data-frame with 29 rows where x is "Salary" and y is "Frequency" or count or some such thing.

skempf commented 4 years ago

I made a change to the stat-count transformation to return a new data-frame with only the unique :x values. This is consistent with R; not that we necessarily have to do it that way, but it certainly makes sense to me.
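A small sketch of what "only the unique :x values" amounts to, using a hash table to accumulate frequencies. The function toy-count is a hypothetical name and operates on a plain list rather than on Polyclot's data-frame protocol:

```lisp
(defun toy-count (items)
  "Return a list of (item frequency) rows, one per unique item,
in order of first appearance."
  (let ((table (make-hash-table :test #'equal))
        (order '()))
    (dolist (item items)
      ;; Remember each key the first time it is seen.
      (unless (nth-value 1 (gethash item table))
        (push item order))
      (incf (gethash item table 0)))
    (mapcar (lambda (key) (list key (gethash key table)))
            (nreverse order))))

(toy-count '("D" "E" "D" "F" "E" "D"))
;; => (("D" 3) ("E" 2) ("F" 1))
```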

One thing I was concerned about is what if you decided you wanted to facet the diamond price histogram on color. I played around with ggplot some more, and although you can specify the histogram and facet in any order, the actual stat is returning a new data-frame with only the relevant columns. As an example with a price histogram, stat_bin is returning a new data-frame with the bin centers and the frequency counts in those bins as x and y respectively. If you save that and add a facet, there is logic to reprocess the original data-frame and send the filtered data to stat_bin, which again, returns a new data-frame with x and y.
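The reprocessing order described above (filter the original rows per facet first, then bin each subset) can be sketched as follows. All names are hypothetical, rows are plain lists, and the bin edges are passed in and shared across facets purely to keep the example short; this says nothing about how ggplot2 chooses them.

```lisp
(defun toy-group-by (rows key-fn)
  "Split ROWS into an alist of (key . rows), keeping first-seen key order."
  (let ((groups '()))
    (dolist (row rows)
      (let ((entry (assoc (funcall key-fn row) groups :test #'equal)))
        (if entry
            (push row (cdr entry))
            (push (list (funcall key-fn row) row) groups))))
    (mapcar (lambda (entry) (cons (car entry) (reverse (cdr entry))))
            (reverse groups))))

(defun toy-bin-counts (samples lo width bin-count)
  "Frequencies of SAMPLES in BIN-COUNT bins of WIDTH starting at LO.
Bins include their left edge; samples outside the range are ignored."
  (let ((counts (make-list bin-count :initial-element 0)))
    (dolist (v samples counts)
      (let ((i (floor (- v lo) width)))
        (when (< -1 i bin-count)
          (incf (nth i counts)))))))

(defun toy-facet-histogram (rows facet-fn value-fn lo width bin-count)
  "Filter the ORIGINAL rows per facet, then bin each subset separately."
  (mapcar (lambda (group)
            (cons (car group)
                  (toy-bin-counts (mapcar value-fn (cdr group))
                                  lo width bin-count)))
          (toy-group-by rows facet-fn)))

;; Rows are (color price); bins are [300,600), [600,900), [900,1200).
(toy-facet-histogram '(("D" 300) ("E" 450) ("D" 900) ("E" 320) ("D" 350))
                     #'first #'second 300 300 3)
;; => (("D" 2 0 1) ("E" 2 0 0))
```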

TurtleWarePL / Polyclot

add more statistical transformations #3