add basic directory structure and initial documentation

dkochmanski commented 5 years ago

data frame protocol specification and stub implementation

dkochmanski commented 5 years ago

@skempf @slyrus if you have feedback on this please leave them in form of a review

dkochmanski commented 5 years ago

@sirherrbatka ^

sirherrbatka commented 5 years ago

This protocol honestly looks fine for it's intended purpose. The only thing that bothers me, and it is not because it is objectively wrong, is that only strings can be used as column/row names. I already decided to use symbols for this purpose instead. Would it be possible to extend the name type to be (or string symbol)? This would make it slightly more complicated because nil is a symbol. In fact, I am starting to consider revisiting my own decision now…

Do you plan to make some kind of spreadsheet viewer for the frames as well? I would be very interested in that!

dkochmanski commented 5 years ago

what would be a benefit of using symbols? I can see only eql comparisons, but if we allow also strings, then you need to use string= anyway. NIL is not a problem here.

Note, that I've dropped notion of named rows based on yours and Kabriel's suggestions.

Regarding spreadsheet viewer, crude implementation is easy:

(in-package clim-user)

(defun show-data-frame (clim-stream data-frame)
  (formatting-table (clim-stream)
    (formatting-row (clim-stream)
      (surrounding-output-with-border (clim-stream :background +grey+
                                                   :padding-top 0
                                                   :padding-bottom 0)
        (formatting-cell (clim-stream)
          (princ "id" clim-stream))
        (map nil (lambda (column-name)
                   (formatting-cell (clim-stream)
                     (princ column-name clim-stream)))
             (eu.turtleware.polyclot:cols data-frame))))
    (eu.turtleware.polyclot:map-data-frame-rows 
     data-frame t
     (lambda (row-index row)
       (formatting-row (clim-stream)
         (formatting-cell (clim-stream)
           (print row-index clim-stream))
         (eu.turtleware.polyclot:map-data-frame-cols
          data-frame row t
          (lambda (col-index col-name value)
            (declare (ignore col-index col-name))
            (formatting-cell (clim-stream)
              (print value clim-stream)))))))))

(define-application-frame poor-mans-spreadsheet ()
  ((dataframe :initarg :dataframe :reader dataframe))
  (:pane :application :text-margins '(:left 10 :top 5)
   :display-function (lambda (frame pane)
                       (show-data-frame pane (dataframe frame)))))

(run-frame-top-level
 (make-application-frame
  'poor-mans-spreadsheet
  :dataframe
  (eu.turtleware.polyclot:make-data-frame
   '("name" "col1" "col2" "col3")
   '("row1" value1 value2 value3)
   '("row2" value1 value2 value3))))

obraz

With slim utilities that would be even shorter.

sirherrbatka commented 5 years ago

As for symbols, it would take me a moment to explain exactly why but it made implementation slightly easier for me. I agree that for purpose of this protocol there is no advantage.

I will need time to digest spreadsheet example but thanks! As simplistic as it may be, It should be usefull.

dkochmanski commented 5 years ago

I've added another commit with a tutorial

skempf commented 5 years ago

Are you force pushing updates? Does that mean when I check out, I need to rebase? I'm not making changes on my end, just reviewing, but every time I checkout, it has merge conflicts.

dkochmanski commented 5 years ago

I am. please try

git fetch origin
git reset --hard origin/branch-name

Note that I'm rebasing the pull request (not the master branch), to have a clean merge. I wouldn't change master history.

dkochmanski commented 5 years ago

I've rebased once more. I've added also stub tests.

skempf commented 5 years ago

I'm not sure how long you want to keep it in review. I've read through some of the documentation several times now, and I forget what has changed and what has not, because I can't diff a previous commit. There are a few grammar errors, but the github gui makes it too difficult to fix that. I could send you a modified org file or something.

dkochmanski commented 5 years ago

sending a diff will work fine for me. regarding review, I want to have at least basic tests for all functions mentioned in the spec.

skempf commented 5 years ago

That leaves the map-X functions and add-X! functions. I might have some time this evening to work on it, if you don't have it done by then!

dkochmanski commented 5 years ago

Sounds good to me, I'll push sel and ref in two hours

dkochmanski commented 5 years ago

I've pushed (without rebasing to not introduce unnecessary merge conflicts to you -- I'll squash in the end) tests for ref and sel. They show some problems for invalid data.

skempf commented 5 years ago

gen-valid-row produces a row of single type data using a random index into a case table. When generating multiple rows, this function is called multiple times and produces rows of varying types (from row to row, but single type within the row). I had an issue with this, because when adding a new column with add-cols!, there are some instances where this throws an error. One example is if a given row is a bit vector, and then the new column data is a string, it fails trying to vector-push-extend the string into the bit vector.

Really, data-frames should be column type consistent, such that a given column is of the same data type.

Or it could be that when you said "row-based" originally, you meant to have a data-frame that is 90 deg rotated from what one might normally define (or at least that I'm used to). I have some ideas, but this question has to be answered first. Not sure if it is easier to discuss this here, or over irc.

Here is an example of a add-cols! test that shows the problem

(def-test add-cols! ()
  (for-all ((width  (gen-integer :min 2 :max +max-cols+))
            (height (gen-integer :min 1 :max +max-rows+))
            (new-width (gen-integer :min 1 :max +max-cols+)))
    (let* ((df (funcall (gen-data-frame height width)))
           (new-col-spec (funcall (gen-valid-cols new-width :prefix "new")))
           (new-col-data (funcall (gen-valid-rows height 1)))
           (name-fun-pairs (loop
                             :for ind :from 0 :below new-width
                             :append (list (elt new-col-spec ind)
                                           (lambda (row-index row)
                                             (declare (ignore row))
                                             (elt new-col-data row-index))))))
      (apply #'add-cols! df name-fun-pairs)
      (is (= (nth-value 1 (dims df)) (+ width new-width))))))

A dirty fix is to add :element-type t to the row copy-array in make-data-frame. But I really started thinking about the end use. As I currently understand the specification, I'm going to want to plot columns (not rows). It seems that we need better access to views of columns, and it would likely be better to arrange the data as columnar arrays rather than row-arrays.

If you want it to be row-based, with homogeneous rows, then the rows need to be named and the columns would represent data points. I'm UTC-5, so I might catch you early morning, or can try to chat your tomorrow afternoon.

skempf commented 5 years ago

I tried to un-approve my approval. It looks like I requested myself to review it again. Sorry if I messed it up.

dkochmanski commented 5 years ago

let's consider i.e a scatter plot, with rows as points I see it as:

|  x |  y |
|----+----|
|  1 |  2 |
|  3 |  1 |
|  4 |  5 |
| 12 | 42 |

plotting it would be something in this spirit:

(map-over-data-frame-rows df t
  (lambda (index row)
    (draw-point (ref df row "x") (ref df row "y"))))

Now we could plot a time series, where Y is the value and X is ignored:

(let ((x 0))
  (map-over-data-frame-cols df t "y"
    (lambda  (cind cname value)
      (draw-point x value)
      (incf x))))

how would it look like with the alternative representation? I think that a few examples of a use case would help me to understand the use case better. We can modify the specification to allow frames with different arrangement; also it is not that we are not allowed to change the specification when we encounter problems with it later down the road.

I gather you mean something like this?

| name | p1 | p2 | p3 | p4 | ... | pn |
|------+----+----+----+----+-----+----|
| x    |  1 |  2 |  3 |  4 | ... | n  |
| y    |  1 |  4 |  9 | 16 | ... | n² |

It is worth noting, that specialized data frames are not prohibited by the spec, it is only that they are left unspecified. i.e we could create a class <readonly-specialized-data-frame> which errors on add-rows! and add-cols! suggesting that operations should be performed on the data frame copy (and copy-data-frame would return an "ordinary" <raw-data-frame>) - then all rows (or cols) could be specialized non-adjustable vectors.

dkochmanski commented 5 years ago

these pushed merges with "solved conflicts" are quite troublesome. please consider rebasing your changes before push next time. I'll try to fix the history (please reset --hard to origin then).

skempf commented 5 years ago

In the function gen-valid-row-slice, case 4 sets up an array of indices to be taken as the slice. The length is (random (dims data-frame)), which sometimes results in a slice that is a zero length array #(). Is this really a valid case? I couldn't think of a use, but maybe?

map-data-frame-rows will accept any invalid function if the row-slice is an empty array. For now, I changed the creation of a valid row slice to use (max 1 (random (dims data-frame))) so my test would work as-is.

I have some changes for you to look at, but I'll wait until tomorrow to push them.

skempf commented 5 years ago

Regarding your use case example above. I think you are correct. Your first table and example plot functions are what I had in mind. I was reading the implementation backwards, but I have straightened up now!

dkochmanski commented 5 years ago

0-length arrays are fine (imagine an "empty" data frame to which we add columns later). Just like you can map over empty sequence (function is not called at all).

skempf commented 5 years ago

Big caveat: this is the first time I have used fiveam.

I setup a function to generate invalid functions to pass to add-cols!. This could also be used for map-data-frame-rows, because that function also takes a function of two arguments (row-index row).

When I construct the map-data-frame-rows.invalid test, I supply it with a combination of valid and invalid arguments. In particular, the combination of supplying a valid row-splice that happens to be an empty array and an invalid function, does not signal an error. This is because the function is never run, as you point out above.

I don't think it would be time well spent to put the required logic into map-data-frame-rows to test the function, even if the row-splice is empty. My inclination is to prevent that particular corner case when running the test for map-data-frame-rows.

dkochmanski commented 5 years ago

(for-all …) bindings have an optional third clause called guard - when guard form yields nil body is not executed. you could do it like this:

(for-all ((valid-rows (gen-valid-row))
          (invalid-cols (gen-invalid-col) (not (emptyp valid-rows))))
  (is (whatever)))

then you do not have to create a separate generator nor it has to be altered to return smaller subset of valid rows.

skempf commented 5 years ago

Thanks; that works great. Just have one more to write. Are you OK if I push the changes? There aren't any changes to origin, so I should be OK.

dkochmanski commented 5 years ago

did you try git fetch + git rebase ? either way it should be OK

skempf commented 5 years ago

There might be a small leak in that the guard blanking is combined with other tests in map-data-frame[-X].invalid, because I think none of the tests for that set of inputs is run if the guard is nil (i.e. empty row or col slice). If the number of tests is high enough, it still should catch all possibilities.

dkochmanski commented 5 years ago

Thank you. I've merged the branch.

TurtleWarePL / Polyclot

add basic directory structure and initial documentation #1