Swirrl / grafter

Linked Data & RDF Manufacturing Tools in Clojure
Eclipse Public License 1.0

Add a simple pipe debug function #22

Open RickMoynihan opened 9 years ago

RickMoynihan commented 9 years ago

Simple pipe debug function

Perhaps this already exists and I've missed this in the codebase, but I really want to see what's going on in the pipeline. I'd imagine that some real-world pipelines would contain many more steps than the convert-persons-data example.

I'd like to see the inclusion of some simple functions that help debug each step of the data transformation. Clearly we have tools like dotrace at our disposal, but they don't know how to print Datasets. For example, I'd like to see something like this added to the grafter.tabular namespace:

(defn debug-ds
  "Pretty print a dataset & return the dataset. Useful for debugging pipelines.
  Takes an optional output label which will help distinguish multiple outputs."
  ([dataset]
   (debug-ds dataset nil))
  ([dataset label]
   ;; A function body is an implicit do, so no explicit (do ...) is needed.
   (when label
     (println label))
   (clojure.pprint/print-table (:column-names dataset)
                               (:rows dataset))
   dataset))

Would be used like any other pipeline function and deleted when debugging is over:

(defpipe convert-persons-data
  "Pipeline to convert tabular persons data into a different tabular format."
  [data-file]
  (-> (read-dataset data-file)
      (drop-rows 1)
      (debug-ds "Rows dropped")
      (make-dataset [:name :sex :age])
      (derive-column :person-uri [:name] base-id)
      (mapc {:age ->integer
             :sex {"f" (s "female")
                   "m" (s "male")}})))

As you might expect, the REPL output displays the intermediate Dataset stage:

(convert-persons-data "./data/example-data.csv")
Rows dropped

|     a | b |  c |
|-------+---+----|
| Alice | f | 34 |
|   Bob | m | 63 |
=> 
| :name |   :sex | :age |                   :person-uri |
|-------+--------+------+-------------------------------|
| Alice | female |   34 | http://my-domain.com/id/Alice |
|   Bob |   male |   63 |   http://my-domain.com/id/Bob |

This somewhat naïve implementation might be useful enough as is. Clearly it should only be used on a small sample of data, because all that IO and those side effects would be terrible.

Maybe some sort of threading debug macro, e.g. dbg->, would be more useful, because you could then simply rename the normal thread-first macro (->) in place.
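To make the dbg-> idea concrete, here is a minimal sketch of what such a macro might look like. It is not part of grafter: the names dbg-> and print-ds are hypothetical, and the sketch assumes a dataset is a plain map with :column-names and :rows keys (rows being maps), as the debug-ds example above implies. Each step is labelled with the form that produced it.

```clojure
(require '[clojure.pprint :as pp])

(defn print-ds
  "Hypothetical helper: print a label, pretty print the dataset, then
  return the dataset unchanged so it can sit inside a -> chain."
  [dataset label]
  (println label)
  (pp/print-table (:column-names dataset) (:rows dataset))
  dataset)

(defmacro dbg->
  "Hypothetical drop-in replacement for ->. Threads x through forms like ->,
  but prints the intermediate dataset after every step, using the source of
  the step as its label."
  [x & forms]
  `(-> ~x
       ;; Interleave each pipeline form with a print-ds call whose label is
       ;; the printed source of that form.
       ~@(mapcat (fn [form] [form `(print-ds ~(pr-str form))]) forms)))
```

Renaming -> to dbg-> in a pipeline would then print every intermediate stage, and renaming it back removes all the output. The same caveat applies as for debug-ds: every step forces and prints the whole dataset, so this is only suitable for small samples.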

Robsteranium commented 9 years ago

The pretty print can be difficult to navigate on wide datasets, so I've been using incanter.core/view, which brings up a Swing GUI with some spreadsheet affordances (notably it compresses column widths).

(defn view [dataset]
  (incanter.core/view dataset)
  dataset)

A debug macro sounds interesting, though perhaps it should only take e.g. 5 rows, otherwise the output would quickly become hard to read and navigate. I wonder whether the label is surplus to requirements in this case; it might be sufficient to print the last line called.
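A row-limited variant could be as small as the sketch below. The name debug-head and the default of 5 rows are assumptions for illustration, and the dataset is again assumed to be a plain map with :column-names and :rows keys rather than a real grafter Dataset.

```clojure
(require '[clojure.pprint :as pp])

(defn debug-head
  "Hypothetical helper: pretty print at most n rows (default 5) of a
  dataset, then return the full dataset unchanged."
  ([dataset]
   (debug-head dataset 5))
  ([dataset n]
   ;; Only the printed view is truncated; the pipeline still receives
   ;; every row.
   (pp/print-table (:column-names dataset) (take n (:rows dataset)))
   dataset))
```

Because only the first n rows are realised for printing, this keeps the output readable on long datasets, although it does nothing for very wide ones.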

scottlowe commented 9 years ago

@Robsteranium Don't see why we can't have both plain text and GUI versions; it's easy to do. I added the optional label in the example just in case the poor developer got lost, but "line called" is as good.