frictionlessdata / datapackage

Data Package is a standard consisting of a set of simple yet extensible specifications to describe datasets, data files and tabular data. It is a data definition language (DDL) and data API that facilitates findability, accessibility, interoperability, and reusability (FAIR) of data.
https://datapackage.org

Data Transform specs #255

Closed rufuspollock closed 5 months ago

rufuspollock commented 8 years ago

https://github.com/vega/vega/wiki/Data-Transforms

Appendix: Data Transform Research

Plotly Transforms

No dedicated data transform libraries have been found for Plotly.

Vega Transforms

https://github.com/vega/vega/wiki/Data-Transforms - v2

https://vega.github.io/vega/docs/transforms/ - v3

Vega's Data Transforms can be used to manipulate datasets before rendering a visualisation. For example, one may need to aggregate or filter a dataset (there are many transform types; see the links above) and only display the graph after that. Another situation is creating a new dataset by applying various calculations to an existing one.

Transforms are usually defined in a transform array inside a data property.

"Transforms that do not filter or generate new data objects can be used within the transform array of a mark definition to specify post-encoding transforms."

Examples:

Filtering

https://vega.github.io/vega-editor/?mode=vega&spec=parallel_coords

This example filters rows that have both Horsepower and Miles_per_Gallon fields.

{
 "data": [
    {
      "name": "cars",
      "url": "data/cars.json",
      "transform": [
        {
          "type": "filter",
          "test": "datum.Horsepower && datum.Miles_per_Gallon"
        }  
      ]
    }
  ] 
}
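Aggregation, mentioned above as another common need, uses the same transform array. A minimal sketch in the same v2 syntax, deriving a new dataset from the cars dataset of the previous example (the grouping and output field names are illustrative):

{
  "data": [
    {
      "name": "stats",
      "source": "cars",
      "transform": [
        {
          "type": "aggregate",
          "groupby": ["Origin"],
          "summarize": [
            {"field": "Horsepower", "ops": ["mean"], "as": ["mean_hp"]}
          ]
        }
      ]
    }
  ]
}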

Geopath, aggregate, lookup, filter, sort, voronoi and linkpath

https://vega.github.io/vega-editor/?mode=vega&spec=airports

This example has a lot of transforms: in some cases only a single transform is applied to a dataset, in other cases a sequence of transforms is applied.

In the first dataset, it applies the geopath transform, which maps GeoJSON features to SVG path strings. It uses the albersUsa projection type (more about projections).

In the second dataset, it applies a sum operation to the "count" field, grouped by "origin", and outputs the result as a "flights" field.

In the third dataset: 1) it compares its "iata" field against the "origin" field of the "traffic" dataset; matching rows are output as a "traffic" field. 2) Next, it filters out all rows whose "traffic" value is null. 3) After that, it applies the geo transform, which projects longitude/latitude into layout_x/layout_y coordinates, using the same projection settings as the first dataset. 4) Next, it filters out rows whose layout_x or layout_y value is null. 5) Then, it sorts the dataset by the traffic.flights field in descending order. 6) Finally, it applies the voronoi transform to compute a Voronoi diagram based on the "layout_x" and "layout_y" fields.

In the last dataset: 1) First, it keeps only rows whose "origin" field matches the "iata" attribute of the "hover" signal (specified in the Vega spec's "signals" property). 2) Next, it looks up the "airports" dataset's "iata" field against the rows' "origin" and "destination" fields; the output fields are saved as "_source" and "_target". 3) It then filters out rows whose "_source" or "_target" value is not truthy (i.e. is null). 4) Finally, the linkpath transform creates visual links between nodes (more about linkpath).

{
  "data": [
    {
      "name": "states",
      "url": "data/us-10m.json",
      "format": {"type": "topojson", "feature": "states"},
      "transform": [
        {
          "type": "geopath", "projection": "albersUsa",
          "scale": 1200, "translate": [450, 280]
        }
      ]
    },
    {
      "name": "traffic",
      "url": "data/flights-airport.csv",
      "format": {"type": "csv", "parse": "auto"},
      "transform": [
        {
          "type": "aggregate", "groupby": ["origin"],
          "summarize": [{"field": "count", "ops": ["sum"], "as": ["flights"]}]
        }
      ]
    },
    {
      "name": "airports",
      "url": "data/airports.csv",
      "format": {"type": "csv", "parse": "auto"},
      "transform": [
        {
          "type": "lookup", "on": "traffic", "onKey": "origin",
          "keys": ["iata"], "as": ["traffic"]
        },
        {
          "type": "filter",
          "test": "datum.traffic != null"
        },
        {
          "type": "geo", "projection": "albersUsa",
          "scale": 1200, "translate": [450, 280],
          "lon": "longitude", "lat": "latitude"
        },
        {
          "type": "filter",
          "test": "datum.layout_x != null && datum.layout_y != null"
        },
        { "type": "sort", "by": "-traffic.flights" },
        { "type": "voronoi", "x": "layout_x", "y": "layout_y" }
      ]
    },
    {
      "name": "routes",
      "url": "data/flights-airport.csv",
      "format": {"type": "csv", "parse": "auto"},
      "transform": [
        { "type": "filter", "test": "hover && hover.iata == datum.origin" },
        {
          "type": "lookup", "on": "airports", "onKey": "iata",
          "keys": ["origin", "destination"], "as": ["_source", "_target"]
        },
        { "type": "filter", "test": "datum._source && datum._target" },
        { "type": "linkpath" }
      ]
    }
  ]
}

Further research on Vega transforms

https://github.com/vega/vega-dataflow-examples/

It is quite difficult for me to read the code as there is not enough documentation. I have included the simplest example here:

vega-dataflow.js contains Dataflow, all the transforms, and Vega's utilities.

<!DOCTYPE HTML>
<html>
  <head>
    <title>Dataflow CountPattern</title>
    <script src="../../build/vega-dataflow.js"></script>
    <style>
      body { margin: 10px; font-family: Helvetica Neue, Arial; font-size: 14px; }
      textarea { width: 800px; height: 200px; }
      pre { font-family: Monaco; font-size: 10px; }
    </style>
  </head>
  <body>
    <textarea id="text"></textarea><br/>
    <input id="slider" type="range" min="2" max="10" value="4"/>
    Frequency Threshold<br/>
    <pre id="output"></pre>
  </body>
</html>

df is a Dataflow instance with which we register (.add) operators: plain parameters and functions (ft, ff, rt below) as well as transforms (cp, cc, fc, c1). We can pass different parameters to each transform depending on its requirements, and operators are connected through their pulse parameter, so updating an upstream operator re-runs everything downstream. Event handlers can be added using the Dataflow instance's .on method, as in the final statement: a keyup in the textarea re-runs readText, and the slider updates the ft threshold parameter.

var tx = vega.transforms; // all transforms 
var out = document.querySelector('#output');
var area = document.querySelector('#text');
area.value = [
  "Despite myriad tools for visualizing data, there remains a gap between the notational efficiency of high-level visualization systems and the expressiveness and accessibility of low-level graphical systems."
].join('\n\n');
var stopwords = "(i|me|my|myself|we|us|our|ours|ourselves|you|your|yours|yourself|yourselves|he|him|his)";

var get = vega.field('data');

function readText(_, pulse) {
  // drop previously ingested tuples, then ingest the current
  // textarea content as a single new data tuple
  if (this.value) pulse.rem = this.value;
  return pulse.source = pulse.add = [vega.ingest(area.value)];
}

function threshold(_) {
  // build a filter predicate from the current frequency threshold;
  // f.fields declares which tuple fields the predicate reads
  var freq = _.freq,
      f = function(t) { return t.count >= freq; };
  return (f.fields = ['count'], f);
}

function updatePage() {
  // render word counts, sorted by count (desc) then alphabetically
  out.innerText = c1.value.slice()
    .sort(function(a,b) {
      return (b.count - a.count)
        || (b.text > a.text ? -1 : a.text > b.text ? 1 : 0);
    })
    .map(function(t) {
      return t.text + ': ' + t.count;
    })
    .join('\n');
}

var df = new vega.Dataflow(), // create a new Dataflow instance
// then add various operators into Dataflow instance:
    ft = df.add(4), // word frequency threshold
    ff = df.add(threshold, {freq:ft}),
    rt = df.add(readText),
    // add transforms (tx):
    cp = df.add(tx.CountPattern, {field:get, case:'lower',
      pattern:'[\\w\']{2,}', stopwords:stopwords, pulse:rt}),
    cc = df.add(tx.Collect, {pulse:cp}),
    fc = df.add(tx.Filter, {expr:ff, pulse:cc}),
    c1 = df.add(tx.Collect, {pulse:fc}),
    up = df.add(updatePage, {pulse: c1});
df.on(df.events(area, 'keyup').debounce(250), rt)
  .on(df.events('#slider', 'input'), ft, function(_, e) { return +e.target.value; })
  .run();

DP Pipelines transforms

DPP provides a number of transforms that can be applied to a dataset. However, those transforms cannot run inside a browser, as the library requires Python to execute.
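For orientation, a minimal sketch of how such steps are wired together in a pipeline-spec.yaml file, following the layout described in the DPP docs (the pipeline id and the dump step are illustrative, and the steps that load the source resources are omitted):

multi-year-report:
  pipeline:
    - run: concatenate        # a transform step; parameters as documented below
      parameters:
        target:
          name: multi-year-report
          path: data/multi-year-report.csv
        sources: 'report-year-20[0-9]{2}'
    - run: dump.to_path       # write the resulting data package to disk
      parameters:
        out-path: output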

Below is a copy-paste from DPP docs:

concatenate

Concatenates a number of streamed resources and converts them to a single resource.

Parameters:

- sources - a pattern matching the names of the resources to concatenate (a regular expression in the example below)
- target - the name and path of the resulting concatenated resource
- fields - a mapping of target field names to lists of source field aliases to collect into them (an empty list means the field appears under the same name in the sources)

Example:

- run: concatenate
  parameters: 
    target:
      name: multi-year-report
      path: data/multi-year-report.csv
    sources: 'report-year-20[0-9]{2}'
    fields:
      activity: []
      amount: ['2009_amount', 'Amount', 'AMOUNT [USD]', '$$$']    

In this example we concatenate all resources whose names look like report-year-<year>, and output them to the multi-year-report resource.

The output contains two fields: activity, which appears under the same name in all the sources, and amount, collected from whichever of the listed aliases (2009_amount, Amount, AMOUNT [USD] or $$$) each source provides.

join

Joins two streamed resources.

"Joining" in our case means taking the target resource, and adding fields to each of its rows by looking up data in the source resource.

A special case of the join operation is when there is no target stream and all unique rows from the source are used to create it. This is called deduplication mode: the target resource will be created, and deduplicated rows from the source will be added to it.

Parameters:

- source - the resource to look up data in: its name, a key (a list of field names or a format string used to match rows) and delete (whether to drop the source resource from the resulting package)
- target - the resource to add fields to: its name and a matching key
- fields - the fields to add to the target; each may name the source field to copy and an aggregation strategy for multiple matches
- full - whether target rows with no match are kept with null values (true) or dropped (false)

Important: the "source" resource must appear before the "target" resource in the data-package.

Examples:

- run: join
  parameters: 
    source:
      name: world_population
      key: ["country_code"]
      delete: yes
    target:
      name: country_gdp_2015
      key: ["CC"]
    fields:
      population:
        name: "census_2015"        
    full: true

The above example aims to create a package containing the GDP and Population of each country in the world.

We have one resource (world_population) with data that looks like:

country_code  country_name    census_2000  census_2015
UK            United Kingdom  58857004     64715810
...

And another resource (country_gdp_2015) with data that looks like:

CC  GDP (£m)  Net Debt (£m)
UK  1832318   1606600
...

The join command will match rows in both datasets based on the country_code / CC fields, and then copy the value of the census_2015 field into a new population field.

The resulting data package will have the world_population resource removed and the country_gdp_2015 resource looking like:

CC  GDP (£m)  Net Debt (£m)  population
UK  1832318   1606600        64715810
...

A more complex example:

- run: join
  parameters: 
    source:
      name: screen_actor_salaries
      key: "{production} ({year})"
    target:
      name: mgm_movies
      key: "{title}"
    fields:
      num_actors:
        aggregate: 'count'
      average_salary:
        name: salary
        aggregate: 'avg'
      total_salaries:
        name: salary
        aggregate: 'sum'
    full: false

This example aims to analyse salaries for screen actors in the MGM studios.

Once more, we have one resource (screen_actor_salaries) with data that looks like:

year  production                   actor              salary
2016  Vertigo 2                    Mr. T              15000000
2016  Vertigo 2                    Robert Downey Jr.  7000000
2015  The Fall - Resurrection      Jeniffer Lawrence  18000000
2015  Alf - The Return to Melmack  The Rock           12000000
...

And another resource (mgm_movies) with data that looks like:

title                      director       producer
Vertigo 2 (2016)           Lindsay Lohan  Lee Ka Shing
iRobot - The Movie (2018)  Mr. T          Mr. T
...

The join command will match rows in both datasets based on the movie name and production year. Notice how we overcome incompatible fields by using different key patterns.

The resulting dataset could look like:

title             director       producer      num_actors  average_salary  total_salaries
Vertigo 2 (2016)  Lindsay Lohan  Lee Ka Shing  2           11000000        22000000
...
pwalsh commented 8 years ago

Would this not be better tied to a views spec? views -> transforms?

rufuspollock commented 8 years ago

@pwalsh not sure what you mean exactly. Transforms are independent of views, though they are commonly used with them ...

pwalsh commented 7 years ago

@rufuspollock not sure we need this open as an issue. Let's get a solid views spec, before we discuss transforms between view specs as states.

rufuspollock commented 7 years ago

@pwalsh rather than close I've moved to icebox as this is a genuine issue that I think we will need for the views spec very soon.

rufuspollock commented 7 years ago

UPDATE: just updated the description above with an analysis of some of the different transform systems and languages out there including vega and dp pipelines. We also have working transform support using vega dataflow in https://github.com/frictionlessdata/datapackage-render-js

ppKrauss commented 7 years ago

Hi, I am new to this discussion... Is this framework for data transforms only for data visualization?
... Or (my interest) can it also be used to enhance the traceability of data sources?

pwalsh commented 7 years ago

hi @ppKrauss

No, this is targeted at specifying data transformations for views on data like visualisations.

For a framework around traceability of data sources (data provenance), please see our Pipelines framework.