CSNW / d3.compose

Compose complex, data-driven visualizations from reusable charts and components with d3
http://CSNW.github.io/d3.compose/
MIT License

Data layer spike #7

Closed: timhall closed this issue 10 years ago

timhall commented 10 years ago

Add Store, Subset, and Series

Goal usage:

store = new data.Store();
store.load('data.csv')
  .normalize(function(row) {
    // Normalize data as it comes in to the expected format
    row.date = new Date(row.date);

    return row;
  })
  .filter(function(row) {
    // Apply global filter to data
    return !row.isEstimate;
  })
  .values(function(values) {
    console.log('all values (loaded async)');
  });

// Create subset for processing into chart data
var chartData = store.subset()
  .transform(function(row) {
    // Add helper column just for chart
    row.avg = (row.a + row.b) / 2;
    return row;
  })
  .filter(function(row) {
    // Filter rows for chart
    return row.year >= 2000;
  })
  .process(function(rows) {
    // Process all rows into series form
    return {
      simple: data.Series(rows)
        .x('year')
        .y({yValue: 'avg', series: {name: 'Avg'}})
        .value(),
      complex: data.Series(rows)
        .x('year')
        .y([
          {yValue: 'a', series: {name: 'A'}},
          {yValue: 'b', series: {name: 'B'}}
        ])
        .value()
    };
  });

// Load values and draw chart
chartData.values(function(values) {
  // Processed data (loaded async)
  chart.draw(values);
});
john-clarke commented 10 years ago

@timhall Looks good. I'd like to discuss the simple/complex parts of the usage example above as well as the addRows part on line 502 of data.js tomorrow during stand-up.

An accumulated transform example would be helpful. It could be used to show a) how we could add a new series to the store based on existing data via a new method on the store object and b) how we could create a transformed chart series on-the-fly via the transform() method on the subset object (see the sketch below).
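
For reference, a hypothetical sketch of what (b) might look like against the goal API above (the running-total logic and the cumulative column are illustrative assumptions, not part of the spike):

// Hypothetical: accumulate a running total on-the-fly via transform()
var runningTotal = 0;
var cumulativeData = store.subset()
  .transform(function(row) {
    // cumulative is an illustrative helper column, not part of the spike
    runningTotal += row.a;
    row.cumulative = runningTotal;
    return row;
  })
  .process(function(rows) {
    return data.Series(rows)
      .x('year')
      .y({yValue: 'cumulative', series: {name: 'Cumulative A'}})
      .value();
  });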

timhall commented 10 years ago

I've updated the Store behavior to add loading and denormalization behavior. Here's how it works now:

var store = new data.Store();

// Load csv files (in parallel, asynchronously)
store.load(['a.csv', 'b.csv']);

/*
Example:
[
  {year: '2000', maleResults: '100', femaleResults: '200', __filename: 'a.csv'},
  {year: '2001', maleResults: '300', femaleResults: '400', __filename: 'b.csv'}
]
*/

// Register normalizer (used to convert raw data strings to values)
store.normalize(function(row) {
  return {
    year: new Date(+row.year, 1, 0),
    maleResults: +row.maleResults,
    femaleResults: +row.femaleResults
  };
});

// Register denormalizer (used to map raw data to denormalized table)
store.denormalize({
  x: 'year',
  y: {
    columns: ['maleResults', 'femaleResults'],

    // Map y-column -> category values
    categories: {
      maleResults: {gender: 'male'},
      femaleResults: {gender: 'female'}
    }

    /*
      Alternatives (lots of possibilities):
      category: 'resultType' -> resultType: 'maleResults' or 'femaleResults'
      or
      categories: {
        maleResults: {isMale: true, isFemale: false},
        femaleResults: {isMale: false, isFemale: true},
      }
    */
  }
});

// Get values (once load + normalize + denormalize completes)
store.values(function(rows) {
  /*
  rows = [
    {x: 2000, y: 100, gender: 'male', __filename: 'a.csv'},
    {x: 2000, y: 200, gender: 'female', __filename: 'a.csv'},
    {x: 2001, y: 300, gender: 'male', __filename: 'b.csv'},
    {x: 2001, y: 400, gender: 'female', __filename: 'b.csv'}
  ]
  */
});

Also, I've added caching of loaded (and loading) values so the request for a csv happens just once, even when it's currently in flight.
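
A minimal sketch of that caching approach (illustrative, not the committed code; it assumes a Promise implementation is available and the cache name/structure are placeholders):

// Sketch: cache the in-flight promise so concurrent requests share one load
var cache = {};
function cachedLoad(path) {
  if (!cache[path]) {
    cache[path] = new Promise(function(resolve, reject) {
      d3.csv(path, function(err, rows) {
        if (err) reject(err);
        else resolve(rows);
      });
    });
  }
  return cache[path];
}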

john-clarke commented 10 years ago

Yep, this is looking great. A couple of suggestions:

timhall commented 10 years ago

Yeah, I wasn't crazy about those names either, I'll rename them.

It's currently organized to use just a single cast+map; I'm still trying to figure out the best way to configure a custom mapping per file. The main issue with setting it per file is that the intended query format (example below) will include a needs: [files...] parameter so that loading can be done as-needed per query and each query doesn't need to filter by __filename directly. But that relies on the cast+map configured for the store, and I don't think it would be proper to put the mapping with the query. I have some ideas on it; it just hasn't been added yet.

Idea for query:

store.query({
  /*
    Raw: year, input, normalizedInput, output, normalizedOutput
    Denormalize:
      x, y, input (true/false), output (true/false), normalized (true/false)
  */

  input: {
    filter: {input: true},
    group: {
      normalized: {
        // Separate into series key, name
        true: {key: 'normalized-input', name: 'Normalized Input'},
        false: {key: 'input', name: 'Input'}
      }
    }
  },
  results: {
    filter: {input: false},
    group: {
      normalized: {
        true: {key: 'normalized-results', name: 'Normalized Results'},
        false: {key: 'results', name: 'Results'}
      }
    }
  },

  // Load this file (if necessary) and limit all results to this file
  needs: 'chart1.csv'
})
john-clarke commented 10 years ago

What do true: {...} and false: {...} mean?

timhall commented 10 years ago

That maps each group value (in this case the choices are true/false) to the series that values in that group are placed in. In the store example it would be:

group: {
  gender: {
    male: {...},
    female: {...}
  }
}

The issue is that whenever a split happens, we need to know what to call the children of that split so that the series have names, classes, etc. In this case we're grouping/splitting by gender, but we need to name the groups, so as it stands the naming is keyed by group value (male/female).
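
In other words, the grouping step does roughly the following (a simplified sketch of the idea, not the actual implementation):

// Simplified sketch: split rows into named series by a group column
function groupRows(rows, column, names) {
  var series = {};
  rows.forEach(function(row) {
    var value = row[column];
    if (!series[value]) {
      // Look up key/name/etc. for this group value (e.g. male/female)
      series[value] = _.extend({values: []}, names[value]);
    }
    series[value].values.push(row);
  });
  return _.values(series);
}

// Usage:
// groupRows(rows, 'gender', {
//   male: {key: 'male', name: 'Male'},
//   female: {key: 'female', name: 'Female'}
// });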

john-clarke commented 10 years ago

Oh, I see. Yes, that makes perfect sense. Perhaps let's change the (arbitrarily chosen) name of true to normalized and false to not-normalized. In fact, let's change normalized to something else, as it might be misinterpreted as a database/structure descriptor. Perhaps estimated and observed would work?
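
e.g. something like this (resultType as the column name is just a placeholder):

group: {
  resultType: {
    estimated: {key: 'estimated', name: 'Estimated'},
    observed: {key: 'observed', name: 'Observed'}
  }
}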

timhall commented 10 years ago

Ok, I've refactored the Store implementation to allow for cast and map options per load and fleshed out cast to allow for options. An example:

// Store defaults for cast/map
store.cast({
  a: 'Number',
  b: 'Number',
  c: 'Number',
  isNew: 'Boolean',
  lastModified: 'Date'
}).map({
  x: 'a',
  y: ['b', 'c']
});

store.load('a.csv', {
  cast: {
    // special cast() for a.csv...
  },
  map: {
    // special map() for a.csv...
  }
});
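
Under the hood, the string type names presumably map to conversion functions, roughly like this (a sketch under that assumption, not the committed code):

// Sketch: map cast type names to converter functions (illustrative)
var casts = {
  'Number': function(value) { return +value; },
  'Boolean': function(value) { return value === 'true' || value === true; },
  'Date': function(value) { return new Date(value); },
  'String': function(value) { return String(value); }
};

function applyCast(row, config) {
  _.each(config, function(type, column) {
    row[column] = casts[type](row[column]);
  });
  return row;
}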

Also, started on some work for Query with a new matcher utility for use with filter and some other places (based on MongoDB query language):

var row = {a: 10, b: 3.14, c: true, d: 'testing'};
matcher({a: 10}, row) // -> true
matcher({$and: {a: 10, c: true}}, row) // -> true
matcher({$or: {a: {$gt: 4}, b: {$lte: 0}}}, row) // -> true
matcher({d: {$in: ['test', 'testing']}}, row) // -> true
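
A minimal sketch of how a matcher along those lines could work for the operators shown (illustrative, not the committed implementation; the pair helper is a placeholder):

// Minimal matcher sketch supporting the operators above (illustrative)
function matcher(query, row) {
  return _.every(query, function(condition, key) {
    if (key === '$and')
      return _.every(condition, function(value, k) { return matcher(pair(k, value), row); });
    if (key === '$or')
      return _.some(condition, function(value, k) { return matcher(pair(k, value), row); });

    if (_.isObject(condition)) {
      // Operator object, e.g. {$gt: 4} or {$in: [...]}
      return _.every(condition, function(value, op) {
        if (op === '$gt') return row[key] > value;
        if (op === '$gte') return row[key] >= value;
        if (op === '$lt') return row[key] < value;
        if (op === '$lte') return row[key] <= value;
        if (op === '$in') return _.contains(value, row[key]);
        return false;
      });
    }

    // Direct equality
    return row[key] === condition;
  });
}

// Hypothetical helper: wrap a key/value back into a single-key query object
function pair(key, value) {
  var q = {};
  q[key] = value;
  return q;
}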
john-clarke commented 10 years ago

Yep -- the store piece is looking good. Let's talk a bit about the MongoDB query language on Monday. Also, Peter might have some insight for us, as he's done some recent research.

timhall commented 10 years ago

I've fleshed out the Query implementation and it's advanced enough now to handle a detailed MISCAN example (currently in a non-committed part, but here's a portion of it):

// Parameters
example.parameters = {
  population: ['wm', 'ww', 'bm', 'bw'],
  populationDistribution: {wm: 0.2233, ww: 0.3811, bm: 0.2229, bw: 0.1726},
  budget: 1000000,
  program: ['col60', 'col5064', 'fit1', 'onefit'],
  horizon: ['1yr', '2yr', '3yr', '5yr']
};
var SIMULATION_BUDGET = 1000000;

// Add mapping by filename
data.castByFilename(function(filename, row) {
  var parameters = helpers.filename.toParameters(filename);

  // Convert to Numbers
  row.year = +row.year;
  row.prog_cost = +row.prog_cost;
  row.crc_deaths_prev = +row.crc_deaths_prev;
  row.lys_gained = +row.lys_gained;
  row.qalys_gained = +row.qalys_gained;
  row.cols_in_prog = +row.cols_in_prog;

  // Add parameters from filename
  row.population = parameters.population;
  row.program = parameters.program;
  row.horizon = parameters.horizon;

  return row;
});
data.map({
  x: 'year',
  y: {
    columns: ['crc_deaths_prev', 'lys_gained', 'qalys_gained', 'cols_in_prog'],
    category: 'type'
  }
});

// Generate query from parameters
example.query = function query(parameters, query) {
  parameters = _.defaults(parameters || {}, example.parameters);

  // Convert parameters to parameterized array
  var parameterized = helpers.parameterize(parameters);

  // Convert parameters to files needed
  var files = _.map(parameterized, helpers.filename.fromParameters);

  // Create query
  return data.query(_.extend(query, {
    from: files,
    groupBy: ['program', 'horizon', 'type'],
    postprocess: function(values, meta) {
      // Weight y-values for each year by population
      var rowsByYear = {};
      _.each(values, function(row) {
        var rowByYear = rowsByYear[row.x];
        if (!rowByYear) {
          rowByYear = rowsByYear[row.x] = _.extend({}, row);

          // Reset y and remove population
          rowByYear.y = 0;
          delete rowByYear.population;
        }

        // Weighted average of y by population
        var populationWeight = parameters.populationDistribution[row.population];
        rowByYear.y += row.y * populationWeight;
      });
      values = _.values(rowsByYear);

      // Weight all y-values by given budget
      var budgetWeight = parameters.budget / SIMULATION_BUDGET;
      _.each(values, function(row) {
        row.y *= budgetWeight;
      });

      return values;
    }
  }));
};

// Goal
var crcDeathsByProgram = example.query(example.parameters, {
  filter: {
    horizon: '1yr',
    type: 'crc_deaths_prev'
  },
  series: [
    {meta: {program: 'col60', horizon: '1yr', type: 'crc_deaths_prev'}, key: 'col60', name: '10-yearly colonoscopy'},
    {meta: {program: 'col5064', horizon: '1yr', type: 'crc_deaths_prev'}, key: 'col5064', name: 'one time colonoscopy'},
    {meta: {program: 'fit1', horizon: '1yr', type: 'crc_deaths_prev'}, key: 'fit1', name: 'yearly FIT'},
    {meta: {program: 'onefit', horizon: '1yr', type: 'crc_deaths_prev'}, key: 'onefit', name: '2-yearly FIT'}
  ]
});

/*
results: -> [
  {
    key: 'col60',
    name: '10-yearly colonoscopy',
    meta: {program: 'col60', horizon: '1yr', type: 'crc_deaths_prev'}, 
    values: [
      {x: 2013, y: weighted by population and budget}
      ...
    ]
  },
  {
    key: 'col5064',
    name: 'one time colonoscopy',
    meta: {program: 'col5064', horizon: '1yr', type: 'crc_deaths_prev'}, 
    values: [
      {x: 2013, y: weighted by population and budget}
      ...
    ]
  },
  ...
]
*/
timhall commented 10 years ago

Ok, I've moved everything in data over to data-manager so this branch will add the following: