@timhall Looks good. I'd like to discuss the simple/complex parts of the usage example above, as well as the addRows part on line 502 of data.js, tomorrow during stand-up.

An accumulated transform example would be helpful. It could be used to a) show how we could add a new series to the store based on existing data via a new method on the store object, and b) show how we could create a transformed chart series on-the-fly via the transform() method on the subset object.
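To make the discussion concrete, here's one hypothetical shape the accumulated transform could take; addSeries and transform are proposed names here, not existing methods:

// a) Add a new accumulated series to the store based on existing data
// (addSeries is a hypothetical method name)
store.addSeries('cumulative', function(rows) {
  var total = 0;
  return rows.map(function(row) {
    total += row.y;
    return {x: row.x, y: total};
  });
});

// b) Create a transformed chart series on-the-fly from a subset
// (transform is the proposed method on the subset object)
var cumulative = subset.transform(function(rows) {
  var total = 0;
  return rows.map(function(row) {
    total += row.y;
    return {x: row.x, y: total};
  });
});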
I've updated Store to add loading and denormalization behavior. Here's how it works now:
var store = new data.Store();
// Load csv files (in parallel, asynchronously)
store.load(['a.csv', 'b.csv']);
/*
  Example:
  [
    {year: '2000', maleResults: '100', femaleResults: '200', __filename: 'a.csv'},
    {year: '2001', maleResults: '300', femaleResults: '400', __filename: 'b.csv'}
  ]
*/
// Register normalizer (used to convert raw data strings to values)
store.normalize(function(row) {
  return {
    year: new Date(+row.year, 1, 0),
    maleResults: +row.maleResults,
    femaleResults: +row.femaleResults
  };
});
// Register denormalizer (used to map raw data to denormalized table)
store.denormalize({
  x: 'year',
  y: {
    columns: ['maleResults', 'femaleResults'],
    // Map y-column -> category values
    categories: {
      maleResults: {gender: 'male'},
      femaleResults: {gender: 'female'}
    }
    /*
      Alternatives (lots of possibilities):
      category: 'resultType' -> resultType: 'maleResults' or 'femaleResults'
      or
      categories: {
        maleResults: {isMale: true, isFemale: false},
        femaleResults: {isMale: false, isFemale: true}
      }
    */
  }
});
// Get values (once load + normalize + denormalize completes)
store.values(function(rows) {
  /*
    rows = [
      {x: 2000, y: 100, gender: 'male', __filename: 'a.csv'},
      {x: 2000, y: 200, gender: 'female', __filename: 'a.csv'},
      {x: 2001, y: 300, gender: 'male', __filename: 'b.csv'},
      {x: 2001, y: 400, gender: 'female', __filename: 'b.csv'}
    ]
  */
});
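Conceptually, the denormalize step is a wide-to-long reshape: each configured y-column becomes its own output row, tagged with that column's category values. A minimal sketch of the per-row transform (simplified, not the actual implementation; denormalizeRow is a hypothetical name):

function denormalizeRow(row, config) {
  return config.y.columns.map(function(column) {
    // One output row per y-column, keeping x and the source filename
    var result = {x: row[config.x], y: row[column], __filename: row.__filename};

    // Attach the category values registered for this column (e.g. gender)
    var categories = config.y.categories[column];
    for (var key in categories) {
      result[key] = categories[key];
    }
    return result;
  });
}

// denormalizeRow({year: 2000, maleResults: 100, femaleResults: 200, __filename: 'a.csv'}, config)
// -> [{x: 2000, y: 100, gender: 'male', __filename: 'a.csv'},
//     {x: 2000, y: 200, gender: 'female', __filename: 'a.csv'}]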
Also added caching of loaded (and loading) values, so the request for a given csv happens just once, even when it's currently being loaded.
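For reference, a common way to get that behavior is to cache the in-flight promise per filename, so concurrent callers share one request. A rough sketch, assuming a promise-returning csv loader (requestCsv is hypothetical):

var cache = {};

function loadCsv(filename) {
  // Whether the file is loaded or still loading, the cached promise is reused
  if (!cache[filename]) {
    cache[filename] = requestCsv(filename); // hypothetical loader returning a promise
  }
  return cache[filename];
}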
Yep, this is looking great. Couple of suggestions:

1. normalize and denormalize: something like cast and map might be more intuitive? Those are a bit overloaded too, though.
2. The load call: instead of just sending the filepath, we might send a config object (including a filename) with the cast+map pieces encapsulated. Or maybe two parameters -- the file (or list of files) + a config object. Otherwise, it seems that we're limited to only one type of cast+map config for all files. But perhaps you've handled this in a different way?

Yeah, I wasn't crazy about those names either, I'll rename them.
It's currently organized to use just a single cast+map; I'm still trying to figure out the best way to configure a custom mapping per file. The main issue with setting it per file is that the intended query format (example below) was going to include a needs: [files...] parameter, so that loading can be done as-needed per query and each query doesn't need to filter by __filename directly. But this would rely on the cast+map configured for the store, and I don't think it would be proper to put the mapping with the query. I have some ideas on it; it just hasn't been added yet.
Idea for query:
store.query({
  /*
    Raw: year, input, normalizedInput, output, normalizedOutput
    Denormalized: x, y, input (true/false), output (true/false), normalized (true/false)
  */
  input: {
    filter: {input: true},
    group: {
      normalized: {
        // Separate into series by key, name
        true: {key: 'normalized-input', name: 'Normalized Input'},
        false: {key: 'input', name: 'Input'}
      }
    }
  },
  results: {
    filter: {input: false},
    group: {
      normalized: {
        true: {key: 'normalized-results', name: 'Normalized Results'},
        false: {key: 'results', name: 'Results'}
      }
    }
  },

  // Load this file (if necessary) and limit all results to this file
  needs: 'chart1.csv'
});
What does true: {...} and false: {...} mean?
That's mapping the group value (in this case the choices are true/false) to the series that values in that group should be put into. In the store example it would be:
group: {
  gender: {
    male: {...},
    female: {...}
  }
}
The issue is that whenever a split happens, we need to know what to call the children of that split so that the series will have names, classes, etc. In this case we're grouping/splitting by gender, but we need to name the groups, so as it stands the naming is mapped to the group value (male/female).
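So the group config does two jobs: it picks the column to split on and supplies the key/name for each resulting series. A sketch of that split, built on underscore (splitIntoSeries is a hypothetical helper name):

function splitIntoSeries(rows, column, labels) {
  // Group rows by the value of the split column (e.g. gender)
  var grouped = _.groupBy(rows, column);

  // Name each group using the configured key/name mapping
  return _.map(grouped, function(values, groupValue) {
    var label = labels[groupValue]; // e.g. {key: 'input', name: 'Input'}
    return {key: label.key, name: label.name, values: values};
  });
}

// splitIntoSeries(rows, 'gender', {
//   male: {key: 'male', name: 'Male'},
//   female: {key: 'female', name: 'Female'}
// })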
Oh, I see. Yes, that makes perfect sense. Perhaps let's change the (arbitrarily chosen) name of true to normalized and false to not-normalized. In fact, let's change normalized to something else, as it might be misinterpreted as a database/structure descriptor. Perhaps estimated and observed would work?
Ok, I've refactored the Store implementation to allow for cast and map options per load, and fleshed out cast to allow for options. An example:
// Store defaults for cast/map
store.cast({
  a: 'Number',
  b: 'Number',
  c: 'Number',
  isNew: 'Boolean',
  lastModified: 'Date'
}).map({
  x: 'a',
  y: ['b', 'c']
});

store.load('a.csv', {
  cast: {
    // special cast() for a.csv...
  },
  map: {
    // special map() for a.csv...
  }
});
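One way load could resolve the per-file options is simple precedence: a per-load cast/map wins, and otherwise it falls back to the store defaults. A sketch of that lookup (hypothetical internals, not the committed code):

function resolveOptions(defaults, options) {
  options = options || {};
  return {
    cast: options.cast || defaults.cast, // per-load cast wins over store default
    map: options.map || defaults.map
  };
}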
Also, started on some work for Query with a new matcher utility for use with filter and some other places (based on the MongoDB query language):
var row = {a: 10, b: 3.14, c: true, d: 'testing'};

matcher({a: 10}, row); // -> true
matcher({$and: {a: 10, c: true}}, row); // -> true
matcher({$or: {a: {$gt: 4}, b: {$lte: 0}}}, row); // -> true
matcher({d: {$in: ['test', 'testing']}}, row); // -> true
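For reference, a compact matcher covering just the operators used above could look like this (a sketch built on underscore, not the actual utility):

function matcher(query, row) {
  return _.every(query, function(value, key) {
    if (key === '$and') {
      return _.every(value, function(v, k) { return matchValue(row[k], v); });
    }
    if (key === '$or') {
      return _.some(value, function(v, k) { return matchValue(row[k], v); });
    }
    return matchValue(row[key], value);
  });
}

function matchValue(actual, expected) {
  // Operator object, e.g. {$gt: 4} or {$in: [...]}
  if (_.isObject(expected) && !_.isArray(expected)) {
    return _.every(expected, function(operand, operator) {
      if (operator === '$gt') return actual > operand;
      if (operator === '$gte') return actual >= operand;
      if (operator === '$lt') return actual < operand;
      if (operator === '$lte') return actual <= operand;
      if (operator === '$in') return _.contains(operand, actual);
      return false;
    });
  }
  // Plain value -> equality
  return actual === expected;
}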
Yep -- the store piece is looking good. Let's talk a bit about the MongoDB query language on Monday. Also, Peter might have some insight for us, as he's done some recent research.
I've fleshed out the Query implementation, and it's advanced enough now to handle a detailed MISCAN example (currently in an uncommitted portion, but here is part of it):
// Parameters
example.parameters = {
  population: ['wm', 'ww', 'bm', 'bw'],
  populationDistribution: {wm: 0.2233, ww: 0.3811, bm: 0.2229, bw: 0.1726},
  budget: 1000000,
  program: ['col60', 'col5064', 'fit1', 'onefit'],
  horizon: ['1yr', '2yr', '3yr', '5yr']
};
var SIMULATION_BUDGET = 1000000;

// Add mapping by filename
data.castByFilename(function(filename, row) {
  var parameters = helpers.filename.toParameters(filename);

  // Convert to Numbers
  row.year = +row.year;
  row.prog_cost = +row.prog_cost;
  row.crc_deaths_prev = +row.crc_deaths_prev;
  row.lys_gained = +row.lys_gained;
  row.qalys_gained = +row.qalys_gained;
  row.cols_in_prog = +row.cols_in_prog;

  // Add parameters from filename
  row.population = parameters.population;
  row.program = parameters.program;
  row.horizon = parameters.horizon;

  return row;
});

data.map({
  x: 'year',
  y: {
    columns: ['crc_deaths_prev', 'lys_gained', 'qalys_gained', 'cols_in_prog'],
    category: 'type'
  }
});
// Generate query from parameters
example.query = function query(parameters, query) {
  parameters = _.defaults(parameters || {}, example.parameters);

  // Convert parameters to parameterized array
  var parameterized = helpers.parameterize(parameters);

  // Convert parameters to files needed
  var files = _.map(parameterized, helpers.filename.fromParameters);

  // Create query
  return data.query(_.extend(query, {
    from: files,
    groupBy: ['program', 'horizon', 'type'],
    postprocess: function(values, meta) {
      // Weight y-values for each year by population
      var rowsByYear = {};
      _.each(values, function(row) {
        var rowByYear = rowsByYear[row.x];
        if (!rowByYear) {
          rowByYear = rowsByYear[row.x] = _.extend({}, row);

          // Reset y and remove population
          rowByYear.y = 0;
          delete rowByYear.population;
        }

        // Weighted average of y by population
        var populationWeight = parameters.populationDistribution[row.population];
        rowByYear.y += row.y * populationWeight;
      });
      values = _.values(rowsByYear);

      // Weight all y-values by given budget
      var budgetWeight = parameters.budget / SIMULATION_BUDGET;
      _.each(values, function(row) {
        row.y *= budgetWeight;
      });

      return values;
    }
  }));
};
// Goal
var crcDeathsByProgram = example.query(example.parameters, {
  filter: {
    horizon: '1yr',
    type: 'crc_deaths_prev'
  },
  series: [
    {meta: {program: 'col60', horizon: '1yr', type: 'crc_deaths_prev'}, key: 'col60', name: '10-yearly colonoscopy'},
    {meta: {program: 'col5064', horizon: '1yr', type: 'crc_deaths_prev'}, key: 'col5064', name: 'one time colonoscopy'},
    {meta: {program: 'fit1', horizon: '1yr', type: 'crc_deaths_prev'}, key: 'fit1', name: 'yearly FIT'},
    {meta: {program: 'onefit', horizon: '1yr', type: 'crc_deaths_prev'}, key: 'onefit', name: '2-yearly FIT'}
  ]
});
/*
  results -> [
    {
      key: 'col60',
      name: '10-yearly colonoscopy',
      meta: {program: 'col60', horizon: '1yr', type: 'crc_deaths_prev'},
      values: [
        {x: 2013, y: (weighted by population and budget)},
        ...
      ]
    },
    {
      key: 'col5064',
      name: 'one time colonoscopy',
      meta: {program: 'col5064', horizon: '1yr', type: 'crc_deaths_prev'},
      values: [
        {x: 2013, y: (weighted by population and budget)},
        ...
      ]
    },
    ...
  ]
*/
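The example leans on helpers.parameterize to expand the parameter object into one combination per simulation run. The real helper isn't shown in this thread, but a plausible sketch of the expansion, restricted to the array-valued parameters (population, program, horizon) since populationDistribution and budget aren't file dimensions:

// Hypothetical: expand {population: ['wm', 'ww'], program: ['col60'], ...}
// into [{population: 'wm', program: 'col60', ...}, {population: 'ww', ...}, ...]
function parameterize(parameters, keys) {
  keys = keys || ['population', 'program', 'horizon'];
  var combinations = [{}];
  _.each(keys, function(key) {
    var expanded = [];
    _.each(combinations, function(combination) {
      _.each(parameters[key], function(value) {
        // Copy the partial combination and add this parameter value
        expanded.push(_.extend({}, combination, _.object([key], [value])));
      });
    });
    combinations = expanded;
  });
  return combinations;
}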
Ok, I've moved everything in data over to data-manager, so this branch will add the following:
Add Store, Subset, and Series:

- Store contains all data and will include load and other methods for loading data from csv/json
- Subset allows for processing the store without changes affecting the store
- Series contains helpers for converting raw data to a series representation

Goal usage: