explorable-viz / fluid

Data-linked visualisations
http://f.luid.org
MIT License

Linked inputs figure, draft 1 #826

Closed JosephBond closed 10 months ago

JosephBond commented 10 months ago

First draft of linked inputs example in Fluid, plus enough supporting infrastructure to get a graphic into the paper.

Related tasks:


For PLDI it’s not uncommon to move the motivating example into its own bespoke section, and for now we’re taking the same approach. On that front we definitely have some questions that need answering imminently. The first main question is:

Do we continue with the compute-dtw example and associated material? I think we definitely can, but it needs to be done with a lot of care, since the example isn’t a toy problem (which is good).

If we do decide to continue with this example, I think the following points need to be discussed/attempted a few times:

rolyp commented 10 months ago

@JosephBond I’m going to pull those narrative tasks into a new todo for the Abstract and Introduction.

JosephBond commented 10 months ago

The main barrier is coming up with something where each part of the output uses only partial information from each of the two input sources.

  1. Image segmentation methods realistically induce a fair amount of global dependency, so they're not really appropriate for this. They also seem to be generally quite complex, so maybe not an option.
  2. Clustering algorithms seem to suffer from a similar global dependency problem.

JosephBond commented 10 months ago

One option I have sort of sketched out is as follows:

  1. Start with a collection of records defining some threshold values, and a matrix of records.
  2. For each cell in the matrix, find the set of records it could plausibly belong to (that is, the records in the lookup set whose conditions the cell satisfies).
  3. The linked inputs from the initial collection of records to the initial matrix are then just the cells that have been assigned the appropriate record.

Problem: this doesn't really demonstrate anything that couldn't be solved by just looking at the output matrix. Solution: with each record in the initial collection, we associate some output value. When a cell is filled in with the data from that record, we place the output value in there. If a cell could plausibly belong to multiple records, we average their outputs. Then we can't easily see from an output value alone which records were assigned to that cell.
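To make the averaging idea concrete, here is a minimal sketch in Python rather than Fluid. Everything in it is an assumption for illustration: the `Rule` record shape, the half-open threshold test, the use of a single number per matrix cell (instead of a record), and the sample values.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Rule:
    name: str
    lo: float      # inclusive lower threshold (hypothetical condition)
    hi: float      # exclusive upper threshold (hypothetical condition)
    output: float  # value written to cells that match this record

def matches(rule, value):
    """A cell matches a record when its value falls within the thresholds."""
    return rule.lo <= value < rule.hi

def assign(rules, matrix):
    """For each cell, average the outputs of every record the cell matches."""
    result = []
    for row in matrix:
        out_row = []
        for value in row:
            outs = [r.output for r in rules if matches(r, value)]
            # Cells matching no record get 0.0 (an arbitrary choice here).
            out_row.append(sum(outs) / len(outs) if outs else 0.0)
        result.append(out_row)
    return result

rules = [Rule("a", 0, 5, 10.0), Rule("b", 3, 8, 20.0), Rule("c", 7, 10, 15.0)]
matrix = [[1, 4], [7, 9]]
output = assign(rules, matrix)
print(output)  # [[10.0, 15.0], [17.5, 15.0]]
```

With these sample values, two cells both end up holding 15.0, but one got there by averaging records `a` and `b` and the other by matching `c` alone, so the output value does not reveal which records were used.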

JosephBond commented 10 months ago

As a variant/extension of the above, what about the problem of binning data, where an item can be placed into multiple bins at once? In essence, this generalizes the above. Potentially this lets us create linked-input dependencies that aren't instantly obvious from the output. (One problem with averaging a time series is that the matching we compute in DTW is really the essential dependency information one would want to know.)

So you place items into (potentially multiple) bins, and with each bin some sort of constant is associated. Then for each item, you combine the constants associated with the bins it has been placed into, and that's the output (either list or matrix).

We would need some very simple method for binning everything, and potentially need to give it some sort of interpretation from outside PL.

JosephBond commented 10 months ago

Let's take my earlier comment about the Records+Matrix idea. Setup: you are some sort of urban/agricultural/economic planner, looking to distribute some resource spatially (could be power, could be water for plants, etc).

You have a 2D map containing (potentially structured) data. Each cell on the map contains data about what is present there (for example, number of plants, animals in agriculture, the types of building/utility that are there).

We associate each cell with the collection of records which are appropriate. For example, a cell could have lots of grass, which needs X amount of water, but also lots of cattle, which need Y amount of water. Then in the output, that cell has X+Y water allocated to it (or something equivalent for, say, electrical power).

Then, linked inputs would let the user select a record (or part of a record) in the full set of records, and would find, via the output "allocation matrix", the set of all cells which matched that record in the original input. Similarly, selecting some cells in the input matrix could go via the output matrix to find the union of the records that the selected input cells were allocated.
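A rough Python sketch of both directions, under some loud assumptions: the record and map shapes are invented, the per-unit multiplication is just one reading of "X+Y", and the hand-tracked `used` matrix only stands in for the dependency information that Fluid would derive itself.

```python
# Hypothetical requirement records: water needed per unit of each land use.
requirements = {"grass": 2.0, "cattle": 5.0, "wheat": 3.0}

# Hypothetical 2D map: each cell records how many units of each land use it holds.
land = [
    [{"grass": 10}, {"grass": 4, "cattle": 2}],
    [{"wheat": 6}, {}],
]

def allocate(requirements, land):
    """Return the allocation matrix plus, per cell, the set of record names used."""
    alloc, used = [], []
    for row in land:
        alloc.append([sum(requirements[k] * n for k, n in cell.items()) for cell in row])
        used.append([set(cell) for cell in row])
    return alloc, used

def cells_for_record(used, record):
    """Record -> cells: every map cell whose allocation consumed this record."""
    return {(i, j) for i, row in enumerate(used)
            for j, names in enumerate(row) if record in names}

def records_for_cells(used, cells):
    """Cells -> records: union of the records consumed by the selected cells."""
    return set().union(*(used[i][j] for i, j in cells))

alloc, used = allocate(requirements, land)
grass_cells = cells_for_record(used, "grass")
selected_records = records_for_cells(used, [(0, 1), (1, 0)])
```

Note that cells (0, 1) and (1, 0) both receive 18.0 units here but draw on different records, which is exactly the kind of relationship that isn't visible from the allocation matrix alone.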

This way we get a slightly different flavour of linked outputs in each direction and some level of data parallelism (each cell is completely distinct), which lets us demonstrate the performance and also the linked output capability.

rolyp commented 10 months ago

That last suggestion sounds more promising.

Here’s another to think about. This doesn’t involve matrices, but perhaps that’s an advantage, as we can start simple and gradually add complexity (e.g. matrices) later. Suppose we have some kind of dataset with two tiers of hierarchical structure, e.g. country and city within country, and then some kind of absolute quantity (e.g. greenhouse gas emissions in metric tons) per city. To aggregate by country (e.g. for a “greenhouse gas emissions in urban areas” report) one would probably want to calculate per capita emissions for each city, which would rely on a different table of urban populations.
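As a starting point, the two-table computation might look something like the Python sketch below. The country and city names and all numbers are made-up placeholders, and aggregating by country as total emissions over total urban population is just one possible choice.

```python
# Made-up placeholder data: (country, city, emissions in metric tons).
emissions = [("A", "a1", 100.0), ("A", "a2", 50.0), ("B", "b1", 90.0)]
# Separate table of urban populations, keyed by city (also made up).
population = {"a1": 20, "a2": 5, "b1": 30}

def per_capita_by_city(emissions, population):
    """Each city's value depends on one emissions row and one population row."""
    return {city: tonnes / population[city] for _, city, tonnes in emissions}

def per_capita_by_country(emissions, population):
    """Total urban emissions divided by total urban population, per country."""
    totals = {}
    for country, city, tonnes in emissions:
        t, p = totals.get(country, (0.0, 0))
        totals[country] = (t + tonnes, p + population[city])
    return {country: t / p for country, (t, p) in totals.items()}

by_city = per_capita_by_city(emissions, population)
by_country = per_capita_by_country(emissions, population)
print(by_city)     # {'a1': 5.0, 'a2': 10.0, 'b1': 3.0}
print(by_country)  # {'A': 6.0, 'B': 3.0}
```

Selecting country `A` in the output should then link the emissions rows for `a1` and `a2` with the corresponding rows in the population table, and nothing else.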

Maybe this would lend itself to a bubble chart, especially if we had a third table cross-referencing per-capita GDP for each country (say) which would let us plot emissions (as bubble size) against GDP along one axis (not sure what the other axis would be yet).

We could build out the complexity, e.g. using colour as a 4th dimension (say for geopolitical regions like Africa, Middle East, etc), contrasting with a second bubble chart plotting city emissions per capita rather than country emissions, plotting against other datasets like urban heat island to look for correlations, or adding a temporal dimension. As well as showing how different data source selections induce various usages of other data sources, it would also be good to show how different choices of “mediating output” induce different patterns of usage.

rolyp commented 10 months ago

Renamed to re-focus this particular task around just the example itself – associated narrative etc can be done as part of explorable-viz/graphical-slicing#262. I’m also going to transfer this task to the Fluid repo (which requires a bit of a workaround as it’s a private repo, so give me a couple of mins).

rolyp commented 10 months ago

Ok so this task is now in explorable-viz/fluid and covers coming up with the first draft of the example + test case that exercises it. Exactly when the example can integrate with #796 will depend on how we get on today.

rolyp commented 10 months ago

Rescoped this task to include getting an initial graphic into the paper. Hoping to do this by end of the day.