explorable-viz / fluid

Data-linked visualisations
http://f.luid.org
MIT License

New examples #733

Closed rolyp closed 11 months ago

rolyp commented 1 year ago

We need some new non-trivial motivating examples for the paper.

Current candidates

Other thoughts

Time series distance metrics and related notions

Array reshaping and other tensor operations

Image processing

Other statistical/probabilistic analyses

Rejected examples

min-nguyen commented 1 year ago

Re array reshaping, we could look at a workflow for data cleaning and preprocessing (sketched in code after this list). For example:

  1. Loading a CSV file
  2. Removing missing (NaN) and redundant information to create a filtered data set
  3. Deciding on a target column: creating a matrix of independent variables (features) and a vector of the dependent variable (target).
  4. Normalising/combining different columns to create more meaningful variables/features
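
A hedged pandas sketch of steps 1–4, purely for illustration: the file name `data.csv` and the columns `price`, `rooms` and `area` are hypothetical placeholders, not part of any real example.

```python
# Hypothetical preprocessing sketch; file and column names are placeholders.
import pandas as pd

# 1. Load the raw CSV.
df = pd.read_csv("data.csv")

# 2. Remove rows with missing (NaN) values and redundant duplicate rows.
df = df.dropna().drop_duplicates()

# 3. Decide on a target column: y is the vector of the dependent variable,
#    X the matrix of independent variables (features).
y = df["price"]
X = df.drop(columns=["price"])

# 4. Normalise numeric columns and combine columns into a derived feature.
num = X.select_dtypes("number")
X[num.columns] = (num - num.mean()) / num.std()
X["rooms_per_area"] = df["rooms"] / df["area"]  # hypothetical combination
```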

Re image processing, an example of a basic (but realistic) concrete workflow (sketched in code after this list) could be to:

  1. Optional: if we care about RGB images, applying a grayscale conversion first.
  2. Applying a simple image-blurring technique to perform noise reduction, e.g. with a Gaussian filter.
  3. Applying a simple gradient calculation/edge detection algorithm, e.g. with a Sobel filter.
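
A minimal sketch of that pipeline, under some assumptions: `imageio` and `scipy` are available, and `photo.png` is a hypothetical RGB input.

```python
# Grayscale -> Gaussian blur -> Sobel edges; input file is hypothetical.
import numpy as np
import imageio.v3 as iio
from scipy import ndimage

rgb = iio.imread("photo.png").astype(float)  # shape (H, W, 3), assumed RGB

# 1. Grayscale conversion using standard luminance weights.
gray = rgb[..., :3] @ np.array([0.299, 0.587, 0.114])

# 2. Noise reduction with a Gaussian filter.
blurred = ndimage.gaussian_filter(gray, sigma=1.5)

# 3. Edge detection: gradient magnitude from horizontal/vertical Sobel filters.
gx = ndimage.sobel(blurred, axis=1)
gy = ndimage.sobel(blurred, axis=0)
edges = np.hypot(gx, gy)
```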

There are many more possible steps you could compose onto this. I think image processing tells the story of composition quite well, and that could be motivation enough to differentiate it from the POPL paper.

rolyp commented 1 year ago

@min-nguyen These are great. Let’s start thinking about questions you might find yourself (as a programmer) asking in these application domains that could be answered by backwards/forwards slicing or linked inputs/outputs.

Observation: linked inputs ($\triangleright^{\circ} \circ \triangleleft$) and linked outputs ($\triangleleft \circ \triangleright^{\circ}$) reveal different information depending on how much of the pipeline you’re running the analysis over. For example, suppose the pipeline has two steps $\mathsf{parse} \circ \mathsf{lex}$. Then linked inputs over just the $\mathsf{lex}$ step will reveal (for a given input character) what other characters needed to be inspected in order to generate the containing token. But linked inputs over both steps $\mathsf{parse} \circ \mathsf{lex}$ will pull in all the characters that were inspected in order to generate the containing syntax node. (I’m probably oversimplifying, but something like that should be true.)
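
To make that concrete, here is a toy sketch (purely illustrative; nothing here is Fluid’s actual slicing machinery) in which each stage records, for every output piece, the set of input character indices it inspected, so composing stages unions those sets:

```python
# Toy illustration only: each stage tracks, per output, which input
# character indices were inspected. Not Fluid's actual analyses.

def lex(src):
    """Tokenise digits/operators; each token carries its character deps."""
    tokens, i = [], 0
    while i < len(src):
        j = i
        while j < len(src) and src[j].isdigit():
            j += 1
        if j > i:
            # Lexing a number also peeks at the terminating character.
            deps = set(range(i, min(j + 1, len(src))))
            tokens.append((src[i:j], deps))
            i = j
        else:
            tokens.append((src[i], {i}))
            i += 1
    return tokens

def parse(tokens):
    """Build one infix node; its deps are the union of its tokens' deps."""
    (a, da), (op, dop), (b, db) = tokens
    return f"({op} {a} {b})", da | dop | db

tokens = lex("12+34")
# Over lex alone, character 0 links only to the characters inspected for
# its containing token: {0, 1, 2} (the '+' was peeked at to end the number).
print(tokens[0])      # ('12', {0, 1, 2})
# Over parse . lex, the same character links to everything the containing
# syntax node inspected: {0, 1, 2, 3, 4}.
print(parse(tokens))  # ('(+ 12 34)', {0, 1, 2, 3, 4})
```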

So it might be worth thinking about how these analyses could help someone understand/debug individual steps (or small sequences of steps) in pipelines such as the ones above.

rolyp commented 1 year ago

I wonder if we can fit Bézier curves into the edge detection example (as a subsequent vectorisation step). That doesn’t sound easy but maybe there are standard techniques. I guess what I’m imagining is a transformation step that interprets the image data as something more structured/domain-specific, so we can show the analysis working bidirectionally across that.

rolyp commented 1 year ago

Added stochastic matrices and PCA (Principal Component Analysis) to candidate examples above.

rolyp commented 1 year ago

Added Bayesian Model Averaging (climate science example from Dominic).

JosephBond commented 12 months ago

Dropping the scale-invariant metric for now; it’s too complicated an example. I am currently working on finding an appropriately simple example that involves combining data at multiple resolutions, preferably with locality. I think the basic statistical task of Bézier curve fitting might be inappropriate, as the output curves still depend on their entire inputs in order to best fit the data overall. I’m unsure how we can reconcile this with our current notions of dependency, as the global dependency structures induced by many statistical tasks are proving to be an issue.

JosephBond commented 12 months ago

Some of the multi-scale models I’ve found literature on so far seem quite complex, but there may still be a benefit to considering them, modulo some concerns regarding the models themselves. I am currently investigating whether we can consider the model-mixing algorithms whilst treating the data we use to combine them as static inputs for now; this would obviate the need for potentially complicated probabilistic computation. Bayesian model averaging approaches still induce some sort of global dependency structure, so I think the multi-scale approach might be for the best if we can overcome some of the challenges I’ve already mentioned. Needs a fair amount more investigation, though.

rolyp commented 12 months ago

Added some pointers to time series distance metrics. Symbolic Aggregate Approximation (which turns a time series into a string by quantising it) might be worth looking at, as the string could then be an input to a later processing stage.
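
For reference, a minimal SAX sketch; the segment count and alphabet below are arbitrary illustrative choices, not from any particular library:

```python
# Standard SAX recipe: z-normalise, reduce via piecewise aggregate
# approximation (PAA), then map segment means to letters using
# equiprobable Gaussian breakpoints.
import numpy as np
from scipy.stats import norm

def sax(series, n_segments=8, alphabet="abcd"):
    x = np.asarray(series, dtype=float)
    x = (x - x.mean()) / x.std()  # z-normalise
    means = np.array([s.mean() for s in np.array_split(x, n_segments)])  # PAA
    # Breakpoints cutting N(0, 1) into len(alphabet) equiprobable regions.
    cuts = norm.ppf(np.linspace(0, 1, len(alphabet) + 1)[1:-1])
    return "".join(alphabet[k] for k in np.searchsorted(cuts, means))

# One period of a sine wave quantised to a short string, which could then
# feed into a later (string-based) processing stage.
print(sax(np.sin(np.linspace(0, 2 * np.pi, 64))))
```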

JosephBond commented 12 months ago

Current working example is #765, which we can implement simply and then use as part of a larger pipeline as mentioned above.

rolyp commented 12 months ago

Dropping back to Paused while we work on #765.

rolyp commented 11 months ago

I think we have our example (which we’ll gradually flesh out with real data and other scenarios), so closing this.