explorable-viz / fluid

Data-linked visualisations
http://f.luid.org
MIT License

`▽(⊤)` as ambient query context #949

Closed rolyp closed 2 weeks ago

rolyp commented 7 months ago

As discussed the other day, there may be quite a strong case for restricting interest by default to only inputs that are demanded by at least some of the output (i.e. by ⊤). E.g. unused inputs are automatically hidden/greyed out by default and so unnecessary △▽ or △ queries that query unused inputs can be avoided.

This would already be quite a nice/informative view for certain charts, e.g. if they only visualise data from 2018, then all irrelevant rows would already be deselected.

rolyp commented 7 months ago

I think we should prioritise this after the ICFP submission and aim to get it in the final version of the paper. It’s quite important for describing plausible △ and △▽ queries. Moved to Cambridge/Turing RSE Talks milestone for now.

One idea would be to use a lighter turquoise (rather than grey) shading for secondary selections and reserve grey for unused inputs. Perhaps a better idea would be to keep things as they are and just style unused inputs with a grey border, white background and grey text.

We would probably need a new Sel state corresponding to “unused”. At some point it may make sense to formalise the Boolean algebra for Sel, although (so far) it’s a relatively straightforward generalisation of Bool.

rolyp commented 7 months ago

Will also make it possible to automatically include δ_out into out_expect in the test framework, which will make the test specs a bit less redundant.

rolyp commented 2 months ago

Questions from @T-ab-F. I'll have a stab at answering these below. (Apologies if I'm stating anything that's already clear to you, but I guess it's useful to be as explicit as possible.)

What I would like:

  • Clarification on whether “unused” data should include input database primary keys - i.e. in Ex 1, should selecting BRA highlight its dot?, and if it shouldn’t, does that make BRA unused?
  • (if you have a fast answer) Do we have nice database management tools that would make implementing 'find any candidate key of this data' sufficiently short as not to worry?
  • To compare what the current graphing of dependencies entails vs what I now think it should entail (allowing for database-based functional dependencies also).

I’ve been thinking about what data should count as unused. Using Ex 1 as our main example, a great many rows should (clearly) be unused. However, one could consider adding irrelevant columns (say, avg electricity price per kWh), and these would also be unused, so we can’t identify unused data just on the basis that if anything in a row is needed, then all of it is needed. Moreover, the way that we currently think about which data is dependent makes this task seem to give non-intuitive output (like identifying Brazil as unused); alternatively, we could change what we view as dependent to make identifying unused input far simpler.

I think we want to keep separate the "content" of the dependence graph (what we deem to depend on what) from what we do with the graph. The former is suboptimal in various ways at the moment and something that (as of today) @JosephBond and I have started work on improving. The notion of "unused input" is then a fully derivative notion in the sense that it's determined purely by the content of the graph: if we select every output, which is what ▽(⊤) means, then any inputs which are not reachable from that selection are by definition unused inputs.
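As a minimal sketch of the definition above (not the Fluid implementation), unused inputs can be computed by a reachability pass over a dependence graph. The graph encoding (a dict from each node to the nodes it depends on), the function names and the node names are all hypothetical, chosen only for illustration.

```python
from collections import deque


def demanded_inputs(depends_on, outputs, inputs):
    """Inputs reachable from the selected outputs along 'depends on' edges."""
    seen = set(outputs)
    queue = deque(outputs)
    while queue:
        node = queue.popleft()
        for dep in depends_on.get(node, ()):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen & set(inputs)


def unused_inputs(depends_on, outputs, inputs):
    """Inputs not reachable when every output is selected, i.e. not in ▽(⊤)."""
    return set(inputs) - demanded_inputs(depends_on, outputs, inputs)


# Hypothetical graph: one scatter-plot point built from two 2018 cells.
depends_on = {
    "point_BRA": ["sum_BRA"],
    "sum_BRA": ["BRA_2018_output", "BRA_2018_capacity"],
}
inputs = {"BRA_2018_output", "BRA_2018_capacity", "BRA_2017_output", "BRA_country"}
outputs = {"point_BRA"}

unused = unused_inputs(depends_on, outputs, inputs)
print(sorted(unused))  # ['BRA_2017_output', 'BRA_country']
```

Note that the country cell comes out as unused here only because this toy graph records no dependency on it, which is the same effect discussed in the thread for key-like values.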

We can think of the idea of an "unused row" (or indeed "unused column", see #952) as a further derivative notion: a row (or column) is unused iff all its cells are, and then visually/UI-wise, there is potentially a presentation choice as to whether to do something special with unused rows/columns (e.g. hide them by default). If a "used" row or column (i.e. one which has at least one used cell) is visible, then (as part of this task, i.e. #949) we would want to visually indicate which cells are unused.

One last comment relating to this, we might want to allow the user to change what we view as dependent (e.g. by distinguishing different kinds of dependencies that the user can choose to consider/not consider), which would affect the content of the graph but not the meaning of unused input (or unused row/column).

Q: Should functional dependencies in the database between the input be viewed as dependencies in output?

Illustrative example: Compare {country prefix} and {year} in Ex 1. Δ(either) = ⊥, but neither should obviously be omitted. The country code and year form a candidate key for this database, and so we do have a dependency {country prefix, year} -> {other input data} -> output(⊤).

This isn’t what we have right now? I.e. dependencies between the inputs aren’t represented in the diagram at the end, and so presumably not in the connection either. If 2018 was highlighted, say, then some output should be highlighted - but that’s the price paid for precision.

I think that, if we represent functional dependencies in the input as X -> Y, then the rule ((X -> Y) && Y ∈ ∇⊤) => X ∈ ∇⊤ is necessary.

Now, if the only data that we had in the database from which Ex 1 is drawn was 2018, then the year’s now quite irrelevant, and so output data shouldn’t have a dependency on it (even via functional dependencies on other input data). This is a little annoying, but if the graph is only for 2018 data, and only drawn from a 2018 data database, then that’s not a problem, as presumably the graph could be labelled as such.

For this, then, we could represent the unused data as ¬((∇⊤)∨(a candidate key for anything in ∇⊤)). Deciding which (all?) candidate key(s) should be chosen seems difficult.
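To make the proposal concrete, here is a rough sketch of closing the used set under functional dependencies and taking the complement. It works at the level of column names and reads "Y ∈ ∇⊤" as "some attribute of Y is used"; that reading, the function names and the example columns are assumptions made for illustration, and the sketch does not address which candidate key(s) to choose.

```python
def close_under_fds(used, fds):
    """Smallest superset of `used` such that whenever X -> Y is a functional
    dependency and some attribute of Y is used, all of X is used too."""
    closed = set(used)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if rhs & closed and not lhs <= closed:
                closed |= lhs
                changed = True
    return closed


# Hypothetical columns for the Ex 1 table, with {country, year} as a key.
all_columns = {"country", "year", "output", "capacity", "price"}
used = {"output", "capacity"}  # stands in for ∇⊤
fds = [(frozenset({"country", "year"}),
        frozenset({"output", "capacity", "price"}))]

closed = close_under_fds(used, fds)
unused = all_columns - closed
print(sorted(unused))  # ['price']
```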

Now, we can also think about whether “2018” in the database row is good and a relevant data point to pick, or if we want to say that choosing 2018 in the row actually means 2018-Brazil both, rather than just all instances of 2018? I’m not sure.

This is an interesting question and I’m not sure how to answer it properly yet, except that I do feel (as noted above) that the definition of “unused input” should probably be ¬(∇⊤). The data we have isn’t a database in a formal sense so any “functional dependency” is perhaps best thought of as a contingent fact about how the data is used/interpreted rather than anything strictly represented in the data set.

Whether something is acting as key in the record set thus depends I guess on whether it’s being treated as a key for the purposes of extracting records or joining records across different data sets. Because the way we track dependencies is a bit broken at the moment, there are equality tests for these key-like values that are happening inside the Fluid code but that are not leading to dependencies being established onto those keys. Perhaps when we do a better job of tracking those equality tests, the role of those values as keys will become more apparent and (for example) picking 2018 or BRA in the data set will actually pick out points that depend on it.

Hypothetical to illustrate more, and potentially create another problem: Imagine Ex 1 graph, but with a dropdown above for year of data used, say 2015,2016,2017,2018, giving a different scatter plot for each year. Then do we want the 2018 in the database to be regarded as useful? It’s still a candidate key for the database, but any dependency in the input relation is made moot by this dropdown box. I think that it should be, but it bears thought?

I’ve searched the git history for previous discussions of this issue, but have mostly found references to orphaned output data.

I feel like this scenario (and the one above where we only have 2018 data in the record set in the first place) are both sensitive to “implementation details” which again reinforces this idea that it is how the data is concretely used that determines whether a dependency exists or not. For example, if the data set only included 2018 data, then the scatterplot could be simplified to ignore the 2018 key, at which point it would become irrelevant (but the scatterplot code would be strictly less general in that if you added some non-2018 data back into the data set, it would no longer visualise the correct information). On the other hand it could (redundantly) filter on 2018, which would introduce a “meaningless” dependency, but one which is nevertheless technically present.

For the example with the drop-down, one could imagine an implementation where selecting the drop-down applies a filter which not only selects only (say) 2015 data, but which also does a range-restriction on the result, so that the year field is removed entirely. Then there would be two stages to the pipeline, the filter and then the visualisation applied to the filtered result (with the fixed but implicit year). There would be no year field to depend on at all for the second step, whereas the first step would have a dependency on the year induced by the equality test inside the filter.
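The two-stage pipeline described above might look roughly like this. It is written in Python rather than Fluid, and the record fields, function names and values are invented purely for illustration.

```python
def filter_year(records, year):
    """Stage 1: keep records for `year`, then drop the year field.
    The equality test is where a dependency on the year values would arise."""
    return [
        {k: v for k, v in r.items() if k != "year"}
        for r in records
        if r["year"] == year
    ]


def scatter_points(records):
    """Stage 2: build (x, y) points; there is no year field left to depend on."""
    return [(r["capacity"], r["output"]) for r in records]


# Made-up records, not real data.
records = [
    {"country": "BRA", "year": 2015, "capacity": 10.0, "output": 3.0},
    {"country": "BRA", "year": 2018, "capacity": 14.0, "output": 5.0},
    {"country": "USA", "year": 2015, "capacity": 20.0, "output": 8.0},
]

filtered = filter_year(records, 2015)
points = scatter_points(filtered)
print(points)  # [(10.0, 3.0), (20.0, 8.0)]
```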

I hope the above clarifies the perspective/approach a little bit, but let’s chat more on Thurs. I think what’s obfuscating a bit here is that dependencies on key-like values are currently often lost and some of the questions you are asking might become moot (or less pressing, at least) if we always tracked those properly.

T-ab-F commented 2 months ago

Note that (in ex1, renewable table) categories of clean power creation i.e. bio/solar/wind/hydro are also integral to user interpretation of data, nicely solved by 'include anything in a candidate key', but it feels wrong to say that 'selecting a value done for clean power in Brazil (inc biomass) should then have a secondary dependency on anything that uses any biomass power'. It feels less wrong to say that we should put a dependency down for anything else that involves Brazil, but this isn't the dependency that we are working with at present, i.e. GC, but they are sufficiently illustrative as to make it useful? At the very least, this suggests that if one considers a candidate key to be a dependency (of any kind) worth illustrating, then separate parts of the key should be considered different, but this is heading way too far into "how we should deal with different normalised forms of a database for input/what restrictions could be placed" for the sake of this initial consideration at present.

rolyp commented 2 months ago

Morning @T-ab-F. Today you can probably start looking at the code in App.Util and thinking about what might be involved in introducing the idea of a selection state corresponding to a resource that isn’t used anywhere. Maybe we need a new Unselectable selection state (terminology justified by the fact that a selection of any such element is cancelled by round-tripping through ∇ ∘ Δ). We’d need to think about how this interacts with the notion of persistent/transient selection.
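The cancellation property can be illustrated on a toy dependency relation. The sketch assumes that ∇ maps an output selection to the inputs it needs and that Δ maps an input selection to the outputs needing nothing outside it; the definitions actually used in Fluid may differ, and all names below are hypothetical.

```python
def demanded_by(deps, outs):
    """Assumed ∇: inputs needed by at least one selected output."""
    return set().union(*(deps[o] for o in outs))


def sufficient_for(deps, ins):
    """Assumed Δ: outputs whose needed inputs all lie inside the selection."""
    return {o for o, needed in deps.items() if needed <= ins}


def round_trip(deps, ins):
    """∇ ∘ Δ applied to an input selection."""
    return demanded_by(deps, sufficient_for(deps, ins))


# Hypothetical relation: each output mapped to the input cells it needs.
deps = {
    "bar_2018": {"BRA_2018_output", "USA_2018_output"},
    "bar_2017": {"BRA_2017_output"},
}

# BRA_2016_output is needed by no output, so selecting it does not survive.
print(round_trip(deps, {"BRA_2016_output"}))                     # set()
print(round_trip(deps, {"BRA_2017_output", "BRA_2016_output"}))  # {'BRA_2017_output'}
```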

Once we have a sketch of how to proceed, we can look for a way to add in the new generality without making use of it straight away, i.e. add the new selection state but retain the existing behaviour. That will then be a good starting point for coming to the question of how we want the behaviour to change.

I guess this will start to surface the questions you were considering yesterday about (apparently unused) keys in the tables. For example, all country cells will appear unselectable, but that won’t make a lot of sense, since a reasonable presentation option for unselectable elements would be to hide them entirely, but then it wouldn’t at all be clear how the scatter plot was able to group by country!

T-ab-F commented 2 months ago

Specifications: Why does Ex 2 have a secondary dependency on data in the input, while it shows no primary dependency? Is this a bug (i.e. the display is slightly off), or is this an interesting function of reverse GC? Here, it's true that {country, energytype, year} do determine position in the line graph, but the fact that they do so is not displayed nicely in the line graph (is this another UI thing to consider?).

Need to consider goals:

Button: "show unused data" (in input) or "show unsourced data" (in output)

potential tasks (some in order) for dealing with all childless

rolyp commented 2 months ago

Thanks @T-ab-F. Well spotted re. some data elements being reported as “used” (= not childless) in the table view but not inducing any output selection – I hadn’t noticed that. That I think needs to be treated as a bug in the UI, i.e. if some part of the output is demanding the selected input then “at least” that part of the output must appear selected. (Coarsening on the other hand is permissible as per the x/y components of a point example.)

You may notice that in App.View, the function view (which turns a data value that represents a visualisation into something that can actually be rendered) discards the top-level selection state on that value. So we know that at least one piece of information is being lost. I know for a fact there are other places, for example the helper record in App.Util. I’ll create a bug report for this as it will take a bit of thinking to come up with a design where all selection states are reported (perhaps with some coarsening).

T-ab-F commented 2 months ago

sufficiently simultaneous submission with the above

Childless vs node with hidden children

Indeed: consider if the line graph somehow had fewer variables (i.e. only showed US output in these years in one energy type). Then the year (2014, say) would have no visible descendants in the line graph, and yet still have a dependency present, so it wouldn't turn up as childless, but could be very visually confusing - If I click on this piece of data, it comes up in green, and nothing else happens, but if I click on this piece of data, it comes up in orange and nothing else happens. This doesn't tell me what the green data does.

Therefore:

rolyp commented 2 months ago

I’m not sure I’m following your scenario but it may be another case where the coarsening would apply. We have our underlying Galois connection $g$ describing the data dependencies; the UI effectively presents a “view” of $g$. For that view to be “sound”, it needs to form a Galois connection $g'$ which over-approximates $g$. The bug you spotted is a violation of that, whereas the current treatment of points is an example of an over-approximation that conforms to that principle. I didn’t quite follow your example above but it sounds like it might be another situation where approximation would be needed.

Every non-empty visual output selection must be “underwritten” by a non-empty visual input selection, and vice-versa. Over-approximation is always permitted but obviously compromises precision so needs to be used only where really needed!
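One simplified way to picture the soundness condition is to treat both $g$ and the view as dependency relations and require the view to lose nothing. The sketch below lifts fine-grained dependencies to visual elements (as with the x/y components of a point) and checks the inclusion; the set-inclusion test is a stand-in for "forms a Galois connection that over-approximates $g$", and every name in it is hypothetical.

```python
def coarsen(deps, part_of):
    """Lift fine-grained deps (output part -> inputs) to visual elements by
    giving each element the union of its parts' inputs."""
    view = {}
    for part, needed in deps.items():
        view.setdefault(part_of[part], set()).update(needed)
    return view


def over_approximates(view, deps, part_of):
    """True iff every fine-grained dependency is still visible in the view."""
    return all(needed <= view.get(part_of[part], set())
               for part, needed in deps.items())


# Hypothetical example: a point whose x and y components use different cells.
deps = {"p1.x": {"capacity_BRA"}, "p1.y": {"output_BRA"}}
part_of = {"p1.x": "p1", "p1.y": "p1"}

view = coarsen(deps, part_of)
print(over_approximates(view, deps, part_of))   # True: coarser but sound

# A view that drops the dependency of p1.y loses information, which is the
# kind of violation reported as a bug above.
lossy = {"p1": {"capacity_BRA"}}
print(over_approximates(lossy, deps, part_of))  # False
```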

T-ab-F commented 2 months ago

Ah, this was a "the current example that we have produces secondary data on selection of a node that doesn't produce any selection in the output, and I can construct a plausible example where such node would produce no secondary data", or g' could overapproximate to a degree where there is no context for the node selected. (i.e. in Ex 2, we have {year, energytype, country} all mutually secondarily selected if one is selected, which is interpretable, but if we had only {year} in that type, I would fear that it would show up as childless? ->

We can think about either a) whether we assign childlessness based on g' or g, and if it is g', whether this could be a problem: we only have an example of pseudochildless data with secondary highlighting, and one could consider a case lacking secondary highlighting (which is what I was considering here). I think it's g, but I'm not sure. Or b) g' should be a better approximation (which is your consideration, and seems like a much better decision). The principle that “every non-empty visual output selection must be ‘underwritten’ by a non-empty visual input selection, and vice-versa” means that over-approximating is sufficiently dangerous. I will think more.

rolyp commented 2 months ago

Hi @T-ab-F, few quick thoughts before we catch up in person:

rolyp commented 2 months ago

Hi @T-ab-F, to summarise/elaborate on the plan to move forward:

Feel free to ping me as soon as you run into any questions/problems.

rolyp commented 1 month ago

Hi @T-ab-F. I’m adding some subtasks to the issue body above, to track the remaining things we want to do prior to merging to develop. I’ll have a think about the “γ0 in linked inputs mode” thing this morning, and also try to spend some time on the stack overflow problem.

rolyp commented 1 month ago

Morning @T-ab-F. I’ve refined the task list above to split into low-hanging fruit/more involved tasks, and also to flesh out some of the details of the low-hanging fruit.

Unfortunately when I run your branch on my machine, I get a stack overflow at the use of foldM inside GraphImpl.inMap (just in the web app, not considering the tests). I did fix this particular problem the other week in a local branch, but annoyingly I think I accidentally deleted it when re-cloning the repo 🤦. I’ll reapply that fix this morning on a copy of your branch, and then we can sync up. (To be clear, this didn’t resolve the stack overflow in linked-outputs/bar-chart-line-chart but moved it somewhere else, but I’m hoping it will eliminate at least one potential source of the problem and allow the web app to run on my machine.)

rolyp commented 1 month ago

Hi @T-ab-F. I’ve added a rough plan to the task list above of yesterday’s idea of dropping γ0 from Fig, and maintaining the set of Inert nodes in γ instead (and similarly for v0). Once we’ve done this I think we’ll be closer to being able to inline SelState into ReactState, since the latter becomes the “currency” we’re working with everywhere.

(Hopefully this will also address the requirement for γ0 to be ambient context for linked inputs as well as linked outputs, since the subset of input (and output) nodes considered Inert will be maintained as an invariant from the outset.)

rolyp commented 3 weeks ago

@T-ab-F I’m capturing some final todo’s in the top of the issue so I can keep track of things! I’ve also merged develop into my copy of your branch. Perhaps we can work on this together for the next couple of days to make sure we get to the bottom of some of the things that are still perplexing? (In which case maybe we can take my copy of your branch as a starting point – but let’s decide Tues morning.)

rolyp commented 2 weeks ago

Morning @T-ab-F. I’ve come around to your position on the isNone Inert issue so I’ve struck that from the list of finalisation tasks.