janagombitova commented 5 years ago

Context

Why do we add this issue?

Bar charts are the most used visualisation type in Lumen, and their use is growing beyond the standard case of analysing two variables to include more variables in one visualisation. We add this issue to support cases where the current stacking (splitting) of bars does not work because of the structure of the dataset.

Problem or idea

What is the current status quo

Currently, you can analyse three variables in one bar chart using stacking. This works really well if you select a text column as the third variable. The buckets get split by the unique values. It is magical.

Screen Shot 2019-05-16 at 14 30 01 (2)

But for number columns, we treat the numbers in the same way as text and split or stack the bars by the unique values. This does not work.

Screen Shot 2019-05-16 at 14 44 47

Example A

If we want a graph that shows the number of men and women by activity, where activity, men and women are three separate columns, we need to create two graphs: one showing the number of women per activity and a second one showing the number of men per activity.

Ideally, you want the graph to show men and women as separate bars (or stacked) per activity in one bar chart.

Screen Shot 2019-05-16 at 14 28 27

Example: https://docs.google.com/spreadsheets/d/1kQyNeHw9WcI79-3ZJWoS_W3wuWWfRUKok8cHX5K_t9s/edit#gid=0

Example B

With this dataset I track user adoption of the new data transformations we create. The assumption is that the use of derived columns will decrease and the use of the new transformations will increase. Each transformation is one column and there is one row per month. See here: https://akvointernal.akvolumen.org/dataset/5d14c8d1-89c9-443e-92c4-661d095c9369

Today I need to create a separate visualisation per transformation. Ideally, I want to be able to see a bar chart with the usage numbers of derived columns and category columns next to each other per month.

If this change is successful what might we observe?

even more use of bar charts
but actually, fewer bar charts, as what now requires two charts can be shown in one

Solution or next step

How could we solve it?

Could we handle number columns differently in stacking to support this case to be able to end up with something like this?

Screen Shot 2019-05-16 at 14 42 10

How will this benefit the users?

They will have less work as they do not need to create more charts (picture you have over 10 indicators where you compare men and women all the time).

How will it benefit Akvo?

The first impression Lumen makes is positive: it is seen as a simple to use product. If we get this right we can further simplify a more complex case, which can result in more advanced users choosing to work with Lumen.

Kiarii commented 5 years ago

In response to Case A, it feels that the NUM data type can only be meaningfully used as a sub-bucketing option if it contains a few unique values; otherwise the user would expect options to define or limit the values. Furthermore, Case A would primarily require that the Gender data exist in the same col, since sub-bucketing only refers to a single col. Secondarily, it would be possible to reference a group of columns when sub-bucketing, which is exactly the challenge we need to look at with MOQs and RQGs. It would however be best if we storymap this in order to also handle these (multiple option question and repeat question group data) and figure out the best way forward.

Kiarii commented 5 years ago

Case B - I actually got an error on attempting to load the dataset with the link "Failed to fetch dataset" - but if I understand its context correctly, it feels very much like Case A; the need to split bars; figuring out A thus will solve B

janagombitova commented 5 years ago

@Kiarii Indeed, A and B are not two separate cases, just separate examples of one situation. Overall I also agree we should discuss this issue with the team. But just to put my thoughts down:

In response to Case A, it feels that the NUM data type can only be meaningfully used as a sub-bucketing option if it contains a few unique values.. else the user would expect options to define or limit the values.

The way NUM columns are handled as sub buckets simply does not work, as you rarely have only a few unique values in the column.

Furthermore, case A would primarily require that the Gender data exist in the same col - since sub-bucketing only refers to a single col. Secondarily, it would be possible to reference a group of columns when sub-bucketing...

If we had the data arranged in this way, this issue would be irrelevant as you would be able to visualise the data. The use case is exactly when you want to use more than one NUM column as a sub bucket.

which is exactly the challenge we need to look at with MOQs and RQGs

I agree with the possible parallel with RQGs if we consider the RQG a separate data table. Then we can get the same situation. But I do not see the overlap with multiple options, as the datatype is different (not NUM but TEXT) and the data structure is also different: here you have one value per cell, whereas with MOQs there are more values in one cell.

I created an example based on IUCN data here https://docs.google.com/spreadsheets/d/1kQyNeHw9WcI79-3ZJWoS_W3wuWWfRUKok8cHX5K_t9s/edit#gid=0 showing how Google Sheets handles this type of situation. This example is also the reason why we started this issue, as this is a regular data structure our partners collect.

Screen Shot 2019-09-09 at 14 15 07

Kiarii commented 5 years ago

@janagombitova

with MOQs there are more values in one cell

this is the case as long as the data hasn't been 'transformed'; we have been talking about splitting the data so it is viewable on the grid - thus my thinking..

I created an example based on IUCN data

I think your example highlights what we need to figure out. I am of course getting a bit lost figuring out how the aggregation ought to work: should it be applicable to both axes, and does it belong in the X or the Y? At the moment in Lumen we have it in the Y, but bucketing & sub-bucketing are on the X, where it would appear we ought to have another aggregation (my thinking hurts :)

janagombitova commented 5 years ago

A few examples

The Adventure Project

Issue: The user asks for the number of students at schools for project X. The user asks this question at the start of the project (baseline) and at the end of the project (endline). Now the user wants to make a visualisation (e.g. a bar graph) showing two bars: 1) # of students at the start and 2) # of students at the end. Here is where the issue occurs: currently, datasets in Lumen (after merging baseline with endline) look like this:

School Name	# of students (baseline)	# of students (endline)
School 1	30	40
School 2	32	50

LAC Programme

Attendance of the following groups has been recorded through separate questions:

TOTAL, HOMBRES ADULTOS, MUJERES ADULTAS, JOVENCITAS y JOVENCITOS, NIÑAS y NINOS

Therefore, each demographic has a separate column. When I attempt to create a bar graph showing one variable (Type of Intervention) with sub-buckets to display a breakdown of the demographics listed above, the chart counts the number of times a session with a given number of attendees was recorded (https://lacprogram.akvolumen.org/s/QcoHflPp3TA) rather than showing how many people in total from a demographic, e.g. HOMBRES ADULTOS, attended each type of session.

Screen Shot 2019-09-17 at 09 44 29
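
To make the difference concrete, here is a minimal sketch of the two behaviours described above. The rows and numbers are invented for illustration and the code only mimics the effect; it is not how Lumen computes its charts.

```python
from collections import Counter, defaultdict

# Hypothetical rows: (type of intervention, adult men attending one session)
rows = [
    ("Workshop", 12),
    ("Workshop", 12),
    ("Workshop", 8),
    ("Home visit", 3),
]

# Sub-bucketing on a NUM column: one sub-bucket per unique value,
# counting how many sessions recorded that attendance figure.
counts_by_value = defaultdict(Counter)
for intervention, men in rows:
    counts_by_value[intervention][men] += 1

# What the user actually wants: total attendance per intervention.
totals = defaultdict(int)
for intervention, men in rows:
    totals[intervention] += men

print(dict(counts_by_value["Workshop"]))  # sessions per attendance value
print(dict(totals))                       # total men per intervention
```

With many distinct attendance figures, the first result produces one sub-bucket per figure, which is why the chart becomes unreadable.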

Kiarii commented 5 years ago

To further the problem understanding, I had a chat with Carmen, our in-house data scientist, with Eric (partner team) providing his experience. In the chat, we answered the following questions:

Is data structure the issue?

According to Carmen, the data structure is not ideal. It resembles a pivot table, which needs an unpivoting transform to reach the ideal structure, which would look something like the following
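
As a rough illustration of such an unpivoting transform (not the figure from the original comment), the sketch below turns one row per activity with separate men and women columns into one row per activity and gender. The `unpivot` helper, the column names and the numbers are all hypothetical and are not part of Lumen.

```python
def unpivot(rows, id_column, value_columns, variable_name, value_name):
    """Turn one row per id with several value columns into
    one row per (id, value column) pair."""
    long_rows = []
    for row in rows:
        for column in value_columns:
            long_rows.append({
                id_column: row[id_column],
                variable_name: column,
                value_name: row[column],
            })
    return long_rows


# Wide ("pivoted") structure: one NUM column per gender.
wide = [
    {"activity": "Training", "men": 10, "women": 15},
    {"activity": "Field day", "men": 7, "women": 4},
]

# Long structure: gender becomes a TEXT column that can be used as a sub-bucket.
long_rows = unpivot(wide, "activity", ["men", "women"], "gender", "participants")
for row in long_rows:
    print(row)
```

In the long structure the existing TEXT sub-bucketing would already work, which is why the structure is considered ideal.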

Should Lumen handle the data structure nonetheless?

Yes, primarily because data structure, being dependent on survey design or data source, will never be ideal, and because the fewer steps a user needs to take to go from data to visualization, the easier their user journey. In the example above, unpivoting the data might require the creation of a sub-dataset in which the structure can then be changed from relation to the 'larger' dataset.

Are aggregations applicable to both X and Y axis?

Yes. In most GUI-based chart editors this is implicit, but Kibana (Elastic) shows it explicitly.

Proposal

Pending storymapping of course :)

We decided that ways of selecting multiple sub-buckets should be explored as the first solution, from which we may learn more and iterate
In regards to MOQs, when we get to it, we thought that Lumen should by default 'know' the relationship between MOQ cols, i.e. if the user selects one of the columns for a sub-bucket, the others should be added as sub-bucket cols too.

Kiarii commented 5 years ago

In order to create some consistency, the "Advanced" section should be dedicated to chart labeling and legend options for both X and Y axes; the overarching objective being to un-bundle Sort and Sub-bucket options from the "Advanced" options of the X-axis; this would also make it easier for the user to discover and use these; (the sketch showcases the idea, we would need to decide whether "Advanced" is the bottom most part in the X-axis)

For Datasets "Advanced" should be relabeled "Filters" to use a more direct language and keep the aforementioned intent;

janagombitova commented 5 years ago

After discussing the issue today with @Kiarii, @kardan and @tangrammer, looking at the case, the research and how Salim would go about creating such charts we realised the following:

Case

Salim wants to create a bar chart where he wants to see the number of men and women per activity.

If he wanted to see the number of men per activity, he would define the x-axis to be based on activity (TEXT column) and the y-axis metric to be based on men (NUM). He would select how he wants the men data to be aggregated (does he want to see the SUM of men, or the MEAN, thus the average number of men per activity?). But because he also wants to see the number of women per activity, he would add another column to the y-axis. Then the men data and the women data are handled by the same aggregation method per activity, resulting in him seeing 2 bars (men and women) per activity.
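
The case can be sketched as follows: rows are bucketed by the x-axis column, and every selected y-axis column is aggregated with the same method, giving one bar per column per bucket. The function name `aggregate_series`, the aggregation names and the data are assumptions made for this example, not Lumen's implementation.

```python
from collections import defaultdict
from statistics import mean

# Aggregation methods the user can pick; the same one applies to every metric column.
AGGREGATIONS = {"sum": sum, "mean": mean, "count": len}


def aggregate_series(rows, bucket_column, metric_columns, aggregation):
    """Return {bucket: {metric_column: aggregated value}}."""
    grouped = defaultdict(lambda: defaultdict(list))
    for row in rows:
        for column in metric_columns:
            grouped[row[bucket_column]][column].append(row[column])
    aggregate = AGGREGATIONS[aggregation]
    return {
        bucket: {column: aggregate(values) for column, values in columns.items()}
        for bucket, columns in grouped.items()
    }


rows = [
    {"activity": "Training", "men": 10, "women": 15},
    {"activity": "Training", "men": 6, "women": 9},
    {"activity": "Field day", "men": 7, "women": 4},
]

sums = aggregate_series(rows, "activity", ["men", "women"], "sum")
means = aggregate_series(rows, "activity", ["men", "women"], "mean")
print(sums)
print(means)
```

Each inner dictionary corresponds to the group of bars drawn for one activity.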

Our misunderstanding

We initially thought that the case should be handled in what we call Sub-buckets in Lumen. But that is a different thing as you do not define the aggregation method on the x-axis but on the y-axis.

Changes to Lumen

Thus we want to:

[ ] Allow Salim to select more than one NUM column for the y-axis
[x] Limit current Lumen sub-bucketing to only TEXT columns https://github.com/akvo/akvo-lumen/issues/2389
[ ] If Salim selects more than one y-axis column, do not allow him to define a sub-bucket on the x-axis to keep things simple for the start
[x] See how to reorganise items in the visualisation editor to ensure that settings that apply to the chart view are shown in one place, and settings that apply to the x-axis and y-axis are shown in those parts of the editor. https://github.com/akvo/akvo-lumen/issues/2359
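
The first three items in this list amount to a small set of rules on the chart configuration. A minimal sketch of those rules, assuming a hypothetical `validate_bar_chart` function and a plain mapping from column name to column type (neither exists in Lumen under these names):

```python
def validate_bar_chart(column_types, metric_columns, sub_bucket_column=None):
    """Return a list of rule violations for a bar chart configuration.

    column_types maps column name -> "NUM" or "TEXT" (hypothetical shape).
    """
    errors = []
    # Y-axis metric columns must be NUM columns.
    for column in metric_columns:
        if column_types[column] != "NUM":
            errors.append(f"y-axis column '{column}' must be NUM")
    if sub_bucket_column is not None:
        # Sub-bucketing is limited to TEXT columns.
        if column_types[sub_bucket_column] != "TEXT":
            errors.append(f"sub-bucket column '{sub_bucket_column}' must be TEXT")
        # More than one y-axis column and a sub-bucket are mutually exclusive.
        if len(metric_columns) > 1:
            errors.append("sub-bucket is not allowed with more than one y-axis column")
    return errors


types = {"activity": "TEXT", "region": "TEXT", "men": "NUM", "women": "NUM"}
print(validate_bar_chart(types, ["men", "women"]))
print(validate_bar_chart(types, ["men", "women"], "region"))
```

Returning a list of messages rather than raising on the first problem is only one possible design; it lets an editor show all violations at once.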

Kiarii commented 5 years ago

https://invis.io/WQUHJW1CT38#/389619771_1-0_X_-_Sub-Bucketing Here are the mockups illustrating how the selection of multiple metric cols for the Y axis will work. Do not change how the UI looks and feels, just move things around and add the following changes:

Add option to select another column in the Y axis
Using multiple cols in the Y disables sub-buckets in the X and vice-versa, as shown in the mockups.
in the Y, aggregations are the first field; Count as default (as today) and Mean as the next default once some column is selected
with the Y axis the [X] is used differently than the ones we have; here it removes the entire field (not the selected col name as is the current case) and it should be implemented as illustrated in the mockups

I am available to discuss any questions/comments..

tangrammer commented 4 years ago

Once #2421 is merged/deployed we'll need to use the feature flag "?series-bar=1" to see the new changes

janagombitova commented 4 years ago

After testing the implementation on lumen.akvotest.org under the feature flag ?series-bar=1 I found a few small improvements that can be made to the implementation (also based on the mock-ups):

Remove No series --> Salim will not really understand what this means and it is just an extra in the UI not really adding a lot of information or benefit.
Once I select the 2nd metric column I do not have the option to remove the 1st one. Can we allow that? So if there is only one column left, then we do not allow to remove it but once there are two and more I can remove any of the columns.

janagombitova commented 4 years ago

With Juan we agreed to take my 2nd point out from my previous comment. We will push out the change to learn from users if that "issue" is even an issue and then make it into a separate issue that can be handled in the future.

So to push this feature to users we just need to remove the text field No series and handle the feature flag

janagombitova commented 4 years ago

Lumen - visualisations - bars with multiple metric columns

Works like a charm!

janagombitova commented 4 years ago

After testing the small change @tangrammer pushed to dark-test we agreed to not push the change as then users need to guess/know they can select more metric columns. Now it is nice and obvious with the button just being there in your face. Thus this task is done.

akvo / akvo-lumen

Allow to select more than one NUM column on Y-axis for bar charts #2131
