Closed janagombitova closed 4 years ago
In response to Case A, it feels that the NUM data type can only be meaningfully used as a sub-bucketing option if it contains a few unique values; otherwise the user would expect options to define or limit the values. Furthermore, Case A would primarily require that the Gender data exist in the same col, since sub-bucketing only refers to a single col. Secondarily, it would be possible to reference a group of columns when sub-bucketing, which is exactly the challenge we need to look at with MOQs and RQGs. It would however be best if we can storymap this in order to also handle these (multiple option question and repeat question group data) and figure out the best way forward.
Case B: I actually got an error ("Failed to fetch dataset") when attempting to load the dataset from the link, but if I understand its context correctly, it feels very much like Case A: the need to split bars. Figuring out A will thus solve B.
@Kiarii Indeed, A and B are not two separate cases, just separate examples of one situation. Overall I also agree we should discuss this issue with the team. But just to put my thoughts down:
> In response to Case A, it feels that the NUM data type can only be meaningfully used as a sub-bucketing option if it contains a few unique values; otherwise the user would expect options to define or limit the values.
The way NUM columns are handled as sub buckets simply does not work, as you rarely have only a few unique values in the column.
> Furthermore, case A would primarily require that the Gender data exist in the same col - since sub-bucketing only refers to a single col. Secondarily, it would be possible to reference a group of columns when sub-bucketing...
If we had the data arranged in this way, this issue would be irrelevant as you would be able to visualise the data. The use case is exactly when you want to use more than one NUM column as a sub bucket.
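The arrangement described here, with the group label in one column and the number in another, can be pictured as a wide-to-long reshape. The following is only a sketch in Python: the column names (`activity`, `men`, `women`) and the helper `wide_to_long` are hypothetical and are not part of Lumen.

```python
def wide_to_long(rows, id_column, value_columns, key_name="group", value_name="count"):
    """Turn one row with several NUM columns into one row per (id, column) pair."""
    long_rows = []
    for row in rows:
        for col in value_columns:
            long_rows.append(
                {id_column: row[id_column], key_name: col, value_name: row[col]}
            )
    return long_rows


# Hypothetical wide dataset: one NUM column per group.
wide = [
    {"activity": "Training", "men": 12, "women": 8},
    {"activity": "Meeting", "men": 4, "women": 5},
]

long_rows = wide_to_long(wide, "activity", ["men", "women"])
for r in long_rows:
    print(r)
```

In the long form, `group` is a TEXT column with two unique values, so splitting bars by it would behave like any other text-based sub-bucket.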
> which is exactly the challenge we need to look at with MOQs and RQGs

I agree with the possible parallel with RQGs if we consider the RQG a separate data table. Then we can get the same situation. But I do not see the overlap with multiple options, as the data type is different (not NUM but TEXT) and the data structure is also different: here you have one value per cell, whereas with MOQs there are more values in one cell.
I created an example based on IUCN data here https://docs.google.com/spreadsheets/d/1kQyNeHw9WcI79-3ZJWoS_W3wuWWfRUKok8cHX5K_t9s/edit#gid=0 showing how Google Sheets handles this type of situation. This example is also the reason why we started this issue, as this is a regular data structure our partners collect.
@janagombitova
> with MOQs there are more values in one cell
this is the case as long as the data hasn't been 'transformed'; we have been talking about splitting the data so it is viewable on the grid - thus my thinking..
> I created an example based on IUCN data

I think your example highlights what we need to figure out. I am of course getting a bit lost figuring out how the aggregation ought to work: should it be applicable to both axes, and does it belong on the X or the Y? At the moment in Lumen we have it on the Y, but bucketing and sub-bucketing on the X, where it would appear we ought to have another aggregation (my thinking hurts :)
Issue: A user asks for the number of students at schools for project X. The user asks this question at the start of the project (baseline) and at the end of the project (endline). Now the user wants to make a visualization (e.g. a bar graph) showing two bars: 1) # of students at the start and 2) # of students at the end. Here is where the issue occurs. Currently, datasets in Lumen (after merging baseline with endline) look like this:
School Name | # of students (baseline) | # of students (endline) |
---|---|---|
School 1 | 30 | 40 |
School 2 | 32 | 50 |
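As a rough illustration of the two bars the user is after, the sketch below sums each NUM column over all rows. It is written in Python with the two rows from the table above; the key names and the helper `column_totals` are assumptions made for the example, not Lumen behaviour.

```python
# Rows copied from the table above; the dictionary keys are illustrative names.
rows = [
    {"school": "School 1", "baseline": 30, "endline": 40},
    {"school": "School 2", "baseline": 32, "endline": 50},
]


def column_totals(rows, metric_columns):
    """Return one bar per metric column: the SUM of that column over all rows."""
    return {col: sum(row[col] for row in rows) for col in metric_columns}


bars = column_totals(rows, ["baseline", "endline"])
print(bars)  # {'baseline': 62, 'endline': 90}
```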
Attendance of the following groups has been recorded through separate questions:
TOTAL, HOMBRES ADULTOS, MUJERES ADULTAS, JOVENCITAS y JOVENCITOS, NIÑAS y NINOS
Therefore, each demographic has a separate column. When I attempt to create a bar graph showing one variable (Type of Intervention) and then sub-buckets to display a breakdown of the demographics listed above, a count is generated regarding the number of times a session with a certain amount of attendees was recorded (https://lacprogram.akvolumen.org/s/QcoHflPp3TA) rather than demonstrating how many (total) of the demographic Hombre Adultos attended each type of session.
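The difference between what is generated and what is wanted can be sketched as follows. The session records are invented for illustration, and both helper functions are hypothetical; the first mimics the reported behaviour (counting rows per distinct number), the second computes the total the user expects.

```python
from collections import Counter, defaultdict

# Hypothetical session records: (type of intervention, HOMBRES ADULTOS attendees).
sessions = [
    ("Workshop", 10),
    ("Workshop", 10),
    ("Workshop", 25),
    ("Home visit", 3),
    ("Home visit", 3),
]


def count_by_unique_value(records):
    """Reported behaviour: count how many sessions share each attendee number."""
    counts = defaultdict(Counter)
    for intervention, attendees in records:
        counts[intervention][attendees] += 1
    return {key: dict(value) for key, value in counts.items()}


def total_per_bucket(records):
    """Expected behaviour: total attendees per type of intervention."""
    totals = defaultdict(int)
    for intervention, attendees in records:
        totals[intervention] += attendees
    return dict(totals)


print(count_by_unique_value(sessions))
print(total_per_bucket(sessions))
```

With this sample, the first result says "two workshops had 10 attendees and one had 25", while the second says "45 men attended workshops in total", which is the figure the chart was meant to show.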
To further our understanding of the problem, I had a chat with Carmen, our in-house data scientist, with Eric (partner team) providing his experience. In the chat, we answered the following questions:
Yes, it appears in most GUI based chart editors, it is implicit, but Kibana Elastic explicitly shows this.
Pending storymapping of course :)
In order to create some consistency, the "Advanced" section should be dedicated to chart labeling and legend options for both X and Y axes. The overarching objective is to unbundle the Sort and Sub-bucket options from the "Advanced" options of the X-axis, which would also make it easier for the user to discover and use them. (The sketch showcases the idea; we would need to decide whether "Advanced" is the bottommost part of the X-axis.)
For Datasets, "Advanced" should be relabeled "Filters" to use more direct language and keep the aforementioned intent.
After discussing the issue today with @Kiarii, @kardan and @tangrammer, looking at the case, the research and how Salim would go about creating such charts we realised the following:
Salim wants to create a bar chart where he wants to see the number of men and women per activity.
If he wanted to see the number of men per activity, he would define the x-axis to be based on activity (TEXT column) and the y-axis metric to be based on men (NUM). He would select how he wants the men data to be aggregated (does he want to see the SUM of men, or the MEAN, that is, the average number of men per activity?). But because he also wants to see the number of women per activity, he would add another column to the y-axis. Then the men data and the women data are handled by the same aggregation method per activity, resulting in him seeing 2 bars (men and women) per activity.
We initially thought that the case should be handled in what we call Sub-buckets in Lumen. But that is a different thing, as you do not define the aggregation method on the x-axis but on the y-axis.
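To make the idea concrete, here is a small Python sketch of several y-axis metric columns sharing one aggregation method per x-axis bucket. The data, the column names and the function `aggregate_series` are hypothetical and only illustrate the reasoning above; they do not describe how Lumen implements it.

```python
from collections import defaultdict
from statistics import mean

# Aggregation methods the user could pick for the y-axis (illustrative subset).
AGGREGATIONS = {"sum": sum, "mean": mean}


def aggregate_series(rows, x_column, metric_columns, method="sum"):
    """Group rows by the x-axis column and aggregate every metric column
    with the same method, giving one bar per metric column per bucket."""
    aggregate = AGGREGATIONS[method]
    grouped = defaultdict(lambda: defaultdict(list))
    for row in rows:
        for col in metric_columns:
            grouped[row[x_column]][col].append(row[col])
    return {
        bucket: {col: aggregate(values) for col, values in cols.items()}
        for bucket, cols in grouped.items()
    }


# Hypothetical dataset: activity is TEXT, men and women are NUM.
rows = [
    {"activity": "Training", "men": 12, "women": 8},
    {"activity": "Training", "men": 6, "women": 10},
    {"activity": "Meeting", "men": 4, "women": 5},
]

print(aggregate_series(rows, "activity", ["men", "women"], method="sum"))
print(aggregate_series(rows, "activity", ["men", "women"], method="mean"))
```

Each activity ends up with two values, one for `men` and one for `women`, which corresponds to the two bars per activity Salim expects.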
Thus we want to:
https://invis.io/WQUHJW1CT38#/389619771_1-0_X_-_Sub-Bucketing Here are the mockups illustrating how the selection of multiple metric cols for the Y axis will work. Do not change how the UI looks and feels, just move things around and add the following changes:
I am available to discuss any questions/comments..
Once #2421 is merged and deployed, we'll need to use the feature flag "?series-bar=1" to see the new changes.
After testing the implementation on lumen.akvotest.org under the feature flag ?series-bar=1
I found a few small improvements that can be made to the implementation (also based on the mock-ups):
No series
--> Salim will not really understand what this means and it is just an extra in the UI not really adding a lot of information or benefit. With Juan we agreed to take my 2nd point out from my previous comment. We will push out the change to learn from users if that "issue" is even an issue and then make it into a separate issue that can be handled in the future.
So to push this feature to users we just need to remove the text field "No series" and handle the feature flag.
Works like a charm!
After testing the small change @tangrammer pushed to dark-test we agreed to not push the change as then users need to guess/know they can select more metric columns. Now it is nice and obvious with the button just being there in your face. Thus this task is done.
Context
Why do we add this issue?
Bar charts are the most used visualisation type in Lumen. Their use is growing beyond the standard of analysing two variables, towards showing more variables in one visualisation. We add this issue to support cases where the current stacking (splitting) of bars does not work because of the structure of the dataset.
Problem or idea
What is the current status quo
Currently, you can analyse three variables in one bar chart using stacking. This works really well if you select as the 3rd variable a text column. The buckets get split by the unique values. It is magical.
But for number columns, we treat the numbers in the same way as text and split or stack the bars by the unique values. This does not work.
Example A
If we want to make a graph that splits the number of men and women by activity, where activity, men and women are three separate columns, we need to create two graphs: one showing the number of women per activity and a second one showing the number of men per activity.
Ideally, you want the graph to show men and women as separate bars (or stacked) per activity in one bar chart.
Example: https://docs.google.com/spreadsheets/d/1kQyNeHw9WcI79-3ZJWoS_W3wuWWfRUKok8cHX5K_t9s/edit#gid=0
Example B
With this dataset I track user adoption of the new data transformations we create, as the assumption is that the use of derived columns will decrease and the use of new transformations will increase. Each transformation is one column and the rows are per month. See here: https://akvointernal.akvolumen.org/dataset/5d14c8d1-89c9-443e-92c4-661d095c9369
Today I need to create a separate visualisation per transformation. Ideally, I want to be able to see a bar chart with the usage number of derived columns and category column next to each other per month.
If this change is successful what might we observe?
Solution or next step
How could we solve it?
Could we handle number columns differently in stacking to support this case to be able to end up with something like this?
How will this benefit the users?
They will have less work as they do not need to create more charts (picture you have over 10 indicators where you compare men and women all the time).
How will it benefit Akvo?
The first impression Lumen makes is positive: it is seen as a simple-to-use product. If we get this right, we can further simplify a more complex case. This can result in more advanced users choosing to work with Lumen.