Devographics / surveys

YAML config files for the Devographics surveys

How to help with data normalization #230

Open SachaG opened 7 months ago


Video Overview: https://www.youtube.com/watch?v=Sa2i5lYgmT8

What is this about?

As part of the State of JS, HTML, CSS, etc. surveys, we collect a lot of freeform data. In other words, data collected through plain text fields such as this:

Screenshot 2024-04-06 at 6 56 58

As opposed to questions with predefined options such as this one:

Screenshot 2024-04-06 at 6 57 38

This means that before we can visualize this freeform data, we need to normalize it down to a set of canonical tokens. For example, we need to define that the answers "File upload is hard" and "Uploading files sucks!" both correspond to the same file_uploading_issues token, even though their raw strings differ.

What does the whole process look like?

This is basically a four-step process (with a lot of going back and forth between steps).

1. Defining Tokens

First, we need to define the tokens that we will normalize towards. It helps a lot to have domain knowledge about the question topic (such as "forms pain points"), and looking at the raw dataset will also help reveal the most obvious patterns.

Screenshot 2024-04-06 at 7 13 12

Our custom-built internal normalization dashboard highlights when the same string appears multiple times in a row, to make it easier to detect recurring patterns.
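To give a rough idea of what "the same string in a row" detection can look like, here is a minimal Python sketch. It is not the dashboard's actual code: the function name `repeated_runs`, the `min_run` parameter, and the trim-and-lowercase comparison are all assumptions made for illustration.

```python
from itertools import groupby


def repeated_runs(answers, min_run=2):
    """Return (normalized answer, run length) for each consecutive repeat.

    Assumption: answers are compared after trimming whitespace and
    lowercasing; the real dashboard may compare strings differently.
    """
    runs = []
    for key, group in groupby(answers, key=lambda s: s.strip().lower()):
        length = sum(1 for _ in group)
        if length >= min_run:
            runs.append((key, length))
    return runs
```

Since only consecutive duplicates are grouped, sorting the answers first would surface every repeated string rather than just adjacent ones.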

2. Regular Expression Matching

Once we have a good idea of the most common tokens we want to match, we can define regular expressions to do so. This is done in YAML files:

- id: low_priority
  parentId: a11y_company_issues
  patterns:
    - buy(-| )in
    - don't care
    - don't prioritize
    - don't value

Here, we define a new low_priority token (in this case, for the "accessibility pain points" question) that is matched whenever the system encounters a string such as "buy-in", "buy in", or "don't care".
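For readers who want to test patterns locally before adding them to a YAML file, the matching step can be approximated in a few lines of Python. This is only a sketch: the token definition is copied from the example above, but the `match_tokens` helper and the case-insensitive search are assumptions, not the actual Devographics implementation.

```python
import re

# Token definition mirroring the YAML example (same id, parentId and patterns).
TOKENS = [
    {
        "id": "low_priority",
        "parentId": "a11y_company_issues",
        "patterns": [
            r"buy(-| )in",
            r"don't care",
            r"don't prioritize",
            r"don't value",
        ],
    },
]


def match_tokens(answer, tokens=TOKENS):
    """Return (token id, pattern) pairs for every pattern found in the answer."""
    matches = []
    for token in tokens:
        for pattern in token["patterns"]:
            # Assumption: matching ignores case; the real system may differ.
            if re.search(pattern, answer, flags=re.IGNORECASE):
                matches.append((token["id"], pattern))
    return matches
```

Calling `match_tokens("No buy in from leadership")` returns the `low_priority` token along with the pattern that triggered it, which makes it easy to see why a given answer was matched.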

Also note the parentId property. Tokens can have parents, which in this case means that the low_priority pain point belongs to the overall a11y_company_issues pain point. This is useful for displaying results in a more hierarchical manner:

Screenshot 2024-04-06 at 7 17 00
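One way to picture how parentId feeds a hierarchical chart is to roll child matches up into their parent. The Python sketch below assumes each answer counts at most once per token, which may not be how the real charts aggregate. The `rollup_counts` helper and the `no_budget` token are hypothetical; only the low_priority to a11y_company_issues link comes from the example above.

```python
from collections import Counter

# Child token id -> parent token id.
# "no_budget" is a made-up sibling token used only for illustration.
PARENTS = {
    "low_priority": "a11y_company_issues",
    "no_budget": "a11y_company_issues",
}


def rollup_counts(answers_tokens, parents=PARENTS):
    """Count answers per token, crediting each parent once per answer."""
    counts = Counter()
    for token_ids in answers_tokens:
        expanded = set(token_ids)
        for token_id in token_ids:
            parent = parents.get(token_id)
            if parent is not None:
                expanded.add(parent)
        # Using a set means an answer matching two children of the same
        # parent still adds only 1 to the parent's count.
        counts.update(expanded)
    return counts
```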

We also support a set of custom modifiers that you can add to the end of a pattern to change how it will be matched:

Note: words also match plurals (foo will also match foos)
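The plural behavior is easy to reproduce when experimenting with patterns outside the dashboard. The Python sketch below shows one possible way to get it, by appending an optional "s" and word boundaries. It is an assumption about how such matching could work, not a description of the actual implementation, and `with_optional_plural` is a hypothetical helper name.

```python
import re


def with_optional_plural(word):
    """Build a regex matching the word alone or followed by a single 's'."""
    # Word boundaries keep "foo" from matching inside "food".
    return re.compile(r"\b" + re.escape(word) + r"s?\b", re.IGNORECASE)
```

With this helper, `with_optional_plural("foo")` finds a match in both "foo" and "too many foos", but not in "food".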

3. Manual Matching

Regular expressions will most likely not catch every answer. At that point, we can move on to manual matching through the dashboard, where you can select tokens by hand:

Screenshot 2024-04-06 at 7 20 47

4. Review

The goal is to match about 70-80% of answers. At that point, we can safely assume that any remaining unmatched answers are niche enough that they would not have a large impact on the resulting data visualizations anyway.
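If you want to check how close a question is to that target, coverage is simply the share of answers with at least one token. Here is a small Python sketch, where `match_coverage`, `meets_target`, and the 0.7 default threshold (the low end of the 70-80% range) are illustrative choices rather than part of our tooling.

```python
def match_coverage(answers_tokens):
    """Fraction of answers that have at least one token assigned."""
    if not answers_tokens:
        return 0.0
    matched = sum(1 for tokens in answers_tokens if tokens)
    return matched / len(answers_tokens)


def meets_target(answers_tokens, target=0.7):
    """True when coverage reaches the target (default: 70%)."""
    return match_coverage(answers_tokens) >= target
```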

We can then review the resulting list, and make sure the matches make sense:

Screenshot 2024-04-06 at 7 23 07

For example, here are all the matches for the styling token:

Screenshot 2024-04-06 at 7 23 26

Which questions are freeform?

When browsing survey results, we indicate that a chart is based on freeform data via a small indicator next to the question itself:

Screenshot 2024-04-06 at 7 00 05

Or, if the chart includes both predefined and freeform data (for example, a set of predefined options along with an "other…" freeform option), with an indicator next to the specific bar that results from freeform data:

Screenshot 2024-04-06 at 7 01 02

How can I get the raw data for a question?

If you'd like to obtain the raw data for a question without going through our dashboard, you can also do so via our API:

https://graphiql-h771.onrender.com/

Here is a sample query:

query GetComments {
  surveys {
    state_of_html {
      html2023 {
        accessibility {
          accessibility_pain_points {
            freeform {
              rawData {
                responseId
                raw
                tokens {
                  id
                  pattern
                }
              }
            }
          }
        }
      }
    }
  }
}
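If you would rather run this query from a script than from the GraphiQL page, the sketch below sends it as a standard GraphQL POST using only the Python standard library. Some caveats: the link above is the GraphiQL explorer, and the URL of the underlying GraphQL endpoint is not given here, so `endpoint_url` is something you need to supply yourself. The helper names are hypothetical, and the response shape is inferred from the query, assuming rawData is a list.

```python
import json
import urllib.request
from collections import Counter

QUERY = """
query GetComments {
  surveys {
    state_of_html {
      html2023 {
        accessibility {
          accessibility_pain_points {
            freeform {
              rawData { responseId raw tokens { id pattern } }
            }
          }
        }
      }
    }
  }
}
"""

# Path from the "data" key down to the rawData list, following the query.
PATH = ["surveys", "state_of_html", "html2023", "accessibility",
        "accessibility_pain_points", "freeform", "rawData"]


def build_request(endpoint_url, query=QUERY):
    """Build a JSON POST request carrying the GraphQL query."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        endpoint_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_raw_data(payload):
    """Walk the response down to the rawData entries."""
    node = payload["data"]
    for key in PATH:
        node = node[key]
    return node


def count_tokens(raw_data):
    """Count how many answers were assigned each token id."""
    counts = Counter()
    for entry in raw_data:
        # Assumption: unmatched answers may have a null or empty token list.
        for token in entry.get("tokens") or []:
            counts[token["id"]] += 1
    return counts


def fetch_raw_data(endpoint_url):
    """Send the query (requires network access) and return the rawData list."""
    with urllib.request.urlopen(build_request(endpoint_url)) as response:
        return extract_raw_data(json.load(response))
```

Keeping the network call in `fetch_raw_data` separate from the parsing helpers makes it possible to work on a saved JSON response offline.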

Questions to normalize

Here is a partial list of questions whose data needs to be wholly or partially normalized:

State of HTML 2023

Trying Things Out

You can access the dashboard here: https://surveyadmin.vercel.app/admin/normalization/ (DM me on Discord for the credentials).

While you can freely click around the dashboard, I've highlighted below the buttons that will affect the dataset. If you're still getting the hang of things, please avoid clicking them! If you do and are not sure how to reverse the change, just DM me to let me know.

Screenshot 2024-04-15 at 10 55 35

How can I help?

If you'd like to help out with any step of the process, you can join our Discord and message me (Sacha) to start coordinating.

Please mention which question(s) of which survey(s) you'd like to help normalize, based on your own knowledge and interests.