How to help with data normalization

Video Overview: https://www.youtube.com/watch?v=Sa2i5lYgmT8

What is this about?

As part of the State of JS, HTML, CSS, etc. surveys, we collect a lot of freeform data. In other words, data collected through plain text fields such as this:

As opposed to questions with predefined options such as this one:

This means that before we can visualize this freeform data, we need to normalize it down to a set of canonical tokens. For example, we need to define that answers File upload is hard and Uploading files sucks! both correspond to the same file_uploading_issues token, even though their raw string content might be different.

What does the whole process look like?

This is basically a 4-step process (with a lot of back-and-forth between steps).

1. Defining Tokens

First, we need to define the tokens that we will normalize towards. It helps a lot to have domain knowledge about the question topic (such as "forms pain points"), and looking at the raw dataset will also help reveal the most obvious patterns.

Our custom-built internal normalization dashboard highlights when the same string appears multiple times in a row, in order to make it easier to detect recurring patterns.

2. Regular Expression Matching

Once we have a good idea of the most common tokens we want to match, we can define regular expressions to do so. This is done in YAML files:

- id: low_priority
  parentId: a11y_company_issues
  patterns:
    - buy(-| )in
    - don't care
    - don't prioritize
    - don't value

Here, we are defining a new low_priority token (in this case, for the "accessibility pain points" question) that will be matched whenever the system encounters a string such as buy-in, buy in, don't care, etc.
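To make the matching step concrete, here is a minimal Python sketch of how such a token definition could be applied to raw answers. It is not the actual normalization code: the in-memory TOKENS structure, the match_tokens helper, and the choice of case-insensitive whole-word matching are all assumptions made for illustration.

```python
import re

# Hypothetical in-memory form of the YAML token definition above.
TOKENS = [
    {
        "id": "low_priority",
        "parentId": "a11y_company_issues",
        "patterns": [
            r"buy(-| )in",
            r"don't care",
            r"don't prioritize",
            r"don't value",
        ],
    },
]


def match_tokens(answer, tokens=TOKENS):
    """Return the ids of all tokens with at least one pattern found in the answer."""
    matched = []
    for token in tokens:
        for pattern in token["patterns"]:
            # Assumption: patterns are case-insensitive and match whole words.
            regex = re.compile(r"\b(?:" + pattern + r")\b", re.IGNORECASE)
            if regex.search(answer):
                matched.append(token["id"])
                break  # one matching pattern is enough for this token
    return matched
```

With this sketch, answers such as "There is no buy-in from management" and "Clients don't care about a11y" would both be assigned the low_priority token, while an answer that matches none of the patterns gets an empty list.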

Also note the parentId property. Tokens can have parents: in this case, the low_priority pain point belongs to the overall a11y_company_issues pain point, which is useful for displaying results in a more hierarchical manner.
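As a rough illustration of what the parent relationship enables, the sketch below rolls matches up to their ancestors so that a parent token is counted whenever one of its children matched. The PARENTS mapping and the count_with_parents helper are hypothetical, and how the real charts aggregate parent and child tokens may differ.

```python
from collections import Counter

# Hypothetical mapping: token id -> parent id (None for top-level tokens).
PARENTS = {
    "low_priority": "a11y_company_issues",
    "a11y_company_issues": None,
}


def count_with_parents(matched_ids_per_answer, parents=PARENTS):
    """Count answers per token, crediting each ancestor at most once per answer."""
    counts = Counter()
    for matched_ids in matched_ids_per_answer:
        credited = set()
        for token_id in matched_ids:
            current = token_id
            # Walk up the parent chain until the top, or until we reach
            # a token this answer has already been credited for.
            while current is not None and current not in credited:
                credited.add(current)
                current = parents.get(current)
        counts.update(credited)
    return counts
```

For example, two answers matching low_priority plus one answer matching a11y_company_issues directly would give a count of 2 for the child and 3 for the parent.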

We also support a set of custom modifiers that you can add to the end of a pattern to change how it will be matched:

[p]: Match partial word fragments.
[l]: Comma-separated list of items to match in any order.
[e]: Match entire answer exactly.
[w]: Match whole words (default).

Note: words also match plurals (foo will also match foos).
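One way to read these modifiers is as different ways of turning a pattern into a regular expression. The Python sketch below is a guess at that translation and not the project's implementation: the compile_pattern helper, the assumption that the modifier is appended directly to the pattern (as in foo[p]), and the decision to apply the plural rule only to whole-word and list matching are all assumptions.

```python
import re

# Assumption: the modifier is written directly at the end of the pattern.
MODIFIER_RE = re.compile(r"\[(p|l|e|w)\]$")


def compile_pattern(raw):
    """Turn a pattern with an optional trailing modifier into a compiled regex."""
    match = MODIFIER_RE.search(raw)
    modifier = match.group(1) if match else "w"  # [w] is the documented default
    body = (raw[: match.start()] if match else raw).strip()

    if modifier == "p":
        # Partial: match the fragment anywhere, even inside a longer word.
        expr = "(?:" + body + ")"
    elif modifier == "e":
        # Exact: the entire answer must equal the pattern.
        expr = r"\A(?:" + body + r")\Z"
    elif modifier == "l":
        # List: every comma-separated item must appear, in any order.
        items = [item.strip() for item in body.split(",")]
        expr = "".join(r"(?=.*\b(?:" + item + r")s?\b)" for item in items)
    else:
        # Whole word, also accepting a trailing plural "s".
        expr = r"\b(?:" + body + r")s?\b"
    return re.compile(expr, re.IGNORECASE | re.DOTALL)
```

Under these assumptions, foo matches "foos" but not "foobar", valid[p] matches "validation", none[e] matches only an answer that is exactly "none", and date, picker[l] matches any answer containing both words regardless of order.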

3. Manual Matching

Regular expressions will most likely not catch every answer. For the answers they miss, we can move on to manual matching through the dashboard, where you can select tokens by hand:

4. Review

The goal is to match about 70-80% of answers. At that point, we can safely assume that any remaining unmatched answers are niche enough that they would not have a large impact on the resulting data visualizations anyway.
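Checking progress against that target comes down to a simple ratio. The sketch below assumes each answer is represented by the list of token ids it matched (an empty list meaning unmatched); the coverage helper is hypothetical and not part of the dashboard.

```python
def coverage(matched_ids_per_answer):
    """Return the fraction of answers that matched at least one token."""
    if not matched_ids_per_answer:
        return 0.0
    matched = sum(1 for ids in matched_ids_per_answer if ids)
    return matched / len(matched_ids_per_answer)
```

A result between 0.7 and 0.8 would correspond to the 70-80% goal described above.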

We can then review the resulting list, and make sure the matches make sense:

For example, here are all the matches for the styling token:

Which questions are freeform?

When browsing survey results, we indicate that a chart is based on freeform data via a small indicator next to the question itself:

Or, if the chart includes both predefined and freeform data (for example, there was a set of predefined options along with an "other…" freeform option), via an indicator next to the specific bar that results from freeform data:

How can I get the raw data for a question?

If you'd like to obtain the raw data for a question without going through our dashboard, you can also do so via our API:

https://graphiql-h771.onrender.com/

Here is a sample query:

query GetComments {
  surveys {
    state_of_html {
      html2023 {
        accessibility {
          accessibility_pain_points {
            freeform {
              rawData {
                responseId
                raw
                tokens {
                  id
                  pattern
                }
              }
            }
          }
        }
      }
    }
  }
}
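If you would rather run this query from a script, the Python sketch below shows one possible approach using only the standard library. It assumes the API follows the common GraphQL-over-HTTP convention (a JSON POST body with a "query" field, and a JSON response with a "data" field). API_URL is a placeholder, since the link above points to a GraphiQL explorer and the actual POST endpoint is not stated here; the helper names are hypothetical too.

```python
import json
import urllib.request

# Assumption: placeholder URL, replace it with the real GraphQL endpoint.
API_URL = "https://example.com/graphql"

# Same query as above.
QUERY = """
query GetComments {
  surveys {
    state_of_html {
      html2023 {
        accessibility {
          accessibility_pain_points {
            freeform {
              rawData {
                responseId
                raw
                tokens {
                  id
                  pattern
                }
              }
            }
          }
        }
      }
    }
  }
}
"""

# Path from the response's "data" field down to rawData, mirroring the query.
RAW_DATA_PATH = (
    "surveys", "state_of_html", "html2023", "accessibility",
    "accessibility_pain_points", "freeform", "rawData",
)


def build_request(query=QUERY, url=API_URL):
    """Build a JSON POST request carrying the GraphQL query."""
    payload = json.dumps({"query": query}).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    return urllib.request.Request(url, data=payload, headers=headers, method="POST")


def extract_raw_data(response):
    """Walk a decoded JSON response down to the list of raw answers."""
    node = response["data"]
    for key in RAW_DATA_PATH:
        node = node[key]
    return node


def fetch_raw_data():
    """Send the query and return the rawData list (requires network access)."""
    with urllib.request.urlopen(build_request()) as resp:
        return extract_raw_data(json.load(resp))
```

Each item returned by fetch_raw_data() should then carry the responseId, raw, and tokens fields requested in the query.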

Questions to normalize

Here is a partial list of questions whose data needs to be wholly or partially normalized:

State of HTML 2023

forms_pain_points
interactivity_pain_points
content_pain_points
web_components_libraries
using_web_components_pain_points
making_web_components_pain_points
accessibility_disabilities
accessibility_techniques
accessibility_screenreaders
accessibility_tools
accessibility_pain_points
native_apps_tools
mobile_web_apps_pain_points
html_interoperability_features
html_functionality_features
html_missing_elements
what_do_you_use_html_for

Trying Things Out

You can access the dashboard here: https://surveyadmin.vercel.app/admin/normalization/ (DM me on Discord for the credentials).

While you can freely click around the dashboard, I've highlighted below the buttons that will affect the dataset. If you're still getting the hang of things, please avoid clicking them! If you do and are not sure how to reverse the change, just DM me to let me know.

How can I help?

If you'd like to help out with any step of the process, you can join our Discord and message me (Sacha) to start coordinating.

Please mention which question(s) of which survey(s) you'd like to help normalize, based on your own knowledge and interests.
