What is this about?
As part of the State of JS, HTML, CSS, etc. surveys, we collect a lot of freeform data. In other words, data collected through plain text fields such as this:
As opposed to questions with predefined options such as this one:
This means that before we can visualize this freeform data, we need to normalize it down to a set of canonical tokens. For example, we need to define that the answers "File upload is hard" and "Uploading files sucks!" both correspond to the same file_uploading_issues token, even though their raw string content is different.
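To make the goal concrete, here is a minimal Python sketch of why normalization has to happen before charting: once both answers share a token, they can be counted together. The lookup table is a hand-written stand-in for illustration only, not how the actual pipeline stores matches.

```python
from collections import Counter

# Hypothetical lookup from raw answers to canonical tokens (illustration only).
RAW_TO_TOKEN = {
    "File upload is hard": "file_uploading_issues",
    "Uploading files sucks!": "file_uploading_issues",
}

def count_tokens(answers):
    """Count answers per canonical token, skipping answers with no known token."""
    return Counter(RAW_TO_TOKEN[a] for a in answers if a in RAW_TO_TOKEN)

counts = count_tokens(["File upload is hard", "Uploading files sucks!"])
print(counts)  # Counter({'file_uploading_issues': 2})
```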
What does the whole process look like?
This is basically a 4-step process (with a lot of going back and forth between steps).
1. Defining Tokens
First, we need to define the tokens that we will normalize towards. It helps a lot to have domain knowledge about the question topic (such as "forms pain points"), and looking at the raw dataset will also help reveal the most obvious patterns.
Our custom-built internal normalization dashboard highlights when the same string appears multiple times in a row, in order to make it easier to detect recurring patterns.
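The dashboard's own implementation is not shown here, but the idea of spotting repeated strings can be sketched in a few lines of Python. This sketch assumes the answers are sorted first so that identical strings end up next to each other, and that comparison ignores case and surrounding whitespace; both are assumptions made for the example.

```python
from itertools import groupby

def normalize_key(answer):
    """Comparison key: ignore case and surrounding whitespace (an assumption)."""
    return answer.strip().lower()

def repeated_runs(answers, min_run=2):
    """Return (text, run_length) for each run of identical consecutive answers."""
    runs = []
    for text, group in groupby(answers, key=normalize_key):
        length = sum(1 for _ in group)
        if length >= min_run:
            runs.append((text, length))
    return runs

# Sorting first places identical answers next to each other.
sample = sorted(
    ["no buy-in", "slow", "No buy-in", "no buy-in ", "too complex"],
    key=normalize_key,
)
print(repeated_runs(sample))  # [('no buy-in', 3)]
```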
2. Regular Expression Matching
Once we have a good idea of the most common tokens we want to match, we can define regular expressions to do so. This is done in YAML files:
- id: low_priority
  parentId: a11y_company_issues
  patterns:
    - buy(-| )in
    - don't care
    - don't prioritize
    - don't value
Here, we are defining a new low_priority token (in this case, for the "accessibility pain points" question) that will be matched whenever the system encounters a string such as "buy-in", "buy in", "don't care", etc.
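As a rough illustration of how such a definition could be applied to answers, here is a Python sketch. It is not the project's actual matcher: the token is written as a plain dict mirroring the YAML above, and case-insensitive whole-word matching is an assumption.

```python
import re

# Token definition mirroring the YAML above, written as a dict so the
# sketch needs no YAML parser.
TOKENS = [
    {
        "id": "low_priority",
        "parentId": "a11y_company_issues",
        "patterns": ["buy(-| )in", "don't care", "don't prioritize", "don't value"],
    },
]

def match_tokens(answer, tokens=TOKENS):
    """Return the ids of all tokens with at least one pattern found in the answer."""
    matched = []
    for token in tokens:
        # Assumption: patterns match whole words, ignoring case.
        if any(
            re.search(rf"\b(?:{pattern})\b", answer, re.IGNORECASE)
            for pattern in token["patterns"]
        ):
            matched.append(token["id"])
    return matched

print(match_tokens("There is no buy-in from management"))  # ['low_priority']
print(match_tokens("Everything is fine"))                  # []
```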
Also note the parentId property. Tokens can have parents, which in this case means that the low_priority pain point belongs to the overall a11y_company_issues pain point, which is then useful to display results in a more hierarchical manner:
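One way to picture the hierarchy is to roll child counts up into their parents. The Python sketch below assumes that a parent's total includes the answers matched by its children; how the real charts aggregate parents and children is not specified here, so treat this as one possible reading.

```python
from collections import Counter

# Hypothetical child -> parent mapping built from each token's parentId.
PARENTS = {"low_priority": "a11y_company_issues"}

def rollup(token_counts, parents=PARENTS):
    """Add each token's count to the totals of all of its ancestors."""
    totals = Counter(token_counts)
    for token, count in token_counts.items():
        seen = {token}  # guards against accidental cycles in the mapping
        parent = parents.get(token)
        while parent is not None and parent not in seen:
            totals[parent] += count
            seen.add(parent)
            parent = parents.get(parent)
    return totals

totals = rollup({"low_priority": 3, "a11y_company_issues": 2})
print(totals["a11y_company_issues"])  # 5
```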
We also support a set of custom modifiers that you can add to the end of a pattern to change how it will be matched:
[p]: Match partial word fragments.
[l]: Comma-separated list of items to match in any order.
[e]: Match entire answer exactly.
[w]: Match whole words (default).
Note: words also match plurals (foo will also match foos)
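The modifiers can be read as different ways of turning a pattern into a regular expression. The Python sketch below is an interpretation, not the project's code: it assumes case-insensitive matching, that [l] requires every listed item to be present, and that the plural rule simply allows an optional trailing "s" on whole-word matches.

```python
import re

# A trailing [p], [l], [e] or [w] selects the matching mode.
MODIFIER = re.compile(r"\[([plew])\]$")

def compile_pattern(pattern):
    """Turn a pattern with an optional trailing modifier into a predicate on answers."""
    found = MODIFIER.search(pattern)
    mode = found.group(1) if found else "w"  # [w] is the documented default
    body = (pattern[: found.start()] if found else pattern).strip()
    flags = re.IGNORECASE  # assumption: matching ignores case

    if mode == "p":  # partial word fragments
        fragment = re.compile(body, flags)
        return lambda answer: fragment.search(answer) is not None
    if mode == "e":  # the entire answer must match
        exact = re.compile(body, flags)
        return lambda answer: exact.fullmatch(answer.strip()) is not None
    if mode == "l":  # every listed item must appear, in any order (assumption)
        items = [
            re.compile(rf"\b(?:{item.strip()})s?\b", flags)
            for item in body.split(",")
            if item.strip()
        ]
        return lambda answer: all(item.search(answer) for item in items)
    # Default [w]: whole words, with an optional plural "s".
    word = re.compile(rf"\b(?:{body})s?\b", flags)
    return lambda answer: word.search(answer) is not None

print(compile_pattern("foo")("I like foos"))              # True
print(compile_pattern("upload")("uploading files"))       # False
print(compile_pattern("upload[p]")("uploading files"))    # True
```

A pattern whose regular expression legitimately ends in a character class such as [p] would be misread by this sketch; a real implementation would need a way to disambiguate that case.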
3. Manual Matching
Regular expressions will most likely not catch every answer. At that point, we can then move on to manual matching through the dashboard, where you can manually select tokens:
4. Review
The goal is to match about 70-80% of answers. At this point, we can safely assume that any remaining unmatched answers are niche enough that they would not have a large impact on the resulting data visualizations anyway.
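Checking progress against that goal is a simple ratio. In this Python sketch, each answer is represented by the list of token ids it matched, which is an assumed data shape chosen for the example.

```python
def match_rate(matches_by_answer):
    """Fraction of answers that matched at least one token."""
    if not matches_by_answer:
        return 0.0
    matched = sum(1 for tokens in matches_by_answer if tokens)
    return matched / len(matches_by_answer)

# One list of matched token ids per answer; an empty list means no match yet.
rate = match_rate([["low_priority"], [], ["styling"], ["styling"]])
print(f"{rate:.0%}")  # 75%
```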
We can then review the resulting list, and make sure the matches make sense:
For example, here are all the matches for the styling token:
Which questions are freeform?
When browsing survey results, we indicate that a chart is based on freeform data via a small indicator next to the question itself:
Or, if the chart includes both predefined and freeform data (for example, a set of predefined options along with an "other…" freeform option), via an indicator next to the specific bar that results from freeform data:
Video Overview: https://www.youtube.com/watch?v=Sa2i5lYgmT8
How can I get the raw data for a question?
If you'd like to obtain the raw data for a question without going through our dashboard, you can also do so via our API:
https://graphiql-h771.onrender.com/
Here is a sample query:
Questions to normalize
Here is a partial list of questions whose data needs to be wholly or partially normalized:
State of HTML 2023
Trying Things Out
You can access the dashboard here: https://surveyadmin.vercel.app/admin/normalization/ (DM me on Discord for the credentials).
While you can freely click around the dashboard, I've highlighted below the buttons that will affect the dataset. If you're still getting the hang of things, please avoid clicking them! If you do and are not sure how to reverse the change, just DM me to let me know.
How can I help?
If you'd like to help out with any step of the process, you can join our Discord and message me (Sacha) to start coordinating.
Please mention which question(s) of which survey(s) you'd like to help normalize, based on your own knowledge and interests.