Classify textlist items individually, not all together

LeaVerou commented 8 months ago

The textlist component has two benefits:

1. It increases the number of points entered.
2. It facilitates analysis, as it breaks down (what would have been) a monolithic freeform response into multiple largely independent chunks.

Right now, surveyadmin analyzes textlist responses as a whole, effectively throwing away (2) and using the component as merely an input affordance for the same content a <textarea> would accept. This is not just a theoretical problem: it means that some of the items entered do not get analyzed at all, and we don't even notice. The dashboard UI marks the whole response as "normalized" because there is at least one category associated with it, so we may never discover that we have not analyzed all items entered. If the matching were more granular, these items would show up as unnormalized.

For example, take a look at this response:

Basically, none of the 2nd, 3rd, or 4th items have been normalized, yet the whole response is showing up as "normalized", which means we don't even get it through @ShaineRosewel's model, since that is only used as a fallback for unclassified responses.
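
To make the difference concrete, here is a minimal sketch contrasting the two checks. The `items` and `categories` field names and both helper functions are hypothetical, made up for illustration; they are not surveyadmin's actual code. The point is only that a response-level check hides unmatched items that an item-level check would surface.

```javascript
// Hypothetical per-item view of one textlist response (shape assumed for illustration).
const response = {
  items: [
    { raw: "first item", categories: ["validation"] },
    { raw: "second item", categories: [] },
    { raw: "third item", categories: [] },
  ],
};

// Response-level check: a single matched item marks the whole response as normalized.
function isNormalizedAsWhole(response) {
  return response.items.some((item) => item.categories.length > 0);
}

// Item-level check: returns the items that still have no category.
function unnormalizedItems(response) {
  return response.items.filter((item) => item.categories.length === 0);
}

console.log(isNormalizedAsWhole(response)); // true
console.log(unnormalizedItems(response).map((item) => item.raw)); // [ 'second item', 'third item' ]
```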

SachaG commented 8 months ago

Got it, thanks for the detailed explanation. For reference, this is how textlist normalization results are currently stored:

    "forms_pain_points": {
      "raw": [
        ":invalid applies from tehs tart, :user-invalid is not supported enough yet",
        "The constraint validation API relies on devs when it comes to i18n which can end up quite messy",
        "number and date input types aren't so useful because of their UX and specs (eg. the e letter accepted in a number input: I know why, but need to patch this by myself if don't want this behavior)",
        "pattern is really handy, but RegExes are hard and are too easy to get wrong"
      ],
      "normalized": [
        "validation",
        "number_input",
        "input_type_date"
      ],
      "patterns": [
        "/\\bvalidation\\b/i",
        "/\\bnumber( |-|_|.){0,2}input\\b/i",
        "/\\bdate input\\b/i"
      ]
    },

So we have 4 pain points listed but only 3 normalizations, and we don't know which normalization relates to which pain point.
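
One way to recover part of that link from already stored data is to re-run each stored pattern against each raw item. The sketch below does this for the example above. `parsePattern` and `matchPerItem` are hypothetical helpers, and the sketch assumes every stored pattern uses the "/source/flags" form shown. It only recovers which patterns matched which item; it does not recover category ids, since the two arrays are not guaranteed to line up.

```javascript
// Parse a stored pattern string such as "/\\bvalidation\\b/i" into a RegExp.
// Assumes the "/source/flags" form seen in the stored data.
function parsePattern(str) {
  const lastSlash = str.lastIndexOf("/");
  return new RegExp(str.slice(1, lastSlash), str.slice(lastSlash + 1));
}

// Hypothetical helper: for each raw item, list the stored patterns that match it.
function matchPerItem(field) {
  const regexes = field.patterns.map(parsePattern);
  return field.raw.map((raw) => ({
    raw,
    patterns: field.patterns.filter((_, i) => regexes[i].test(raw)),
  }));
}

// The stored example from this thread.
const formsPainPoints = {
  raw: [
    ":invalid applies from tehs tart, :user-invalid is not supported enough yet",
    "The constraint validation API relies on devs when it comes to i18n which can end up quite messy",
    "number and date input types aren't so useful because of their UX and specs (eg. the e letter accepted in a number input: I know why, but need to patch this by myself if don't want this behavior)",
    "pattern is really handy, but RegExes are hard and are too easy to get wrong",
  ],
  normalized: ["validation", "number_input", "input_type_date"],
  patterns: [
    "/\\bvalidation\\b/i",
    "/\\bnumber( |-|_|.){0,2}input\\b/i",
    "/\\bdate input\\b/i",
  ],
};

const perItem = matchPerItem(formsPainPoints);
console.log(perItem.map((item) => item.patterns.length)); // [ 0, 1, 2, 0 ]
```

Items with an empty `patterns` list (here the first and the last) are the ones that were never matched.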

The ideal data structure would probably look something like this instead:

{
  "forms_pain_points": [
    {
      "raw": ":invalid applies from tehs tart, :user-invalid is not supported enough yet",
      "normalized": null,
      "patterns": null
    },
    {
      "raw": "The constraint validation API relies on devs when it comes to i18n which can end up quite messy",
      "normalized": ["validation"],
      "patterns": ["/\\bvalidation\\b/i"]
    },
    {
      "raw": "number and date input types aren't so useful because of their UX and specs (eg. the e letter accepted in a number input: I know why, but need to patch this by myself if don't want this behavior)",
      "normalized": ["number_input", "input_type_date"],
      "patterns": ["/\\bnumber( |-|_|.){0,2}input\\b/i", "/\\bdate input\\b/i"]
    },
    {
      "raw": "pattern is really handy, but RegExes are hard and are too easy to get wrong",
      "normalized": null,
      "patterns": null
    }
  ]
}

Our system doesn't support it at the moment, though, and even if we did change it, we would also have to keep supporting previously normalized data that still follows the old data structure.
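
As a rough illustration of what supporting both shapes could look like on the reading side, the sketch below accepts either structure and always returns per-item records. `toItems` and the `legacy` flag are hypothetical names, not part of the existing system. For old data the per-item attribution is unknown, so the sketch flags it instead of writing `null`, which in the per-item structure means "no match".

```javascript
// Hypothetical reader: accepts the old or the new shape, returns per-item records.
function toItems(field) {
  if (Array.isArray(field)) {
    return field; // already stored per item
  }
  // Old shape: we know the raw items but not which category belongs to which one.
  return field.raw.map((raw) => ({ raw, legacy: true }));
}

const oldShape = {
  raw: ["item 0", "item 1"],
  normalized: ["validation"],
  patterns: ["/\\bvalidation\\b/i"],
};
const newShape = [
  { raw: "item 0", normalized: null, patterns: null },
  { raw: "item 1", normalized: ["validation"], patterns: ["/\\bvalidation\\b/i"] },
];

console.log(toItems(oldShape).length); // 2
console.log(toItems(newShape) === newShape); // true
```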

LeaVerou commented 8 months ago

Orthogonal to this issue, but since each regex belongs to a single category, it may be good to preserve that link too (for all questions, not just textlists). I.e. instead of two parallel arrays (which are not really parallel, because what if multiple regexes from the same category match?):

    "normalized": [
        "validation",
        "number_input",
        "input_type_date"
    ],
    "patterns": [
        "/\\bvalidation\\b/i",
        "/\\bnumber( |-|_|.){0,2}input\\b/i",
        "/\\bdate input\\b/i"
    ]

What if normalized became an object literal with metadata?

    "normalizedDetails": {
        "validation": {
            "patterns": ["/\\bvalidation\\b/i"]
        },
        "number_input": {
            "patterns": ["/\\bnumber( |-|_|.){0,2}input\\b/i"]
        },
        "input_type_date": {
            "patterns": ["/\\bdate input\\b/i"]
        }
    }

You can use accessors for backwards compatibility:

    get normalized () {
        return Object.keys(this.normalizedDetails);
    }

    get patterns () {
        return Object.values(this.normalizedDetails).map(o => o.patterns).flat();
    }
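
For illustration, here is how those accessors might behave when one category has more than one matching regex. The `NormalizedField` class and the second `validation` pattern are made up for this sketch; only `normalizedDetails` and the two accessors come from the proposal above.

```javascript
// Hypothetical wrapper class around the proposed normalizedDetails object.
class NormalizedField {
  constructor(normalizedDetails) {
    this.normalizedDetails = normalizedDetails;
  }

  // Category ids, in insertion order.
  get normalized() {
    return Object.keys(this.normalizedDetails);
  }

  // Every matched pattern, flattened across categories.
  get patterns() {
    return Object.values(this.normalizedDetails).flatMap((o) => o.patterns);
  }
}

const detailsField = new NormalizedField({
  // Two patterns for one category; the second one is invented for this example.
  validation: { patterns: ["/\\bvalidation\\b/i", "/\\bvalidat(e|ing)\\b/i"] },
  number_input: { patterns: ["/\\bnumber( |-|_|.){0,2}input\\b/i"] },
});

console.log(detailsField.normalized); // [ 'validation', 'number_input' ]
console.log(detailsField.patterns.length); // 3
```

With two parallel arrays, the second case would yield two categories and three patterns with no way to pair them up; keying by category keeps the link.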

Now on to the issue at hand. I wonder if augmenting the current data model may get us there faster than trying to get to the ideal data model. Especially with the changes above, it’s trivial to add metadata to each category about the specific items matched:

    "normalizedDetails": {
        "validation": {
            "patterns": ["/\\bvalidation\\b/i"],
            "submatches": [1]
        },
        "number_input": {
            "patterns": ["/\\bnumber( |-|_|.){0,2}input\\b/i"],
            "submatches": [2]
        },
        "input_type_date": {
            "patterns": ["/\\bdate input\\b/i"],
            "submatches": [2]
        }
    }

Note that submatches is not specific to textlist: it could accommodate other forms of structured input too.

One downside of that is that it's a bit involved to get the actual unnormalized items, but not too huge of a deal:

    get unnormalized () {
        // Indices of every raw item that matched at least one category
        let normalizedIndices = new Set(Object.values(this.normalizedDetails).map(o => o.submatches).flat());
        // Keep only the raw items whose index was never matched
        return this.raw.filter((_, i) => !normalizedIndices.has(i));
    }
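
Applied to the example from this thread, the accessor would return the first and last items. This is a sketch under two assumptions: `submatches` holds zero-based indices into `raw` (which is what the example above implies), and the raw strings are replaced by short placeholders to keep it readable.

```javascript
const submatchField = {
  // Placeholders standing in for the four pain points entered.
  raw: ["item 0", "item 1", "item 2", "item 3"],
  normalizedDetails: {
    validation: { patterns: ["/\\bvalidation\\b/i"], submatches: [1] },
    number_input: { patterns: ["/\\bnumber( |-|_|.){0,2}input\\b/i"], submatches: [2] },
    input_type_date: { patterns: ["/\\bdate input\\b/i"], submatches: [2] },
  },
  get unnormalized() {
    // Indices of raw items matched by at least one category (duplicates collapse in the Set).
    const normalizedIndices = new Set(
      Object.values(this.normalizedDetails).flatMap((o) => o.submatches)
    );
    return this.raw.filter((_, i) => !normalizedIndices.has(i));
  },
};

console.log(submatchField.unnormalized); // [ 'item 0', 'item 3' ]
```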

SachaG commented 7 months ago

I think I'll go with a data structure like this:

  forms_pain_points: {
    normalized: ["validation", "number_input", "input_type_date", "pattern"],
    metadata: [
      {
        raw: ":invalid applies from tehs tart, :user-invalid is not supported enough yet",
        tokens: [],
      },
      {
        raw: "The constraint validation API relies on devs when it comes to i18n which can end up quite messy",
        tokens: [{ id: "validation", pattern: "dsafksdljfsdkjf" }],
      },
      {
        raw: "number and date input types aren't so useful because of their UX and specs (eg. the e letter accepted in a number input: I know why, but need to patch this by myself if don't want this behavior)",
        tokens: [
          { id: "number_input", pattern: "dsafksdljfsdkjf" },
          { id: "input_type_date", pattern: "sdsdfsdfsd" },
        ],
      },
      {
        raw: "pattern is really handy, but RegExes are hard and are too easy to get wrong",
        tokens: [{ id: "pattern", pattern: "dsafksdljfsdkjf" }],
      },
    ],
  },

We keep the current normalized field as it is, since this is the one used for the results.
We delete raw and patterns, since they are only used for normalization.
We replace them with a new metadata field that contains metadata about the normalization, including the original raw strings and a tokens array of objects, each holding the matched token id and the regex pattern used to match it.
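
With this structure, the top-level normalized list and the list of unmatched items can both be derived from metadata. The sketch below shows one possible way; `deriveNormalized` and `untokenized` are hypothetical helpers, and the raw strings and patterns are placeholders.

```javascript
// Placeholder data following the metadata shape above.
const metadata = [
  { raw: "item 0", tokens: [] },
  { raw: "item 1", tokens: [{ id: "validation", pattern: "(placeholder)" }] },
  {
    raw: "item 2",
    tokens: [
      { id: "number_input", pattern: "(placeholder)" },
      { id: "input_type_date", pattern: "(placeholder)" },
    ],
  },
  { raw: "item 3", tokens: [{ id: "pattern", pattern: "(placeholder)" }] },
];

// Union of all token ids across items, de-duplicated, in first-seen order.
function deriveNormalized(metadata) {
  return [...new Set(metadata.flatMap((item) => item.tokens.map((token) => token.id)))];
}

// Raw strings of the items that matched no token at all.
function untokenized(metadata) {
  return metadata.filter((item) => item.tokens.length === 0).map((item) => item.raw);
}

console.log(deriveNormalized(metadata)); // [ 'validation', 'number_input', 'input_type_date', 'pattern' ]
console.log(untokenized(metadata)); // [ 'item 0' ]
```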

SachaG commented 7 months ago

I'm also splitting up the sub-answers:

E.g. each line in that screenshot is a different sub-answer from the textlist component, taken from different responses.

Devographics / Monorepo

Classify textlist items individually, not all together #331