EqualifyEverything / equalify

A web accessibility platform, managing issues by integrating with A11Y services.
https://equalify.app

What metrics can we use to score accessibility success? #76

Closed bbertucc closed 1 year ago

bbertucc commented 2 years ago

The most distinctive feature of Equalify is its ability to scan every page on a site. That feature could be promoted with a website-wide score instead of a page score, as suggested by @azdak. This could be introduced into the "Sites" view.

kreynen commented 2 years ago

I really like the way most pages on https://www.unl.edu/ have a QA Test link in the footer to page level summary reports https://webaudit.unl.edu/sites/1048/pages/13710099/

bbertucc commented 2 years ago

> I really like the way most pages on https://www.unl.edu/ have a QA Test link in the footer to page level summary reports https://webaudit.unl.edu/sites/1048/pages/13710099/

That would not only be a great tool, but a great way to promote the Equalify project. I'm imagining a simple badge we create that yanks the score from an Equalify API. That or we can just add the option for a publicly available report.

I imagine these updates would be important as we start to build out the reports. My plan was to get enough data in the DB to start building meaningful reports. But plans can change if anyone submits a PR or adds a wave of 👍.

bbertucc commented 2 years ago

Amending this title to reflect @kreynen's great feedback. This could also lead to an expanded report-building sprint...

bbertucc commented 2 years ago

Curious, is there any standard pass/fail grading like UNL's? We could have users set what pass/fail is, but perhaps there's some secret higher-ed formula that I'm unaware of...

bbertucc commented 1 year ago

@mgifford added a grade onto purple hats (see #127). I'm curious how he came up with that grade.

I heard from some folks who think a grading system is "arbitrary" and others who love it.

mgifford commented 1 year ago

Grading is arbitrary. It is true that you can't manage what you can't measure. It is also true that not everything that matters can be measured. At least not with automated tools. So we're in a situation where we have to try to use it responsibly and educate our users about its value.

Having a 95 or even 100 score on Google Lighthouse or Siteimprove does not mean that your website is more or less accessible to PwD than a site with a lower score. It is probably a good indication, but not always.

It can be a good motivator though. We like to measure things; departments want to be at the head of a leaderboard.

Take a look here https://github.com/CivicActions/purple-hats/blob/master/mergeAxeResults.js#L401

    // Scoring for grade
    // Score = (critical*3 + serious*2 + moderate*1.5 + minor) / urls*5
    // A+ = 0 ; A <= 0.1 ; A- <= 0.3 ;
    // B+ <= 0.5 ; B <= 0.7 ; B- <= 0.9 ;
    // C+ <= 2 ; C <= 4 ; C- <= 6 ;
    // D+ <= 8 ; D <= 10 ; D- <= 13 ;
    // F+ <= 15 ; F <= 20 ; F- >= 20 ;
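
For anyone skimming, here is a minimal JavaScript sketch of that formula and those bands. The function name and the assumption that the divisor is `(urls * 5)` are mine, not the actual purple-hats code:

    // Weighted issue score: heavier weights for more severe issues,
    // normalized by the number of URLs scanned (assumed here to be urls * 5).
    function scoreToGrade(issueCounts, urlCount) {
      const { critical = 0, serious = 0, moderate = 0, minor = 0 } = issueCounts;
      const score =
        (critical * 3 + serious * 2 + moderate * 1.5 + minor) / (urlCount * 5);

      // Grade bands copied from the comment block above.
      if (score === 0) return 'A+';
      if (score <= 0.1) return 'A';
      if (score <= 0.3) return 'A-';
      if (score <= 0.5) return 'B+';
      if (score <= 0.7) return 'B';
      if (score <= 0.9) return 'B-';
      if (score <= 2) return 'C+';
      if (score <= 4) return 'C';
      if (score <= 6) return 'C-';
      if (score <= 8) return 'D+';
      if (score <= 10) return 'D';
      if (score <= 13) return 'D-';
      if (score <= 15) return 'F+';
      if (score <= 20) return 'F';
      return 'F-';
    }

    // e.g. scoreToGrade({ critical: 2, serious: 5, moderate: 10, minor: 20 }, 50) === 'A-'
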
mgifford commented 1 year ago

It might be good to be able to turn the grade on/off per site. What I wanted to do was prioritize things so that critical issues were managed more than the others, while also counting the volume so that it was easier to manage.

What Google Lighthouse did was definitely more refined: https://developer.chrome.com/docs/lighthouse/accessibility/scoring/

They are doing it on a page level though. Not that this can't be aggregated across a set of pages.
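
For reference, the linked Lighthouse doc describes the page score as a weighted average of pass/fail audits. Here is a rough sketch of that, plus one purely illustrative way to roll it up across pages (the site-level mean is my assumption, not something Lighthouse does):

    // Page score in the Lighthouse style: each audit is pass/fail and carries a weight;
    // the page score is the weight of passing audits over the total weight.
    function pageScore(audits) {
      const totalWeight = audits.reduce((sum, a) => sum + a.weight, 0);
      const passedWeight = audits
        .filter((a) => a.passed)
        .reduce((sum, a) => sum + a.weight, 0);
      return totalWeight === 0 ? 1 : passedWeight / totalWeight;
    }

    // One possible site-wide roll-up (an assumption, not part of Lighthouse):
    // the unweighted mean of per-page scores.
    function siteScore(pages) {
      return pages.reduce((sum, audits) => sum + pageScore(audits), 0) / pages.length;
    }

    // e.g. siteScore([
    //   [{ id: 'image-alt', weight: 10, passed: true }, { id: 'label', weight: 7, passed: false }],
    //   [{ id: 'image-alt', weight: 10, passed: true }, { id: 'label', weight: 7, passed: true }],
    // ]) is roughly 0.79
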

mgifford commented 1 year ago

Oh ya, messaging... You can also see in the code how the Grades come along with messages:

A+ - "No axe errors, great! Have you tested with a screen reader?" A - "Very few axe errors left! Don't forget manual testing." A- - "So close to getting the automated errors! Remember keyboard only testing." B+ - "More work to eliminate automated testing errors. Have you tested zooming the in 200% with your browser." B - "More work to eliminate automated testing errors. Are the text alternatives meaningful?" B- - "More work to eliminate automated testing errors. Don't forget manual testing." C+ - "More work to eliminate automated testing errors. Have you tested in grey scale to see color isn't conveying meaning?" ....

bbertucc commented 1 year ago

I'm thinking the grade is an added integration. An "Equalify Score" integration.

We still want to be able to run lots of different types of scans (i.e., WAVE, axe-core, language, ...). The integration would have to be open to supporting (or not supporting) the additional scans.

Perhaps a user could customize a scoring rubric so the grades didn't seem so arbitrary?
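
As a purely hypothetical sketch of what a user-tunable rubric could look like (none of these keys exist in Equalify today; the names are illustrative):

    // Hypothetical per-report scoring rubric a user could customize.
    const scoringRubric = {
      integrations: { 'axe-core': true, wave: true, language: false }, // which scans count toward the score
      tagWeights: { 'wcag2.1aa': 1, 'alt-text': 2 },                   // per-tag multipliers
      gradeBands: { A: 0.9, B: 0.8, C: 0.7, D: 0.6 },                  // minimum equalified ratio for each grade
    };
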

bbertucc commented 1 year ago

Update: #204 brings this issue back into focus. We still need to find that scoring metric! @mgifford's seems like the best (or only) real option thus far.

bbertucc commented 1 year ago

Here's the lighthouse scoring metric: https://developer.chrome.com/docs/lighthouse/accessibility/scoring/#:~:text=The%20Lighthouse%20Accessibility%20score%20is,partially%20passing%20an%20accessibility%20audit.

bbertucc commented 1 year ago

Interesting: the US Department of Justice discourages a percentage-based approach in its latest proposed web accessibility rules. See https://www.federalregister.gov/d/2023-15823/p-456 and https://www.federalregister.gov/d/2023-15823/p-457

In response to those critiques, they offer another option, which nixes the idea of a score and favors robust reports that show actions taken over time: https://www.federalregister.gov/d/2023-15823/p-463

Curious if @mgifford and others have feedback on what the DOJ proposes.

bbertucc commented 1 year ago

I created a sample scoring system here: https://docs.google.com/spreadsheets/d/1GCZlsa83V4NjF5QBDkPJ4CTxZrz0jb0_KyHbb-Ee5ZA/edit?usp=sharing

Does that satisfy your wildest dreams @mgifford?

mgifford commented 1 year ago

It needs more documentation about what this means.

bbertucc commented 1 year ago

Good point @mgifford. Here's an example user story that illustrates the feature and math we're using:

  1. Gloria, an accessibility expert who manages thousands of Example University's (EU's) web pages, is tasked with quickly articulating the state of accessibility for EU's main website, example.edu. Her boss only cares about pages related to example.edu and values WCAG 2.1 AA-related alerts. EU leadership also has a specific goal of equalifying alternative text alerts.
  2. Gloria creates a new report in Equalify. She adds filters so example.edu is the only property that is showing. She also only includes tags related to WCAG 2.1 AA and sets the weight of alt-related tags to "2" instead of its default "1".
  3. Equalify then filters all alerts. There are a total of 500 alerts that meet the filter specifications. 150 of those alerts are alt-related. 300 alerts are equalified, and 50 of those equalified alerts are alt-related. To determine the grade, Equalify does the following math: equalified alerts (50 × 2 + 250 × 1 = 350) / total possible score (150 × 2 + 350 × 1 = 650) = grade (54%: D). A code sketch of this calculation follows the list.
  4. Management uses that score as a benchmark for fixes. After three months, Gloria can clearly show progress in accessibility, moving from a "D" to an "A" by equalifying alerts.
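
Here's a minimal code sketch of the step 3 math; the field and function names are made up for illustration:

    // Grade = weighted equalified alerts / weighted total alerts, using per-tag weights.
    function weightedGrade(alerts, tagWeights) {
      const weightOf = (alert) => tagWeights[alert.tag] ?? 1;
      const total = alerts.reduce((sum, a) => sum + weightOf(a), 0);
      const equalified = alerts
        .filter((a) => a.equalified)
        .reduce((sum, a) => sum + weightOf(a), 0);
      return equalified / total; // 350 / 650 = 54% (a "D") for the numbers above
    }
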

Does that clarify things? Would people actually have that use case? Are there other use cases to consider when we're creating a score to add to reports? (I should note that a score is optional, so not everyone will need to have it.)

bbertucc commented 1 year ago

Closing this in favor of #242