Descriptive link text audit doesn't make sense without localization

raffaelj commented 1 year ago

FAQ

[X] Yes, my issue is not about variability or throttling.
[X] Yes, my issue is not about a specific accessibility audit (file with axe-core instead).

URL

What happened?

Currently, the link-text audit ("Links do not have descriptive text") creates false positives, because it is not localized.

I noticed it when running tests against a whole sitemap of a multilingual website (German and English). The SEO score was 1.0 for all English pages and for the German home page. All other German pages had a score of 0.93. This was suspicious. Looking at the report data gave a hint:

{
    "audits": {
        // ...
        "link-text": {
            "id": "link-text",
            // ...
            "items": [
                {
                "href": "http://localhost/",
                "text": "Start"
                }
            ]
            }
        }
    }
}

So it was the main menu, that links to the home page.

Simplified markup:

English:

<!-- <html lang="en"> -->
<nav>
    <ul>
        <li><a href="/">Home</a></li>
        <li><a href="/about">About us</a></li>
    </ul>
</nav>

German:

<!-- <html lang="de"> -->
<nav>
    <ul>
        <li><a href="/">Start</a></li>
        <li><a href="/ueber">Über uns</a></li>
    </ul>
</nav>

Digging a bit deeper revealed a block list with with sets of localized strings, which explains the false positive.

See: https://github.com/GoogleChrome/lighthouse/blob/main/core/audits/seo/link-text.js#L25

The German translation of "home page" is "Startseite" or in short, the English word "Home" is as descriptive as the German word "Start" in that context.

I found an older issue, with the same problem for Portuguese, which was "solved" by removing "Início" from the block list.

See: https://github.com/GoogleChrome/lighthouse/issues/11026

I already found another example in the current block list: The Portuguese word "mais" is blocked, but in German "Mais" means corn, which should be a valid descriptive link text e. g. on a website about food.

Keeping the current behaviour will cause more and more false positives as soon as more localized sets of strings are added.

What did you expect?

The audit should check for the current locale. If there is no set with localized strings, the audit should report "Not applicable". Otherwise use the set of localized strings to check against.

What have you tried?

No response

How were you running Lighthouse?

node

Lighthouse Version

10.0.1

Chrome Version

Chromium 110.0.5481.177 stable

Node Version

v19.7.0

OS

Linux

Relevant log output

No response

connorjclark commented 1 year ago

I think filtering our blocklist based on the current locale is a great idea that we can do quickly.

We should discuss if we want to have our translators localize all the common bad-labels to improve this audit.

raffaelj commented 1 year ago

I had some thoughts about it and tried an implementation in the last days. It's not trivial, because we can't just check the document language. The audit is about the inner text, which can have it's own language definition(s).

This is my local test file:

<!DOCTYPE html>
<html lang="en">
    <head>
        <meta charset="utf-8">
        <title>language test</title>
    </head>
    <body>
        <a href="/">Home</a>
        <a href="/de" lang="de">Start</a>
        <a href="/de" lang="de-DE">Start</a>
        <a href="/de"><span lang="de-CH">Start</span></a>
        <a href="/info"><span lang="de-DE">click</span> <span lang="fr">here</span></a>
        <svg width="200px" height="18px">
            <a href="/info">
                <text x="0" y="18">here</text>
            </a>
            <a xlink:href="/info">
                <text x="40" y="18">here</text>
            </a>
            <a xlink:href="/info" lang="de">
                <text x="80" y="18">hier</text>
            </a>
            <a xlink:href="/info">
                <text x="120" y="18" xml:lang="de-DE">start</text>
            </a>
        </svg>
    </body>
</html>

My changes are here: https://github.com/raffaelj/lighthouse/tree/localize-link-text-audit

I tested it locally with

node cli http://localhost:8080 --only-audits="link-text" -GA --chrome-flags="--headless"

and it looks promising. I also ran yarn test to fix coding style issues and yarn smoke seo also doesn't throw any errors.

I don't fully understand the architecture. Too many scripts in package.json and I'm not really familiar with yarn...

So, it works in general, but I'm not sure about other edge cases. See:

Should I send a pull request with my current changes? Or is it better to wait for a discussion?

paulirish commented 1 year ago

@raffaelj awesome. Yeah this all makes sense. Really nice job exploring edge cases.

A PR would be great.

paulirish commented 1 year ago

FWIW the HTML spec points out the algorithm for determining language of a node: https://html.spec.whatwg.org/multipage/dom.html#attr-lang:~:text=To%20determine%20the%20language%20of%20a%20node%2C

Lighthouse does not support XHTML so I'm not sure how much the xml lang stuff is relevant... and.. i'd probably be fine with only supporting lang inside SVG...

raffaelj commented 1 year ago

@paulirish Sorry for the late reply. I didn't find any time in the last weeks to work on this problem. Also I wanted to implement some fallbacks from the language detection algorithm before sending a PR.

Than it took a while until I realized, that I can write the gatherer code in my local test file and have instant feedback in my browser instead of calling node cli... over and over again after every change. Now it looks like my code works with a huge sample of edge cases.

Lighthouse does not support XHTML so I'm not sure how much the xml lang stuff is relevant... and.. i'd probably be fine with only supporting lang inside SVG...

At first I wanted to argue against, but now I realized, that xlink:href and xml:lang are already deprecated.

See: https://developer.mozilla.org/en-US/docs/Web/SVG/Attribute/xlink:href and: https://developer.mozilla.org/en-US/docs/Web/SVG/Attribute/xml:lang

So I removed the xml stuff.

Fallbacks

According to to the lang attributes spec, there should be two fallbacks:

pragma-set-default-language
HTTP header "Content-Language"

The first one is non-conforming. Maybe we can skip it.

For the second one - I don't know, how to access http headers client-side. I marked this with a "TODO" in the PR.

Debugging

No errors in:

yarn eslint core/gather/gatherers/anchor-elements.js
yarn eslint core/audits/seo/link-text.js
yarn type-check

edit (2023-03-26): Also yarn type-check doesn't throw any errors anymore.

yarn smoke seo throws random errors now, but I'm pretty sure, it has nothing to do with my changes. Things like:

"The page did not paint any content. Please ensure you keep the browser window in the foreground during the load and try again. (NO_FCP)"
"Required Accessibility gatherer did not run."
"Required MetaElements gatherer did not run."
"Lighthouse was unable to download a robots.txt file"

In one run two of them appear, in another run a few more... as I said: random. It also happens, if I checkout the main branch. Not sure, if I have to restart my PC or if it is a problem with the current state:

# git log --oneline -n 5
fad9353c1 (HEAD -> localize-link-text-audit, origin/localize-link-text-audit) improved language code check in link-text audit
60547bdf3 refactored link-text lang detection (edge cases)
f7029e0fd Merge branch 'main' into localize-link-text-audit
2db068a68 (upstream/main, main) core: collect fetchpriority for images and rel=preload links (#14925)
a7891f363 (tag: v10.1.0) v10.1.0 (#14923)

edit: (2023-03-26) The errors from yarn smoke seo are probably caused by a combination of my window manager (i3wm) in tab mode and running multiple chrome instances (in the background). I was able to pass all tests, while switching fast between all open chrome windows.

Other thoughts

Maybe the link-text audit should be in the accessibility category instead of the current seo category. The whole audit sounds like WCAG 2.1 Link Purpose (In Context) (Level A). While reading that article, I had some more ideas, where the link-text audit should be skipped to prevent too much false positives:

If link is inside <nav>
If link is <abbr>

If the audit stays in the seo category, it should have a lower weight. In most languages of the world, the current implementation will be skipped (until many, many more lang strings are added). Also the test will produce false positives in valid contexts, e. g. <a href="https://en.wikipedia.org/wiki/Here">Here</a>.

My local test file

edit (2023-03-26): This local test file is already outdated to reflect latest changes (see commit history in my PR).

<!DOCTYPE html>
<html lang="en">
    <head>
        <meta charset="utf-8">
        <title>Edge cases for (non-) descriptive link texts</title>
    </head>
    <body>

        <h1>Edge cases for (non-) descriptive link texts</h1>

        <nav aria-label="Language">
            <!-- assert lang==en -->
            <a href="/en/lang-nav_assert-en" title="English">
                <span>English</span>
            </a>
            <!-- English title on hover, but link text is in German -->
            <!-- assert lang==de -->
            <a href="/de/lang-nav_assert-de" title="German" lang="en" rel="alternate" hreflang="de">
                <span lang="de">Deutsch</span>
            </a>
        </nav>

        <nav aria-label="Language">
            <!-- assert lang==en -->
            <a href="/en/lang-nav_assert-en" title="English">
                <span>English</span>
            </a>
            <!-- German title on hover, but link text is in English -->
            <!-- assert lang==en -->
            <a href="/de/lang-nav_assert-en" title="Deutsch" lang="de" rel="alternate" hreflang="de">
                <span lang="en">German</span>
            </a>
        </nav>

        <div>
            <h2>Links with lang region</h2>
            <!-- skipped, because it points to same site (audit tries to ignore anchor links) -->
            <a href="/">here</a>

            <a href="/lang-with-region_assert-en-US_fallback-to-en" lang="en-US">Start</a>
            <a href="/lang-with-region_assert-de-DE_fallback-to-de" lang="de-DE">Start</a>

            <a href="/lang-with-region_assert-de-CH_fallback-to-de"><span lang="de-CH">Start</span></a>
            <!-- unknown/mixed lang -->
            <a href="/mixed-lang_assert-unknown"><span lang="de-DE">click</span> <span lang="fr">here</span></a>
        </div>

        <div>
            <h2>Links with icons and text</h2>
            <a href="/link-with-icon_no-alt_assert-en">
                <img src="" alt="" />
                here
            </a>
            <a href="/link-with-icon_has-alt_assert-en">
                <img src="" alt="click" />
                here
            </a>
            <a href="/link-with-icon-and-span_has-alt_assert-en">
                <img src="" alt="click" />
                <span>here</span>
            </a>
            <a href="/link-with-icon-and-nested-span_has-alt_assert-en">
                <img src="" alt="click" />
                <span>click <span>here</span></span>
            </a>
            <!-- displayed text is "click click here", but link text is "click here" -->
            <a href="/link-with-nested-div-icon-span_has-alt_assert-en">
                <div>click <img src="" alt="click" /><span>here</span></div>
            </a>
        </div>

        <div>
            <h2>Links with German images and English text</h2>
            <a href="/link-with-german-img-and-en-text_no-alt_assert-en">
                <img src="" lang="de" />
                here
            </a><br />
            <a href="/link-with-german-img-and-en-text_has-alt_assert-en">
                <img src="" alt="click" lang="de" />
                here
            </a><br />
            <!-- displayed text is "click click here", but link text is "click here" -->
            <a href="/link-with-german-img-and-nested-span_has-alt_assert-en">
                <img src="" alt="click" lang="de" />
                <span>click <span>here</span></span>
            </a><br />
            <!-- displayed text is "clickclick here", but link text is "click here" -->
            <a href="/link-with-nested-div-german-img-span_has-alt_assert-unknown">
                <span>
                    click<img src="" alt="click" lang="de" />
                    <span>here</span>
                </span>
            </a>
        </div>

        <div>
            <h2>Inline SVG links</h2>
            <svg width="200px" height="18px">
                <!-- assert lang==en -->
                <a href="/svg-circle-text_assert-en">
                    <circle cx='2' cy='11' r='2'/>
                    <text x="5" y="18">here</text>
                </a>
                <!-- assert lang==en -->
                <a href="/svg-text_assert-en">
                    <text x="40" y="18">here</text>
                </a>
                <!-- assert lang==de -->
                <a href="/svg-text_assert-de" lang="de">
                    <text x="80" y="18">hier</text>
                </a>
                <!-- assert lang==en -->
                <a href="/svg-text_xml-lang_assert-en">
                    <text x="120" y="18" xml:lang="de-DE">start</text>
                </a>
            </svg>
        </div>

        <script>

  /** @param {HTMLElement|SVGElement|Text} node */
  function getTrimmedInnerText(node) {
    return node instanceof HTMLElement
      ? node.innerText.trim()
      : node.textContent.trim();
  }

  /** @param {HTMLAnchorElement|SVGAElement} node */
  function getTextLang(node, currentLang = '') {
    if (!currentLang) {
      const parentWithLang = node.closest('[lang]');

      // TODO: fallbacks to pragma-set-default-language or HTTP header
      currentLang = !parentWithLang ? '' : parentWithLang.getAttribute('lang');
    }

    const innerElsWithLang = node.querySelectorAll('[lang]');

    let innerTextLang = currentLang;

    if (innerElsWithLang.length) {
      const innerText = getTrimmedInnerText(node);

      for (const el of node.childNodes) {
        if (innerText === getTrimmedInnerText(el)) {
          if (el instanceof HTMLElement || el instanceof SVGElement) {
            const elLang = el.getAttribute('lang');
            const childrenWithLang = el.querySelectorAll('[lang]');

            if (!childrenWithLang.length) {
              innerTextLang = elLang || currentLang;
              break;
            } else {
              // recursive call
              innerTextLang = getTextLang(el, elLang || currentLang);
            }
          } else {
            innerTextLang = currentLang;
          }
        } else {
          innerTextLang = '';
        }
      }
    }

    return innerTextLang;
  }

// display lang and asserted lang next to found link
const anchorElements = document.querySelectorAll('a');
anchorElements.forEach(node => {
  const lang = getTextLang(node);
  let newEl = document.createElement('em');
  newEl.innerText = lang;

  const href = node.getAttribute('href') || '';
  if (href) {
    const matches = href.match(/(?:assert-)([a-zA-Z-]+)(?:_|$)/)
    const assert = matches && matches[1] ? matches[1] : null;
    if (assert) {
      newEl.innerText += ` (${assert})`;
      if ((assert !== lang) && (!lang && assert !== 'unknown')) {
        newEl.style = 'color:red';
      }
    }
  }

  node.after(newEl);

});
        </script>
    </body>
</html>

raffaelj commented 1 year ago

@paulirish @connorjclark Any thoughts or news?

I feel a bit like I sold my soul (by signing that CLA) for nothing. It's no problem to wait or even to leave this issue and my PR open forever. I just would like to know if my suggestions are welcome and if there is a chance to merge them. Otherwise I would reject the CLA (if possible).

Jacco-V commented 1 year ago

So, any updates on this issue?

benschwarz commented 1 year ago

So, any updates on this issue?

If there's updates, they will be posted.

GoogleChrome / lighthouse