Use a better link tracking approach in the LinkExposure module

jonathanmayer commented 4 years ago

Recommendations from @biancadanforth.

https://github.com/citp/web-science/pull/43#pullrequestreview-331674942

Implement a solution for keeping track of links you have already seen that does not require modifying the page, and ideally, does not require using setInterval.

I describe a few options, though I recommend the Intersection Observer API if possible.

If you do keep setInterval, there should be a way to cancel it.

https://github.com/citp/web-science/pull/43#discussion_r357494412

Is it possible to do this check without setting anything on the page? Why not use a WeakMap or WeakSet? Better still, to avoid the setInterval and avoid creating an unbounded map or set, why not use a mutation observer or the Intersection Observer API?

Does "_visited" here refer to a link that you have already seen and processed? That label is confusing, as there is also the idea that a link has been visited by the user, in which case you can collect visited links with a CSS pseudo-class, :visited.

biancadanforth commented 4 years ago

Here's a code snippet using Intersection Observer that seems to work well provided we know a priori which links we want to observe (and we define an isVisible function that checks for visibility: hidden and friends):

function handleIntersection(entries, observer) {
  entries.forEach(entry => {
    const {isIntersecting, target} = entry;
    if (isIntersecting
      && !observedLinks.has(target)
      && isVisible(target)
    ) {
      observedLinks.add(target);
      if (testForMatch(urlMatcher, target.href, target)) {
        const {width, height} = entry.boundingClientRect;
        sendMessageToBg("WebScience.linkExposureInitial", {
          href: target.href,
          size: {
            width,
            height
          }
        });
      }
      observer.unobserve(target);
    }
  });
}

const options = { threshold: 1 };
const observedLinks = new WeakSet();
const observer = new IntersectionObserver(handleIntersection, options);
const aElements = Array.from(document.body.querySelectorAll("a[href]"));
for (const aElement of aElements) {
  observer.observe(aElement);
}

I don't have any good ideas at the moment for <a> elements that are added dynamically like in a Twitter feed, but I'll think on it more, and I've asked some others for their thoughts on it.

PranayAnchuri commented 4 years ago

@biancadanforth Thanks for the code snippet. I have previously tried a very similar approach but it didn't trigger all the intersection events I expected from Twitter feed. Following is the function that I used to observe elements of interest. It returns a promise that resolves to first entry:

/**
 * Function to observe intersection of dom elements with viewport
 * @param {DOMElement} targetElement element to observe for intersection with viewport
 * @param {*} threshold intersection ratio
 * @returns promise that resolves to element when it intersects with viewport
 */
function observeElement(targetElement, threshold) {
    new Promise((resolve, reject) => {
        const observerOptions = {
            root: null, // Document viewport
            rootMargin: "0px",
            threshold // Visible amount of item shown in relation to root. 1.0 dictates that every pixel of element is visible.
        };
        const observer = new IntersectionObserver((entries, observer) => {
            targetElement.isObserved = true;
            if (
                !entries[0].isIntersecting ||
                entries[0].intersectionRatio < threshold
            ) {
                return;
            }
            observer.disconnect();
            return resolve(entries[0]);
        }, observerOptions);

        observer.observe(targetElement);
    });
}

jonathanmayer commented 4 years ago

A few thoughts on @PranayAnchuri's snippet, in roughly linear order:

The JSDoc markup is missing types for the threshold parameter and the return value, and the type of the targetElement parameter should be Element or HTMLElement.
From a quick skim of how Firefox implements the Intersection Observer API, @biancadanforth's approach of one IntersectionObserver with multiple observed Elements seems clearly preferable to the approach of one IntersectionObserver per observed Element. The latter approach has unnecessary memory and computation overhead.
Setting up a new Promise for each observed element also seems clunky. Using one callback function and data structure (as @biancadanforth implemented) is lighter weight and more readable.
The second return in the IntersectionObserver callback is unnecessary.

Edit: chatted with @PranayAnchuri. Here's a sketch of an overall approach.

Create a WeakMap to associate data with link elements, rather than using DOM expando attributes.
Create a live list of link elements in the page body with document.body.getElementsByTagName. (I think this might be a bit better performance than periodically calling document.body.querySelectorAll. That should also work fine.)
Iterate the link elements list. a. Add the link element to the WeakMap. b. Test the link URL for a match. If there is no match, store in the WeakMap that this is a link to ignore, and move to the next link element. c. Store in the WeakMap that the visibility for this link is unknown. d. Add the link element to the IntersectionObserver.
When the IntersectionObserver callback fires: a. If a link element changes visibility state to visible, mark that it’s currently visible and the timestamp. b. If a link element changes visibility state from visible to not visible, mark that it’s currently not visible, calculate the visibility duration, and update the total visibility duration in the WeakMap.
1. On a periodic interval (e.g., ~5 seconds), or when the page is unloading, iterate the link elements list. a. If there is a link element that isn’t in the WeakMap, do steps 3.a-3.d. b. If there is a link element that has been visible above a certain threshold period of time, send the link URL and element dimensions to the background page, remove the link from the IntersectionObserver, and mark the link as a link to ignore.

This hybrid approach would get us the native efficiency and accuracy of using an IntersectionObserver and the DOM update handling of a periodic timer.

A couple relevant API quirks that I stumbled across while thinking this through:

The IntersectionObserverCallback function is queued when an element first gets observed, providing an initial IntersectionObserverEntry for that element. We don't have to separately check a link element's visibility before adding it to an IntersectionObserver.
The Intersection Observer v2 API will include support for visibility checking. I'm not sure if it's going to land in Firefox anytime soon, though.

biancadanforth commented 4 years ago

It's a good idea to only observe link elements whose href attribute is a match. That would be an improvement on my implementation.

I only skimmed over your revised approach, Jonathan, but on the surface it seems like a pretty good one.

Re: dynamically added link elements; I like your idea of a live list approach if that will work, as unfortunately it looks like either polling or setting up a Mutation Observer for newly added links are the only alternatives there without modifying native code (HTMLAnchorElement::BindToTree) (thanks to @emilio).

Also, a related but separate issue: I don't think any of these approaches will cover anchor elements in the shadow DOM. Perhaps it's worth checking to see how prevalent use of a shadow DOM is for some of the most bleeding edge news sites of interest (e.g. Twitter, Facebook, ...)? My gut tells me this is pretty experimental still and not in wide use, but in case you want to check: @emilio recommended using openOrClosedShadowRoot to see if an element has a shadow root attached. If it does, you can call querySelectorAll on the shadow root to get all the anchor elements in that subtree.

jonathanmayer commented 4 years ago

Also, a related but separate issue: I don't think any of these approaches will cover anchor elements in the shadow DOM. Perhaps it's worth checking to see how prevalent use of a shadow DOM is for some of the most bleeding edge news sites of interest (e.g. Twitter, Facebook, ...)? My gut tells me this is pretty experimental still and not in wide use, but in case you want to check: @emilio recommended using openOrClosedShadowRoot to see if an element has a shadow root attached. If it does, you can call querySelectorAll on the shadow root to get all the anchor elements in that subtree.

Very good point. I just checked Facebook's feed, Twitter's feed, and Google's search results page to make sure there isn't any use of shadow DOM. The Chrome usage statistics for Element.attachShadow() also reflect low adoption so far. I agree that this isn't an issue for now.

biancadanforth commented 4 years ago

I was just profiling master, and it looks like this has been implemented? Mostly in this commit.

Screen Shot 2020-01-06 at 7 40 40 PM

If I zoom in on one of the red areas (that's jank), the Intersection Observer work (the yellow slivers on the bottom half of the screenshot) doesn't even coincide with the jank. In other words, this is an extremely performant solution!

Screen Shot 2020-01-06 at 7 44 02 PM

jonathanmayer commented 4 years ago

Excellent! We try to move fast 😃

PranayAnchuri commented 4 years ago

Done.

citp / news-disinformation-study

Use a better link tracking approach in the LinkExposure module #46