Improvements to the LinkResolution module

jonathanmayer commented 4 years ago

Feedback from an initial code review:

[x] The debugLog line has the wrong module name.
[x] The isResolving variable doesn’t seem to be necessary.
[x] We should wait to register the web request listener(s) until the module gets used. See how initialized and initialize work in the PageEvents module.
[x] We can ditch the onRequest event listener, since the same information is available in the ~~onRedirect~~ onHeadersReceived event listener.
[x] We can ditch the onResponse event listener, since it isn’t really accurate—the absence of an HTTP(S) redirect doesn’t tell us anything about whether a resource has an HTML or JS redirect.
[x] We shouldn’t return the entire set of redirects by default, just the final resolved URL (or an indication that resolution failed). I don’t have a preference on whether we provide optional functionality to return the entire set of redirects.
[x] We should disable automatic redirect following in the Fetch API (using the manual redirect option). Fully loading all the shortened/shimmed links that the user is exposed to is not a good idea. Instead, we should try to limit the HTTP(S) traffic to looking up redirects.
[x] We should use an onHeadersReceived listener to resolve shortened links. See #42.
[x] We should drop as many headers as possible when using the Fetch API, especially cookies (to avoid messing with the user’s experience in any way).
[x] We should set the User-Agent header to empty when using the Fetch API, since some shorteners (e.g., Google and Twitter) use JavaScript redirects rather than HTTP redirects for recognizable browsers.
[x] We should only be listening to HTTP(S) requests to known shortener domains, rather than <all_urls>.
[x] We should only be listening to HTTP(S) requests that originate from the extension, rather than ordinary web content.
[x] The timeout for HTTP(S) requests should be a constant at the top of the module.
[x] The module should be consistent in its use of var and let in variable declarations.
[x] The set of shortener domains should live somewhere in /WebScience/ (e.g., /WebScience/dependencies/), since it isn’t specific to the news study.
[x] It looks like the module is missing link shim support. We need to be able to detect and parse those in the module, too.
[x] We should be consistent about using descriptive names for functions and variables. For example: getInitial - get initial what? getChainLength - what chain? getLatest - the latest of what? respond - respond to what? nredirects - what does this do? shortenerLength - length of what? store - storing what? All of these (and similar) should have self-explanatory names.
[x] We should have comments that explain, at minimum, each of the functions and constants.
[x] We should support Google and Google News link shims.
[x] We should keep track of link resolution errors, to make sure that our assumptions (e.g., that rate limiting and JavaScript-based redirects won't be common) hold up.
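
For the item about listening only to known shortener domains rather than <all_urls>, here is a minimal sketch of how a domain list could be turned into WebExtension match patterns. The helper name and the example domain list are hypothetical and are not taken from the module.

```javascript
// Hypothetical helper: convert shortener domains into WebExtension match
// patterns so a webRequest listener can be scoped to those hosts only.
function shortenerDomainsToMatchPatterns(domains) {
  return domains
    .map((domain) => domain.trim().toLowerCase())
    .filter((domain) => domain.length > 0)
    // "*://*.example.com/*" covers http and https on the domain and its subdomains.
    .map((domain) => `*://*.${domain}/*`);
}

// Illustrative domain list (an assumption, not the project's actual list).
const exampleShortenerDomains = ["bit.ly", "t.co", "goo.gl"];
const exampleMatchPatterns = shortenerDomainsToMatchPatterns(exampleShortenerDomains);

// In the extension, the patterns would be passed as the listener's URL filter:
//   browser.webRequest.onHeadersReceived.addListener(
//     listener, { urls: exampleMatchPatterns }, ["responseHeaders"]);
```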

jonathanmayer commented 4 years ago

Notes from discussion with @PranayAnchuri this morning:

If we set the Fetch API's redirect option to manual, the webRequest.onHeadersReceived and webRequest.onBeforeRedirect events don't fire. That means we have no way to read the redirection target URL.
As a workaround, here's what we're doing:
- We issue an ordinary GET request with the Fetch API, following redirects.
- We watch for the redirect with webRequest.onBeforeRedirect.
- We watch for the redirected request with webRequest.onBeforeRequest and if the redirected request isn't hitting another shortener (or there's too much recursion) we cancel the request.
We're not going to use HEAD requests, for now, since they might produce unexpected behavior from link shorteners.
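
The cancel decision in the workaround above could be sketched roughly as follows. The function names, the depth limit, and the way recursion depth is tracked are assumptions for illustration, not the module's actual code.

```javascript
// Assumed limit on how many shortener-to-shortener hops to follow.
const MAX_REDIRECT_DEPTH = 4;

// True if the hostname is a listed shortener domain or one of its subdomains.
function isShortenerHostname(hostname, shortenerDomains) {
  const host = hostname.toLowerCase();
  return shortenerDomains.some(
    (domain) => host === domain || host.endsWith(`.${domain}`)
  );
}

// Decide whether a redirected request should be cancelled: cancel when the
// target is not another shortener, when recursion is too deep, or when the
// URL cannot be parsed.
function shouldCancelRedirectedRequest(redirectUrl, redirectDepth, shortenerDomains) {
  let hostname;
  try {
    hostname = new URL(redirectUrl).hostname;
  } catch {
    return true;
  }
  if (redirectDepth >= MAX_REDIRECT_DEPTH) {
    return true;
  }
  return !isShortenerHostname(hostname, shortenerDomains);
}

// In a blocking webRequest.onBeforeRequest listener, the result would map to
// the listener's return value, e.g. { cancel: true }.
```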

jonathanmayer commented 4 years ago

Update: back to using the Fetch API with redirect set to manual, plus a webRequest.onHeadersReceived listener. Details in #42.
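
A rough sketch of that combination: fetch options with redirect set to manual, plus a helper that reads the redirect target from the response headers a webRequest.onHeadersReceived listener receives. The constant and function names are hypothetical, and the actual approach is described in #42, which is not reproduced here.

```javascript
// Assumed fetch options: do not follow redirects, send no cookies, skip the cache.
const RESOLUTION_FETCH_OPTIONS = {
  method: "GET",
  redirect: "manual",
  credentials: "omit",
  cache: "no-store",
};

// Given the status code, response headers (an array of { name, value }
// objects, as delivered to onHeadersReceived), and the request URL, return
// the absolute redirect target, or null if there is none.
function extractRedirectTarget(statusCode, responseHeaders, requestUrl) {
  if (statusCode < 300 || statusCode >= 400) {
    return null;
  }
  const locationHeader = (responseHeaders || []).find(
    (header) => header.name.toLowerCase() === "location"
  );
  if (!locationHeader || !locationHeader.value) {
    return null;
  }
  try {
    // Location may be relative, so resolve it against the request URL.
    return new URL(locationHeader.value, requestUrl).href;
  } catch {
    return null;
  }
}
```

Returning null for anything that is not a 3xx response with a usable Location header gives the caller a single "resolution failed" signal to record.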

PranayAnchuri commented 4 years ago

Google News links are not resolved by making network requests. The HTML of news.google.com maintains a mapping (hidden in a script tag) between the article ID and the corresponding source. To extract this mapping for an article, we need to build a regex from the article ID and scan the document's innerHTML for an https URL group near that ID. This could get messy. Moreover, even after extracting the mapping, we wouldn't be able to tell whether the article is in the current viewport. The following regex shows an example:

https://regexr.com/4rf57

jonathanmayer commented 4 years ago

It looks like Google News both stores the resolved link in the page (for the sharing pane) and resolves the link with an HTTP(S) request (when clicked). Weird. Given how difficult it is to extract the resolved links from the page, I think we should stick with HTTP(S) resolution for now.

Edit: here are my notes on the Google News URL shim from Slack on 11/23.

@PranayAnchuri poked at the https://news.google.com/article/[blob] link shim a bit today. It’s pretty messy… it’s a base64url encoded protocol buffer, with terminal characters dropped. Sometimes the format is easy to decode (e.g., field 1: 19, field 4: [URL], optional field 26: [AMP URL], or field 1: 32, field 4: [YouTube video ID]). I haven’t been able to figure out how to decode a number of URLs, though (e.g., field 1: 2, then who knows). If you’re interested, the GCHQ CyberChef tool is convenient for this (just set the workflow to base64 decode with the URL safe option, then protobuf decode).
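
For poking at these blobs in code rather than CyberChef, here is a minimal sketch. It assumes that "terminal characters dropped" refers to the trailing = padding, handles only varint and length-delimited fields at the top level, and treats every length-delimited field as UTF-8 text (it may actually be a nested message). All function names are hypothetical.

```javascript
// Decode base64url text, restoring any missing "=" padding first.
function decodeBase64Url(text) {
  const base64 = text.replace(/-/g, "+").replace(/_/g, "/");
  const padded = base64 + "=".repeat((4 - (base64.length % 4)) % 4);
  const binary = atob(padded);
  return Uint8Array.from(binary, (character) => character.charCodeAt(0));
}

// Read a protobuf varint starting at offset; returns the value and next offset.
function readVarint(bytes, offset) {
  let value = 0;
  let multiplier = 1;
  let position = offset;
  while (position < bytes.length) {
    const byte = bytes[position++];
    value += (byte & 0x7f) * multiplier;
    if ((byte & 0x80) === 0) {
      return { value, next: position };
    }
    multiplier *= 128;
  }
  throw new Error("Truncated varint");
}

// List top-level fields. Wire type 0 is a varint, wire type 2 is
// length-delimited; other wire types are not handled in this sketch.
function listTopLevelFields(bytes) {
  const fields = [];
  let offset = 0;
  while (offset < bytes.length) {
    const tag = readVarint(bytes, offset);
    const fieldNumber = Math.floor(tag.value / 8);
    const wireType = tag.value % 8;
    offset = tag.next;
    if (wireType === 0) {
      const varint = readVarint(bytes, offset);
      fields.push({ fieldNumber, wireType, value: varint.value });
      offset = varint.next;
    } else if (wireType === 2) {
      const length = readVarint(bytes, offset);
      const end = length.next + length.value;
      if (end > bytes.length) {
        throw new Error("Truncated length-delimited field");
      }
      // Assumption: interpret the payload as UTF-8 text.
      const text = new TextDecoder().decode(bytes.subarray(length.next, end));
      fields.push({ fieldNumber, wireType, value: text });
      offset = end;
    } else {
      throw new Error(`Unsupported wire type ${wireType}`);
    }
  }
  return fields;
}
```

Throwing on unsupported wire types keeps the undecodable cases visible rather than silently returning partial results.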

For now, we should probably just resolve these links the same way as link shorteners. One extra wrinkle, though… if you hit a news.google.com link shim with a browser User-Agent header, you get an HTML document that does some tracking and a JS redirect, rather than an HTTP Location response header. If you hit the link shim with a non-browser User-Agent, though, you get an HTTP Location response header. So we’ll have to do User-Agent spoofing when resolving news.google.com link shims.
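
One way the User-Agent spoofing could be sketched is as a header rewrite applied to the extension's own resolution requests, for example from a blocking webRequest.onBeforeSendHeaders listener. The function name, the set of dropped headers, and the empty User-Agent value (borrowed from the checklist item about setting the header to empty) are assumptions, not the module's confirmed behavior.

```javascript
// Assumed replacement value; an empty string stands in for a non-browser agent.
const NON_BROWSER_USER_AGENT = "";

// Headers assumed safe to drop so resolution requests carry no user state.
const DROPPED_REQUEST_HEADERS = new Set(["cookie", "referer"]);

// Takes the requestHeaders array (objects with name and value) and returns a
// new array with the dropped headers removed and the User-Agent replaced.
function rewriteResolutionRequestHeaders(requestHeaders, userAgent) {
  const keptHeaders = requestHeaders.filter((header) => {
    const name = header.name.toLowerCase();
    return name !== "user-agent" && !DROPPED_REQUEST_HEADERS.has(name);
  });
  keptHeaders.push({ name: "User-Agent", value: userAgent });
  return keptHeaders;
}

// In the listener, the result would be returned as:
//   { requestHeaders: rewriteResolutionRequestHeaders(details.requestHeaders,
//                                                     NON_BROWSER_USER_AGENT) }
```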

PranayAnchuri commented 4 years ago

Done with all the issues mentioned above.

citp / news-disinformation-study

Improvements to the LinkResolution module #36