jonathanmayer closed this issue 4 years ago
Notes from discussion with @PranayAnchuri this morning:
- [...] manual for redirect following, the webRequest.onHeadersReceived and webRequest.onBeforeRedirect events don't fire. That means we have no way to read the redirection target URL.
- [...] webRequest.onBeforeRedirect.
- [...] webRequest.onBeforeRequest and if the redirected request isn't hitting another shortener (or there's too much recursion) we cancel the request.
- Update: back to using the Fetch API with redirect set to manual, plus a webRequest.onHeadersReceived listener. Details in #42.
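A minimal sketch of the approach described in the update above, not the code from #42: the redirect target is read from the response headers that a webRequest.onHeadersReceived listener receives. The helper names, the shortener host list, and the recursion limit are assumptions made for illustration.

```javascript
// Hypothetical values for illustration only.
const SHORTENER_HOSTS = new Set(["bit.ly", "t.co", "news.google.com"]);
const MAX_REDIRECT_DEPTH = 4;

// Return the absolute redirect target from an onHeadersReceived-style details
// object ({ url, statusCode, responseHeaders: [{ name, value }] }), or null
// when the response is not a redirect or carries no usable Location header.
function extractRedirectTarget(details) {
  if (details.statusCode < 300 || details.statusCode >= 400) {
    return null;
  }
  const header = (details.responseHeaders || []).find(
    (h) => h.name.toLowerCase() === "location"
  );
  if (!header || !header.value) {
    return null;
  }
  try {
    // Location may be relative, so resolve it against the request URL.
    return new URL(header.value, details.url).href;
  } catch (e) {
    return null;
  }
}

// Keep resolving only while the target is another known shortener and the
// redirect chain has not grown too long.
function shouldKeepResolving(targetUrl, depth) {
  return (
    depth < MAX_REDIRECT_DEPTH &&
    SHORTENER_HOSTS.has(new URL(targetUrl).hostname)
  );
}

// Register the listener only when running inside a WebExtension.
// Assumes the extension holds the webRequest permission and host permissions.
if (typeof browser !== "undefined" && browser.webRequest) {
  browser.webRequest.onHeadersReceived.addListener(
    (details) => {
      const target = extractRedirectTarget(details);
      if (target !== null) {
        console.log("redirect", details.url, "->", target);
      }
    },
    { urls: ["<all_urls>"] },
    ["responseHeaders"]
  );
}
```

Keeping the header parsing in a plain function means it can be exercised without a browser; only the final block touches the extension API.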
Google News links are not resolved by making network requests. The HTML of news.google.com maintains a mapping (hidden in a script tag) between the article ID and the corresponding source. To extract this mapping for an article, we would need to build a regex from the article ID and scan the document's innerHTML for matches of an https URL group around the article ID. This could get messy. Moreover, even after extracting the mapping, we wouldn't be able to tell whether the article is in the current viewport. The following regex shows an example: [...]
It looks like Google News both stores the resolved link in the page (for the sharing pane) and resolves the link with an HTTP(S) request (when clicked). Weird. Given how difficult it is to extract the resolved links from the page, I think we should stick with HTTP(S) resolution for now.
Edit: here are my notes on the Google News URL shim from Slack on 11/23.
@PranayAnchuri poked at the https://news.google.com/article/[blob] link shim a bit today. It’s pretty messy… it’s a base64url encoded protocol buffer, with terminal characters dropped. Sometimes the format is easy to decode (e.g., field 1: 19, field 4: [URL], optional field 26: [AMP URL], or field 1: 32, field 4: [YouTube video ID]). I haven’t been able to figure out how to decode a number of URLs, though (e.g., field 1: 2, then who knows). If you’re interested, the GCHQ CyberChef tool is convenient for this (just set the workflow to base64 decode with the URL safe option, then protobuf decode).
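For anyone who would rather script the decoding than use CyberChef, here is a rough sketch of the same two steps: base64url decoding with the dropped padding restored, then a walk over the protocol buffer wire format. It is a generic wire-format reader and knows nothing about the Google News schema, so it won't help with the blobs that don't decode cleanly. The function names are made up for illustration, and it assumes a runtime with a global atob (browsers, or a recent Node.js).

```javascript
// Decode base64url text whose trailing "=" padding has been dropped.
function base64UrlToBytes(text) {
  const base64 = text.replace(/-/g, "+").replace(/_/g, "/");
  const padded = base64 + "=".repeat((4 - (base64.length % 4)) % 4);
  const binary = atob(padded);
  return Uint8Array.from(binary, (ch) => ch.charCodeAt(0));
}

// Read one base-128 varint starting at offset.
// Multiplication is used instead of bit shifts to avoid 32-bit overflow.
function readVarint(bytes, offset) {
  let value = 0;
  let shift = 0;
  let pos = offset;
  while (pos < bytes.length) {
    const byte = bytes[pos++];
    value += (byte & 0x7f) * 2 ** shift;
    if ((byte & 0x80) === 0) {
      return { value, next: pos };
    }
    shift += 7;
  }
  throw new Error("truncated varint");
}

// Walk the top-level fields of a protobuf message without a schema.
// Varint fields yield numbers; all other supported fields yield raw bytes.
function decodeFields(bytes) {
  const fields = [];
  let pos = 0;
  while (pos < bytes.length) {
    const tag = readVarint(bytes, pos);
    const fieldNumber = Math.floor(tag.value / 8);
    const wireType = tag.value % 8;
    pos = tag.next;
    let value;
    if (wireType === 0) {
      const v = readVarint(bytes, pos);
      value = v.value;
      pos = v.next;
    } else if (wireType === 2) {
      const len = readVarint(bytes, pos);
      value = bytes.slice(len.next, len.next + len.value);
      pos = len.next + len.value;
    } else if (wireType === 1 || wireType === 5) {
      const size = wireType === 1 ? 8 : 4; // fixed64 or fixed32
      value = bytes.slice(pos, pos + size);
      pos += size;
    } else {
      throw new Error("unsupported wire type " + wireType);
    }
    if (pos > bytes.length) {
      throw new Error("truncated field " + fieldNumber);
    }
    fields.push({ fieldNumber, wireType, value });
  }
  return fields;
}
```

For the easy cases described above, the field 4 bytes can be turned into text with `new TextDecoder().decode(value)`.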
For now, we should probably just resolve these links the same way as link shorteners. One extra wrinkle, though… if you hit a news.google.com link shim with a browser User-Agent header, you get an HTML document that does some tracking and a JS redirect, rather than an HTTP Location response header. If you hit the link shim with a non-browser User-Agent, though, you get an HTTP Location response header. So we’ll have to do User-Agent spoofing when resolving news.google.com link shims.
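One possible way to do the User-Agent spoofing, sketched under several assumptions: the header is rewritten in a blocking webRequest.onBeforeSendHeaders listener, only requests that are not tied to a tab (tabId of -1, as background requests are) get rewritten, and the replacement value is an empty string. None of those choices is confirmed by the discussion above, and the names are hypothetical.

```javascript
// Hypothetical replacement value; an empty string is an assumption here.
const RESOLVER_USER_AGENT = "";

// Return a copy of a webRequest-style header list ([{ name, value }]) with
// any existing User-Agent entries removed and the given value added.
function withUserAgent(requestHeaders, userAgent) {
  const headers = requestHeaders.filter(
    (h) => h.name.toLowerCase() !== "user-agent"
  );
  headers.push({ name: "User-Agent", value: userAgent });
  return headers;
}

// Register the listener only when running inside a WebExtension.
// Assumes the webRequest and webRequestBlocking permissions are granted.
if (typeof browser !== "undefined" && browser.webRequest) {
  browser.webRequest.onBeforeSendHeaders.addListener(
    (details) => {
      // Leave the user's own page loads alone; only touch requests that
      // are not associated with a tab.
      if (details.tabId !== -1) {
        return {};
      }
      return {
        requestHeaders: withUserAgent(
          details.requestHeaders,
          RESOLVER_USER_AGENT
        ),
      };
    },
    { urls: ["*://news.google.com/*"] },
    ["blocking", "requestHeaders"]
  );
}
```

Filtering on tabId is a coarse way to avoid changing what the user sees in normal browsing; a real implementation would probably want to match the specific requests the resolver issued.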
Done with all the issues mentioned above.
Feedback from an initial code review:
- [...] debugLog line has the wrong module name.
- [...] isResolving variable doesn’t seem to be necessary.
- [...] initialized and initialize work in the PageEvents module.
- [...] onRequest event listener, since the same information is available in the onRedirect / onHeadersReceived event listener.
- [...] onResponse event listener, since it isn’t really accurate: the absence of an HTTP(S) redirect doesn’t tell us anything about whether a resource has an HTML or JS redirect.
- [...] manual redirect option). Fully loading all the shortened/shimmed links that the user is exposed to is not a good idea. Instead, we should try to limit the HTTP(S) traffic to looking up redirects.
- [...] onHeadersReceived listener to resolve shortened links. See #42.
- [...] User-Agent header to empty when using the Fetch API, since some shorteners (e.g., Google and Twitter) use JavaScript redirects rather than HTTP redirects for recognizable browsers.
- [...] <all_urls>.