citp / news-disinformation-study

A research project on how web users consume, are exposed to, and share news online.

Improvements to the LinkResolution module #36

Closed jonathanmayer closed 4 years ago

jonathanmayer commented 4 years ago

Feedback from an initial code review:

jonathanmayer commented 4 years ago

Notes from discussion with @PranayAnchuri this morning:

jonathanmayer commented 4 years ago

Update: back to using the Fetch API with redirect set to manual, plus a webRequest.onHeadersReceived listener. Details in #42.

PranayAnchuri commented 4 years ago

Google News links are not resolved by making network requests. The HTML of news.google.com maintains a mapping (hidden in a script tag) between the article ID and the corresponding source. To extract this mapping for an article, we would need to build a regex from the article ID and scan the document's innerHTML for an https URL group near that ID. This could get messy. Moreover, even after extracting the mapping, we wouldn't be able to tell whether the article is in the current viewport. The following regex shows an example:

https://regexr.com/4rf57
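For anyone following along, here is a rough sketch of that idea in Python (outside the extension, just to show the shape of the regex). The function name `find_source_url`, the `window` limit, and the sample markup in the usage note are all made up for illustration; the real layout of the data in the Google News script tag is not documented here and may differ.

```python
import re


def find_source_url(page_html, article_id, window=500):
    """Return the first http(s) URL found shortly after article_id, or None.

    Assumption: the source URL sits within `window` characters after the
    article ID in the page markup. That is a guess, not a documented format.
    """
    pattern = re.compile(
        # Escape the ID so characters such as "-" or "_" are matched literally,
        # then lazily skip up to `window` characters before the URL group.
        re.escape(article_id) + r'.{0,%d}?(https?://[^\s"\'\\<>]+)' % window,
        re.DOTALL,
    )
    match = pattern.search(page_html)
    return match.group(1) if match else None
```

With a hypothetical snippet such as `'<script>data = [["CBMiABC123", null, "Title", "https://example.com/story"]];</script>'`, calling `find_source_url(snippet, "CBMiABC123")` picks out the URL that follows the ID. It says nothing about whether the article is visible in the viewport, which is the limitation noted above.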

jonathanmayer commented 4 years ago

It looks like Google News both stores the resolved link in the page (for the sharing pane) and resolves the link with an HTTP(S) request (when clicked). Weird. Given how difficult it is to extract the resolved links from the page, I think we should stick with HTTP(S) resolution for now.

Edit: here are my notes on the Google News URL shim from Slack on 11/23.

@PranayAnchuri poked at the https://news.google.com/article/[blob] link shim a bit today. It’s pretty messy… it’s a base64url encoded protocol buffer, with terminal characters dropped. Sometimes the format is easy to decode (e.g., field 1: 19, field 4: [URL], optional field 26: [AMP URL], or field 1: 32, field 4: [YouTube video ID]). I haven’t been able to figure out how to decode a number of URLs, though (e.g., field 1: 2, then who knows). If you’re interested, the GCHQ CyberChef tool is convenient for this (just set the workflow to base64 decode with the URL safe option, then protobuf decode).
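The CyberChef workflow above (URL-safe base64 decode, then protobuf decode) can be approximated in a few lines of Python. This is a minimal sketch, assuming the dropped terminal characters are the `=` padding and that only the top-level fields matter; the function names are illustrative, and it does not attempt the blobs that could not be decoded.

```python
import base64


def decode_shim_blob(blob):
    """Restore the dropped '=' padding (an assumption) and base64url-decode."""
    padded = blob + "=" * (-len(blob) % 4)
    return base64.urlsafe_b64decode(padded)


def read_varint(data, pos):
    """Read one protobuf varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def parse_top_level_fields(data):
    """Return a list of (field_number, value) pairs without any schema.

    Varint fields come back as ints; length-delimited and fixed-width
    fields come back as raw bytes.
    """
    fields = []
    pos = 0
    while pos < len(data):
        key, pos = read_varint(data, pos)
        field_number, wire_type = key >> 3, key & 0x07
        if wire_type == 0:  # varint
            value, pos = read_varint(data, pos)
        elif wire_type == 2:  # length-delimited (strings, bytes, nested messages)
            length, pos = read_varint(data, pos)
            value = data[pos:pos + length]
            pos += length
        elif wire_type == 1:  # 64-bit fixed
            value = data[pos:pos + 8]
            pos += 8
        elif wire_type == 5:  # 32-bit fixed
            value = data[pos:pos + 4]
            pos += 4
        else:
            raise ValueError("unsupported wire type %d" % wire_type)
        fields.append((field_number, value))
    return fields
```

For a blob in the easy format, `parse_top_level_fields(decode_shim_blob(blob))` should yield a pair for field 1 (the integer) and a pair for field 4 (the URL as bytes).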

For now, we should probably just resolve these links the same way as link shorteners. One extra wrinkle, though… if you hit a news.google.com link shim with a browser User-Agent header, you get an HTML document that does some tracking and a JS redirect, rather than an HTTP Location response header. If you hit the link shim with a non-browser User-Agent, though, you get an HTTP Location response header. So we’ll have to do User-Agent spoofing when resolving news.google.com link shims.
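To make the User-Agent behavior easy to reproduce outside the extension, here is a small Python sketch that issues a single request without following redirects and reports the Location header, if any. The function name `resolve_location` and the default `curl/7.68.0` User-Agent string are assumptions for illustration; any non-browser value may work, and this is not how the extension itself would set the header.

```python
import http.client
from urllib.parse import urlsplit

# Assumption: any non-browser User-Agent triggers the HTTP redirect.
NON_BROWSER_UA = "curl/7.68.0"


def resolve_location(url, user_agent=NON_BROWSER_UA, timeout=10.0):
    """Make one GET request and return the Location header of a 3xx response.

    http.client never follows redirects on its own, so a 3xx response is
    returned to us directly. Returns None when the response is not a redirect
    (for example, the HTML page with the JS redirect).
    """
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("GET", path, headers={"User-Agent": user_agent})
        response = conn.getresponse()
        if 300 <= response.status < 400:
            return response.getheader("Location")
        return None
    finally:
        conn.close()
```

Calling `resolve_location` on a link shim with the default User-Agent should return the target URL if the behavior described above still holds, while passing a browser-like `user_agent` should return `None` because the response is an HTML document rather than a redirect.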

PranayAnchuri commented 4 years ago

Done with all the issues mentioned above.