citp / news-disinformation-study

A research project on how web users consume, are exposed to, and share news online.

Update the docs for the collected data #73

Closed Dexterp37 closed 3 years ago

Dexterp37 commented 3 years ago

Public documentation is a requirement for the Mozilla Data Collection & Review process.

This PR adds narrative around the collected data and renames instances of "telemetry" to either data collection or Ion Platform (because this is not really telemetry :) ).

akohlbre commented 3 years ago

Thanks for doing all this -- I hadn't realized we needed it!

I have a new version of the schema awaiting approval over at https://github.com/mozilla-services/mozilla-pipeline-schemas/pull/635. It doesn't change much from what you have here, just adds three fields:

- visThreshold: the amount of time the exposed link was visible, as a histogram bucket
- prevExposed: the number of shared links which the user had previously seen online
- source: for a reshared link on Facebook, whether the reshared post came from a page or a person

A small thing throughout: I used the word "domain" in a lot of places (referrerDomain, etc), but that's not 100% accurate, since we also include part of the path in a few very specific scenarios, such as when the URL in question is the official social media page of a news organization. For example, the URL https://www.facebook.com/nytimes/posts/10152495110569999 would get reported under facebook.com/nytimes (most other URLs would be shortened to just facebook.com). I'm not sure whether it's necessary to include that detail in the docs, but wanted to clarify since my naming gives the wrong impression.
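The reporting rule described above can be sketched roughly as follows. This is only an illustration, not the study's actual code: the `NEWS_ORG_PAGES` allowlist, the `reporting_key` function name, and the choice to strip a leading `www.` are all assumptions made for the example.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist of (host, first path segment) pairs that identify
# the official social media page of a news organization.
NEWS_ORG_PAGES = {("facebook.com", "nytimes")}


def reporting_key(url: str) -> str:
    """Return the "domain" under which a URL would be reported.

    Most URLs are shortened to the bare host. When the first path segment
    marks a known news organization's page, that segment is kept as well.
    """
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    # Assumption: a leading "www." is dropped, as in the example above.
    if host.startswith("www."):
        host = host[len("www."):]

    # Non-empty path segments, e.g. ["nytimes", "posts", "10152495110569999"].
    segments = [s for s in parts.path.split("/") if s]
    if segments and (host, segments[0].lower()) in NEWS_ORG_PAGES:
        return f"{host}/{segments[0].lower()}"
    return host
```

With this sketch, the nytimes post URL from the example maps to `facebook.com/nytimes`, while a Facebook URL whose first path segment is not on the allowlist maps to `facebook.com`.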

Dexterp37 commented 3 years ago

> Thanks for doing all this -- I hadn't realized we needed it!

No worries :) Happy to help!

> I have a new version of the schema awaiting approval over at mozilla-services/mozilla-pipeline-schemas#635. It doesn't change much from what you have here, just adds three fields:
>
> - visThreshold: the amount of time the exposed link was visible, as a histogram bucket
> - prevExposed: the number of shared links which the user had previously seen online
> - source: for a reshared link on Facebook, whether the reshared post came from a page or a person

I updated the docs to add these.

> A small thing throughout: I used the word "domain" in a lot of places (referrerDomain, etc), but that's not 100% accurate, since we also include part of the path in a few very specific scenarios, such as when the URL in question is the official social media page of a news organization. For example, the URL https://www.facebook.com/nytimes/posts/10152495110569999 would get reported under facebook.com/nytimes (most other URLs would be shortened to just facebook.com). I'm not sure whether it's necessary to include that detail in the docs, but wanted to clarify since my naming gives the wrong impression.

Please feel free to recommend changes on the docs as you see fit (or even create follow-up PRs!).

Let me know if this looks good overall, and I'm not misinterpreting things :-)

akohlbre commented 3 years ago

> Please feel free to recommend changes on the docs as you see fit (or even create follow-up PRs!).

I'm tempted to leave it as-is, since "domain" does represent the approximate level of granularity at which we're collecting data, even when we include part of the path (and the cases where we do so are rare). There also isn't another word that encapsulates this idea of "only the domain, unless the beginning of the path tells us it's a news org's social media page". Is it ok for the docs to have something that's generally correct, but technically inaccurate like this?

> Let me know if this looks good overall, and I'm not misinterpreting things :-)

Yes, I took a quick read through and all looks reasonable!

Dexterp37 commented 3 years ago

Cool, thanks @akohlbre! Feel free to r+ so that we can merge :-D (or merge away!)