
🤖 Flag and Detect Fake News with the help of AI
https://fakenewsdetector.org/

Define verifiable criteria of "good" news article #11

Open rogeriochaves opened 6 years ago

rogeriochaves commented 6 years ago

From @pauloricardomg on April 1, 2018 22:45

Fake news sources could claim that the fake-news-detector is arbitrary if their site is flagged as fake or extremely biased.

To avoid this, the project should define objective, verifiable criteria for what makes a "good" news article, and classify each submitted article against them. Sources where the vast majority of articles meet the "good article" criteria can be tagged as potentially trustworthy, and sources where the majority of articles fail them can be tagged as potentially untrustworthy.

This would help in cases like the one mentioned on https://github.com/fake-news-detector/extension/issues/21, where a particular article is not fake but is posted by a dubious source. In these cases, the tool could say that the article itself is not fake, but that the source is known not to follow the "good news" criteria - and point objectively to which articles from the same source failed the criteria.

Defining the "good news" criteria is not an easy task, but we could start by listing characteristics of articles published by legitimate sources. Another important aspect is that the criteria should be verifiable, so the verification can be performed programmatically and users can be pointed to exactly which criteria are not met on articles flagged as "potentially untrustworthy".

In order to get the conversation started, I propose the following criteria for defining a "good" news article:

| Criterion | Why it is important | How to verify |
| --- | --- | --- |
| Does the article have a date, or is it timeless? | Every news article should have a publishing date; otherwise it cannot be considered news. | Check whether the article has a timestamp on it. |
| Is the author clearly stated, and does the author exist? | Writers should be accountable for their articles, so articles without authors are generally not trustworthy. | Check whether the article names an author, and whether the author is a real person. |
| Does the article clearly state whether it is an opinion piece or news? | Fake news is often an opinion piece disguised as news. There is no problem with biased opinion articles, as long as the source clearly labels them as opinion rather than news. | Check for tags or headings stating whether the article is news or opinion. |
| Was the article updated without notice after publication? | A source that updates its content without notifying readers of the changes is not trustworthy. | Cache a fingerprint of the page on first access; on each subsequent access, check whether the fingerprint changed without notice. |
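To make the "How to verify" column concrete, here is a rough sketch of how two of the checks (publication date, and silent updates via fingerprinting) could look. These are illustrative heuristics I made up for this discussion, not a settled implementation:

```python
# Illustrative heuristics for two of the criteria above (not a final design).
import hashlib
import re

def has_publication_date(html: str) -> bool:
    """Criterion 1: look for common publication-date markers in the HTML."""
    patterns = [
        r'<meta[^>]+property=["\']article:published_time["\']',  # Open Graph
        r'<time[^>]+datetime=',                                  # HTML5 <time>
        r'"datePublished"',                                      # schema.org JSON-LD
    ]
    return any(re.search(p, html, re.IGNORECASE) for p in patterns)

def fingerprint(article_text: str) -> str:
    """Criterion 4: hash the article body so silent edits can be detected."""
    return hashlib.sha256(article_text.encode("utf-8")).hexdigest()

def changed_without_notice(old_fp: str, html: str, article_text: str) -> bool:
    """Flag an update if the fingerprint changed and no correction notice appears."""
    notice = re.search(r"(updated|corrected|correction)", html, re.IGNORECASE)
    return fingerprint(article_text) != old_fp and notice is None
```

Real pages vary a lot, so in practice each check would need many more markers and per-site tuning; the point is only that these criteria are mechanically checkable.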

These are just initial suggestions to get the conversation started, feedback is welcome. Perhaps we should look in the literature for other criteria.

Once we define the criteria that make a good article, we can create follow-up tasks for API endpoints that verify a URL against each (or all) of the criteria.
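One possible shape for the output of such endpoints: each article gets a per-criterion report, and the source-level tag is derived from the ratio of articles that pass. The criterion names and thresholds below are placeholders, not a settled design:

```python
# Hypothetical aggregation of per-article criterion reports into a source tag.
# Criterion names and thresholds are placeholders for discussion.
def classify_source(article_reports: list) -> str:
    """Tag a source based on how many of its articles meet all criteria."""
    if not article_reports:
        return "unknown"
    good = sum(1 for report in article_reports if all(report.values()))
    ratio = good / len(article_reports)
    if ratio >= 0.9:   # vast majority of articles are "good"
        return "potentially trustworthy"
    if ratio < 0.5:    # majority of articles fail the criteria
        return "potentially untrustworthy"
    return "inconclusive"

reports = [
    {"has_date": True, "has_author": True, "labelled_opinion": True},
    {"has_date": True, "has_author": False, "labelled_opinion": True},
]
```

A report like this also gives users exactly what the issue asks for: a pointer to which criterion failed on which article, rather than an opaque verdict.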

Copied from original issue: fake-news-detector/api#30

rogeriochaves commented 6 years ago

Okay, so, first, thank you for bringing this discussion, this is great :D

About what I commented on extension#21, my idea of adding domains to a hardcoded list is not to consider every publication from that domain as fake or trustworthy, but more as an indication to the user that the domain in question is usually reliable or not. We could show the info about the domain separately from the evaluation of the content, like Fakebox does:

[screenshot: Fakebox showing the domain rating separately from the content rating]

This would make our intention clearer. For now, I'm thinking of changing the current ✅ legitimate (verified) to ✅ reliable domain - just a change of words, but one that makes all the difference. What do you think?

> Fake news sources could claim that the fake-news-detector is arbitrary if their site is flagged as fake or extremely biased.

Yes, and I don't see that as a problem as long as we always make it clear that we are making guesses, not pointing fingers, by using machine learning algorithms which can get it wrong (and will, in a lot of cases). If this is not clear to people, it is probably a design or communication issue on the website or extension that we have to fix.

It is true, though, that we should try to explain how the algorithms came to those guesses. For example, right now we are using a bag-of-words approach, which means that something is classified as fake news if the relevant words used in it are similar to the relevant words in articles previously classified as fake. So one thing I want to use to clarify the guess is Lime, which was designed exactly for that!

But maybe, in the future, if we use something fancier like a deep neural net (which I'll definitely try soon), it might be impossible to explain how the algorithm came to its conclusion in a pretty way. So I believe it is okay to not always be clear about how the guess is made, as long as we make it clear that it is just that: a guess - an educated guess, learned from the opinions of the crowd using the extension.

On the other hand, you are right that the hardcoded list, and the verified content on the admin area, and everything else approved by the maintainers of the project should follow a guideline, so we should try to avoid expressing our bias there, and try to be more impartial.

I never wanted the hardcoded list to grow very much, partly to avoid polemic. I wanted to put there only things that are very, very obvious (e.g. theonion is satire, infowars is fake news - those are not opinions, those are facts, it's on Wikipedia), mainly due to popularity. Having popular websites there may save the robot and the humans some work, because popular sites are the ones whose articles get shared the most.

The other reason not to focus on the hardcoded list is that other extensions already take this approach (B.S. Detector and Le Monde's Decodex), and we could just trust their work and incorporate their lists into ours (with credits, of course) - I'd like that.

Finally, a hardcoded list doesn't scale. The power I want Fake News Detector to have is precisely the ability to generalize well and quickly classify breaking news from never-heard-of sources.
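If we do incorporate third-party lists, the merge itself is trivial; the one thing worth getting right is keeping the source of each rating so credits survive. The list formats below are invented for illustration - I haven't checked how B.S. Detector or Decodex actually publish their data:

```python
# Hypothetical merge of third-party domain lists into one lookup,
# keeping the source of each rating for credit. Formats are invented.
def merge_domain_lists(*lists: dict) -> dict:
    """Merge {domain: {"rating": ..., "source": ...}} dicts; earlier lists win on conflict."""
    merged = {}
    for entries in lists:
        for domain, info in entries.items():
            merged.setdefault(domain, info)
    return merged

our_list = {"theonion.com": {"rating": "satire", "source": "fake-news-detector"}}
external = {
    "theonion.com": {"rating": "satire", "source": "B.S. Detector"},
    "infowars.com": {"rating": "fake news", "source": "B.S. Detector"},
}
domains = merge_domain_lists(our_list, external)
```

Giving our own list precedence means a maintainer review can always override an imported rating without editing the imported data.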

About the criteria you exemplified - "does it have a date?", "who is the author?", "is it explicit that it's an opinion?" - those are great criteria! But not for the hardcoded list, I'd say. They are actually great for the regular extension flow, where users say whether the content has those characteristics, and then with machine learning we create a model that tries to check those things automatically.

I was already thinking about that. The same way I added a separate question for users to answer whether something is clickbait or not, my idea is to add those other questions there: "Does it have a date?", etc.

[screenshot: the extension's flagging UI with the separate clickbait question]

Because although some of those checks are easy to automate with plain code (like the date check), others are very hard (like whether it states clearly that it is an opinion or not), we can build an ML model that learns how to detect those things, and then use it to improve the fake-news-or-not detection.

Now back to the hardcoded list, in my mind the criteria would be more like:

What I mean is, it should be very hard to earn "reliable" - you have to prove yourself all the time - but telling lies on purpose even once in a while is enough to be marked unreliable.

I'm not sure if I really captured your intention with this issue; I probably overexplained myself trying to synchronize our ideas, sorry hahahah. But I hope we are on the same page now.

Thoughts?