rogeriochaves opened 6 years ago
Okay, so, first, thank you for bringing this discussion, this is great :D
About what I commented on extension#21: my idea of adding domains to a hardcoded list is not to consider every news article from that domain fake or trustworthy, but rather to give the user an indication that the domain in question is usually reliable or not. We could show the info about the domain separately from the evaluation of the content, like Fakebox does:
This would make our intention clearer. For now, I'm thinking of changing the current "legitimate (verified)" label to "reliable domain". It's just a change of words, but it makes all the difference. What do you think?
"Fake news sources could claim that the fake-news-detector is arbitrary if their site is flagged as fake or extremely biased."
Yes, and I don't see that as a problem as long as we always make it clear that we are making guesses, not pointing fingers: we use machine learning algorithms that can get it wrong (and will, in a lot of cases). If this is not clear to people, that is probably a design or communication issue on the website or extension that we have to fix.
It is true, though, that we should try to explain how the algorithms arrived at those guesses. For example, right now we are using a bag-of-words approach, which means something is classified as fake news if the relevant words in it are similar to relevant words in other news that was previously classified. So one thing I want to use to clarify each guess is Lime, which was designed exactly for that!
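To illustrate the bag-of-words idea in its simplest form, here is a minimal, stdlib-only sketch (not the project's actual classifier, which uses a trained model): an article gets the label of the previously classified article whose word counts overlap with it the most. All training strings below are made up for the example.

```python
from collections import Counter

def bag_of_words(text):
    # Lowercase and split on whitespace; a real pipeline would also
    # strip punctuation and stopwords and weight words (e.g. TF-IDF).
    return Counter(text.lower().split())

def overlap(a, b):
    # Similarity = total count of words the two bags share.
    return sum((a & b).values())

def classify(article, labeled_examples):
    # Give the article the label of its most word-similar labeled example.
    words = bag_of_words(article)
    best = max(labeled_examples,
               key=lambda pair: overlap(words, bag_of_words(pair[0])))
    return best[1]

# Toy "previously classified" corpus, invented for illustration only.
labeled = [
    ("miracle cure doctors hate this one weird trick", "fake"),
    ("government publishes official budget report today", "legitimate"),
]

print(classify("one weird trick cures everything doctors hate", labeled))
```

Because the decision is just word overlap, a tool like Lime can highlight exactly which words pulled the guess toward "fake", which is what makes this approach explainable.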
But maybe, in the future, if we use something fancier like a deep neural net (which I'll definitely try soon), it might be impossible to explain how the algorithm came to a conclusion in a pretty way. So I believe it is okay not to always explain how the guess is made, as long as we make it clear that it is just that: a guess, an educated guess, learned from the opinions of the crowd using the extension.
On the other hand, you are right that the hardcoded list, the verified content in the admin area, and everything else approved by the maintainers of the project should follow a guideline, so we should try to avoid expressing our bias there and be more impartial.
I never wanted the hardcoded list to grow very much, partly to avoid controversy. I wanted to put only things there that are very, very obvious (e.g., theonion is satire and infowars is fake news; those are not opinions, those are facts, it's on Wikipedia), chosen mainly for popularity. Having popular websites there may save the robot and the humans some work, because their articles will be the ones shared the most.
The other reason not to focus on the hardcoded list is that other extensions already take this approach (B.S. Detector and Le Monde's Decodex), and we could simply trust their work and incorporate their lists into ours (with credit, of course). I'd like that.
Finally, a hardcoded list doesn't scale. The power I want Fake News Detector to have is precisely the ability to generalize well and quickly classify breaking news from never-heard-of sources.
The criteria you gave as examples ("does it have a date?", "who is the author?", "is it explicit that it's an opinion?") are great! But not for the hardcoded list, I'd say; they actually belong in the regular extension flow, where users say whether the content has those characteristics, and then with machine learning we build a model that tries to check those things automatically.
I was already thinking about that: the same way I added a separate question for users to answer whether something is clickbait or not, my idea is to add those other questions there ("Does it have a date?", etc.).
Although some of those checks are easy to automate with plain code (like the date check), others are very hard (like whether it states clearly that it is an opinion or not), so we can build an ML model that learns how to detect those things, and then use it to improve the fake-news detection itself.
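As an example of the "easy to automate with plain code" end of the spectrum, a date check can be a handful of regular expressions over the article text. This is only a sketch with a few common formats; a real check would also look at article metadata (e.g. `<time>` tags or Open Graph properties):

```python
import re

# A few common publication-date formats, e.g. "2018-04-01",
# "01/04/2018", or "April 1, 2018". Deliberately incomplete.
DATE_PATTERNS = [
    r"\b\d{4}-\d{2}-\d{2}\b",
    r"\b\d{1,2}/\d{1,2}/\d{4}\b",
    r"\b(January|February|March|April|May|June|July|August|September"
    r"|October|November|December)\s+\d{1,2},\s+\d{4}\b",
]

def has_date(article_text):
    # True if any recognizable publication-date pattern appears in the text.
    return any(re.search(p, article_text) for p in DATE_PATTERNS)

print(has_date("Published on April 1, 2018 by staff"))  # True
print(has_date("No dates anywhere in this text"))       # False
```

There is no equivalently simple regex for "is this clearly marked as an opinion piece?", which is exactly why that kind of check is a better fit for a learned model.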
Now back to the hardcoded list, in my mind the criteria would be more like:
What I mean is: it should be very hard to be considered reliable (you have to prove yourself all the time), but telling lies on purpose even once in a while is enough to be considered unreliable.
I'm not sure if I really captured your intention with this issue; I probably overexplained myself trying to synchronize our ideas, sorry hahahah, but I hope we are on the same page now.
Thoughts?
From @pauloricardomg on April 1, 2018 22:45
Fake news sources could claim that the fake-news-detector is arbitrary if their site is flagged as fake or extremely biased. In order to avoid this, the project should define objective and verifiable criteria for what makes a "good" news article, and classify each submitted article according to those criteria. Sources with the vast majority of their articles matching the "good article" criteria can be tagged as "potentially trustworthy", and sources with the majority of their articles not matching the criteria can be tagged as "potentially untrustworthy".
This would help in cases like the one mentioned on https://github.com/fake-news-detector/extension/issues/21, where a particular article is not fake but is posted by a dubious source. In these cases, the tool could say the article itself is not fake, but the source is known not to follow the "good news" criteria, and objectively point out in which articles from the same source the criteria were not followed.
Defining the "good news" criteria is not an easy task, but we could start by listing characteristics of articles published by legitimate sources. Another important aspect is that the criteria should be verifiable, so that the verification can be performed programmatically and users can be pointed to exactly which criteria are not being followed in articles flagged as "potentially untrustworthy".
In order to get the conversation started, I propose the following criteria for defining a "good" news article:
These are just initial suggestions; feedback is welcome. Perhaps we should also look in the literature for other criteria.
Once we define the criteria for a good article, we can create follow-up tasks for API endpoints that verify a URL against each (or all) of the criteria.
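One way such an endpoint could be structured (a hypothetical sketch, not the project's actual API): each criterion is a small function over the fetched article text, and the verifier runs all of them and reports which pass, so the user can be pointed to exactly which criteria failed. The criterion names and the naive checks below are invented for illustration:

```python
import re

def has_date(text):
    # Naive check: any ISO-style date in the text.
    return bool(re.search(r"\b\d{4}-\d{2}-\d{2}\b", text))

def has_author(text):
    # Naive byline check; a real implementation would parse article metadata.
    return "by " in text.lower()

# Registry of criteria; new criteria just get added here.
CRITERIA = {
    "has_date": has_date,
    "has_author": has_author,
}

def verify_article(text):
    # Run every criterion and report which ones the article satisfies.
    # An HTTP endpoint would fetch the URL's text first, then call this.
    return {name: check(text) for name, check in CRITERIA.items()}

report = verify_article("By Jane Doe, published 2018-04-01. Full story follows.")
print(report)
```

Keeping each criterion as an independent function also means an endpoint can expose them individually ("check this URL against criterion X") or all at once, matching the "each (or all)" idea above.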
Copied from original issue: fake-news-detector/api#30