McMasterAI / CoviDash


COVID-19 Statement/Fact Checker #2

Open cczarnuch opened 4 years ago

jaymody commented 4 years ago

Goal

From preliminary discussions, it seems the general idea is we want to fact check COVID-19 related statements and claims, for example:

Results From Breaking Chloroquine Study Show 100% Cure Rate For Patients Infected With The Coronavirus.

COVID-19 related fake news is very diverse. This application should perhaps ONLY focus on fact checking the academic and scientific aspects. For example, the following claim is about the coronavirus, but doesn't concern health-related facts being studied in the literature:

U.S. President Donald Trump will benefit financially if hydroxychloroquine becomes an established treatment for COVID-19.

Maybe we should discuss this more, as this might be too niche an application. Especially given that COVID literature search engines already exist, the unique aspect of this project would be that it accepts statements instead of questions, and verifies those statements rather than answering a question. It could be argued that this is worse, since there is the potential to misverify a claim, whereas if the user searches a question, they add that extra interpretation step themselves.

For now, we'll define our goal as the following: Input: A text-based claim about COVID-19 that pertains to its effects/cures/behaviour, etc. In general, the claim should be something that can be verified or referenced by literature or studies.

Output: Either:

Data

This is probably the biggest gap in this process.

Where do we get COVID-19 related claims data? How do we ensure it fits the criteria of our goal (i.e. how do we make sure it's not like the second example)? How do we evaluate our solution afterwards? Some of these answers will probably become clearer as we refine our goal definition.

Initial Implementation Ideation

Naive Sequence Classification Approach

A naive approach would be to simply train a classifier that runs directly on claims. This has the advantage of giving our solution access to far more data, since it opens up the use of general fake news datasets. However, given the limited amount of annotated COVID news data, this might generalize poorly to our application. In addition, we would be unable to provide an explanation for the outputs, and we would fail to point to the relevant literature that proves, disproves, or speaks to the claim.
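For concreteness, here's a minimal sketch of what the naive approach could look like with an off-the-shelf huggingface pipeline. The checkpoint below is a generic sentiment model used purely as a placeholder; a real version would be fine-tuned on fake news / misinformation labels.

```python
# Hedged sketch of the naive claim-only classifier. The checkpoint is a
# generic placeholder; in practice we would fine-tune on fake news data.
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",  # placeholder checkpoint
)

claim = (
    "Results from breaking chloroquine study show 100% cure rate "
    "for patients infected with the coronavirus."
)
print(classifier(claim))  # e.g. [{'label': ..., 'score': ...}]
```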

Information Retrieval Approach (Sequence Pair Classification/Stance Detection)

The other approach would be to create an IR (information retrieval) system for COVID-19 literature and studies. We can use the retrieved documents as inputs to our model's decisions. This also allows us to phrase the problem as a stance detection problem instead of a simple sequence classification problem. For example, if we assume the documents in the IR corpus are all reliable, we can simply check whether relevant documents agree, disagree, or are neutral about a given claim (agree means the claim is true, disagree means it's false, neutral means it's undetermined). In this setup, we can also point to relevant literature and studies as an explanation for the model's decision.
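As a rough sketch of how this could fit together, assuming hypothetical `search_cord19` and `stance_model` functions that would be backed by an IR engine (e.g. covidex) and a sequence-pair classifier (neither exists yet):

```python
# Sketch of the IR + stance detection pipeline. `search_cord19` and
# `stance_model` are hypothetical callables passed in by the caller.
from collections import Counter

def fact_check(claim, search_cord19, stance_model, top_k=5):
    """Retrieve relevant documents and aggregate their stances on a claim."""
    documents = search_cord19(claim, top_k=top_k)  # list of (title, passage)
    stances = []
    for title, passage in documents:
        # stance_model returns one of "agree", "disagree", "neutral"
        stances.append((title, stance_model(claim, passage)))

    counts = Counter(stance for _, stance in stances)
    if counts["agree"] > counts["disagree"]:
        verdict = "likely true"
    elif counts["disagree"] > counts["agree"]:
        verdict = "likely false"
    else:
        verdict = "undetermined"
    # Return the supporting documents alongside the verdict as an explanation.
    return {"claim": claim, "verdict": verdict, "evidence": stances}
```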

The IR Engine

Luckily for us, an open-source search engine over the COVID-19 literature (the CORD-19 dataset) already exists: the covidex search engine. It already utilizes many SOTA techniques for IR, so I doubt we could improve upon this engine (or even get our own search engine working in a month).

I believe covidex also surfaces the most relevant passages from the retrieved documents, so we shouldn't need to do that ourselves.

We'll need to learn what kinds of queries work best on covidex. For example, if rephrasing claims as questions is beneficial for the search engine (since most of the example queries on the website are questions, not statements), we might be able to make use of question generation to get better queries.
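A very hedged sketch of that query formulation step: rephrase the claim as a question with a text2text model, then send it to a search endpoint. The model choice, prompt format, and endpoint URL below are all assumptions for illustration, not the actual covidex API.

```python
# Assumption-heavy sketch of query formulation. The model, prompt format,
# and endpoint are placeholders; the real covidex interface would need to
# be confirmed first.
import requests
from transformers import pipeline

question_generator = pipeline("text2text-generation", model="t5-small")  # placeholder model

def claim_to_query(claim: str) -> str:
    # Rephrase the statement as a question; this prompt format is an assumption
    # and would likely need a model fine-tuned for question generation.
    outputs = question_generator(f"generate question: {claim}", max_length=64)
    return outputs[0]["generated_text"]

def search(query: str, endpoint: str = "https://example.org/api/search") -> dict:
    # Placeholder endpoint, not the real covidex API.
    response = requests.get(endpoint, params={"query": query})
    response.raise_for_status()
    return response.json()
```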

Datasets for Stance Detection

For stance detection, here are some relevant datasets:

Models

The SOTA for sequence pair classification is Transformer-based models, so we can make use of PyTorch here via the huggingface transformers library.
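A minimal sketch of sequence pair classification with transformers, using an off-the-shelf NLI checkpoint (roberta-large-mnli) as a stand-in for a stance detection model actually fine-tuned on claim/evidence pairs; the claim and evidence strings are made-up examples.

```python
# Sketch of sequence-pair (stance-like) classification. roberta-large-mnli
# is an off-the-shelf NLI model standing in for a properly fine-tuned
# stance detection model; its labels are contradiction/neutral/entailment.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "roberta-large-mnli"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

claim = "Hydroxychloroquine is a proven cure for COVID-19."
evidence = "The trial found no significant benefit of hydroxychloroquine over placebo."

# NLI convention is (premise, hypothesis), so the evidence passage goes first.
inputs = tokenizer(evidence, claim, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
print(model.config.id2label[logits.argmax(dim=-1).item()])
```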

Considerations

What we know about COVID-19 is constantly changing. We need to take into account that many studies don't make definitive claims, and that current knowledge is always subject to change. This also means that when we fact check a claim, our model should consider the publication date of any relevant information when making a decision.
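One hedged way this could be handled is to exponentially down-weight evidence by its age when aggregating stances; the half-life and threshold below are arbitrary assumptions for illustration.

```python
# Sketch of recency-weighted stance aggregation. Half-life and threshold
# values are arbitrary assumptions, not tuned choices.
from datetime import date

def recency_weight(published: date, today: date, half_life_days: float = 90.0) -> float:
    """Exponentially decay the weight of a document as it ages."""
    age_days = max((today - published).days, 0)
    return 0.5 ** (age_days / half_life_days)

def weighted_verdict(stances, today=None):
    """stances: list of (stance, published_date), stance in {"agree", "disagree"}."""
    today = today or date.today()
    score = 0.0
    for stance, published in stances:
        weight = recency_weight(published, today)
        score += weight if stance == "agree" else -weight
    if abs(score) < 0.5:  # arbitrary cutoff for "undetermined"
        return "undetermined"
    return "likely true" if score > 0 else "likely false"
```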

jaymody commented 4 years ago

Tasks

Assuming we take the IR + stance detection approach, here is a possible list of stories (leaving these here instead of making them issues, since we should first clear up the actual goals/feasibility of the project before moving forward; plus, we may want to put this in a separate repo):

For first milestone

Low Priority

Misc

jaymody commented 4 years ago

Position the product as an automated fact-checking tool rather than a search engine for statements (i.e. a social media corona fact checking dashboard).

Possible features:

Milestones

  1. input text --> output json (see the sketch after this list)
  2. website search box
  3. static, predone
  4. dynamically pull tweets/social media data for the dashboard
  5. turn into a chrome extension
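
As a starting point for milestone 1, a possible shape for the JSON output might look like the sketch below; the field names and verdict values are assumptions, not a settled schema.

```python
# Hypothetical milestone 1 interface: text in, JSON out. Field names and
# verdict values are assumptions, not a settled schema.
import json

def check_claim(text: str) -> str:
    result = {
        "claim": text,
        "verdict": "undetermined",  # e.g. "likely true" / "likely false" / "undetermined"
        "evidence": [],             # e.g. {"title": ..., "url": ..., "stance": ..., "published": ...}
    }
    return json.dumps(result, indent=2)

print(check_claim("Chloroquine cures COVID-19."))
```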