McMasterAI / CoviDash


COVID-19 Statement/Fact Checker #2

Open cczarnuch opened 4 years ago

jaymody commented 4 years ago

Goal

From preliminary discussions, it seems the general idea is we want to fact check COVID-19 related statements and claims, for example:

Results From Breaking Chloroquine Study Show 100% Cure Rate For Patients Infected With The Coronavirus.

COVID-19 related fake news is very diverse. This application should perhaps ONLY focus on fact checking the academic and scientific aspects. For example, the following claim is about the coronavirus, but doesn't concern health-related facts being studied in the literature:

U.S. President Donald Trump will benefit financially if hydroxychloroquine becomes an established treatment for COVID-19.

Maybe we should discuss this more, as this might be too niche an application. Especially given that COVID literature search engines already exist, the unique aspect of this project would be that it accepts statements instead of questions, and verifies those statements rather than answering a question. It could be argued that this is worse, since there is the potential to misverify a claim, whereas if the user searches a question, they add that extra interpretation step themselves.

For now, we'll define our goal as the following: Input: A text-based claim about COVID-19 that pertains to its effects/cures/behaviour, etc. In general, the claim should be something that can be verified or referenced by literature or studies.

Output: Either:

Data

This is probably the biggest gap in this process.

Where do we get COVID-19 related claims data? How do we ensure it fits the criteria of our goal (i.e. how do we make sure it's not like the second example)? How do we evaluate our solution afterwards? Some of these answers will probably become clearer as we refine our goal definition.

Initial Implementation Ideation

Naive Sequence Classification Approach

A naive approach would be to simply train a classifier that runs directly on claims. This has the advantage of giving our solution access to far more data, since it opens up the use of general fake news datasets. However, given the limited amount of annotated COVID news data, this might generalize poorly to our application. In addition, we would be unable to provide an explanation for the outputs, and we would fail to point to the relevant literature that proves, disproves, or speaks to the claim.
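For concreteness, here's a minimal sketch of what the naive approach could look like with an off-the-shelf huggingface pipeline. The checkpoint below is a generic sentiment model used purely as a placeholder; a real version would be fine-tuned on fake news / misinformation labels.

```python
# Hedged sketch of the naive claim-only classifier. The checkpoint is a
# generic placeholder; in practice we would fine-tune on fake news data.
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",  # placeholder checkpoint
)

claim = (
    "Results from breaking chloroquine study show 100% cure rate "
    "for patients infected with the coronavirus."
)
print(classifier(claim))  # e.g. [{'label': ..., 'score': ...}]
```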

Information Retrieval Approach (Sequence Pair Classification/Stance Detection)

The other approach would be to create an IR (information retrieval) system for COVID-19 literature and studies. We can use the retrieved documents as inputs to our model's decisions. This also allows us to phrase the problem as a stance detection problem instead of a simple sequence classification problem. For example, if we assume the documents in the IR corpus are all reliable, we can simply check whether relevant documents agree, disagree, or are neutral about a given claim (agree means the claim is true, disagree means it's false, neutral means it's undetermined). In this setup, we can also point to relevant literature and studies as an explanation for the model's decision.
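As a rough sketch of how this could fit together, assuming hypothetical `search_cord19` and `stance_model` functions that would be backed by an IR engine (e.g. covidex) and a sequence-pair classifier (neither exists yet):

```python
# Sketch of the IR + stance detection pipeline. `search_cord19` and
# `stance_model` are hypothetical callables passed in by the caller.
from collections import Counter

def fact_check(claim, search_cord19, stance_model, top_k=5):
    """Retrieve relevant documents and aggregate their stances on a claim."""
    documents = search_cord19(claim, top_k=top_k)  # list of (title, passage)
    stances = []
    for title, passage in documents:
        # stance_model returns one of "agree", "disagree", "neutral"
        stances.append((title, stance_model(claim, passage)))

    counts = Counter(stance for _, stance in stances)
    if counts["agree"] > counts["disagree"]:
        verdict = "likely true"
    elif counts["disagree"] > counts["agree"]:
        verdict = "likely false"
    else:
        verdict = "undetermined"
    # Return the supporting documents alongside the verdict as an explanation.
    return {"claim": claim, "verdict": verdict, "evidence": stances}
```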

The IR Engine

Luckily for us, an open-source search engine over the COVID-19 literature (the CORD-19 dataset) already exists: the covidex search engine. It already utilizes many SOTA techniques for IR, so I doubt we could improve upon this engine (or even get our own search engine working in a month).

I believe covidex also surfaces the most relevant passages from the retrieved documents, so we shouldn't need to do that ourselves.

We'll need to learn what kinds of queries work best on covidex. For example, if rephrasing claims as questions is beneficial for the search engine (since most of the example queries on the website are questions, not statements), we might be able to make use of question generation to get better queries.
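A very hedged sketch of that query formulation step: rephrase the claim as a question with a text2text model, then send it to a search endpoint. The model choice, prompt format, and endpoint URL below are all assumptions for illustration, not the actual covidex API.

```python
# Assumption-heavy sketch of query formulation. The model, prompt format,
# and endpoint are placeholders; the real covidex interface would need to
# be confirmed first.
import requests
from transformers import pipeline

question_generator = pipeline("text2text-generation", model="t5-small")  # placeholder model

def claim_to_query(claim: str) -> str:
    # Rephrase the statement as a question; this prompt format is an assumption
    # and would likely need a model fine-tuned for question generation.
    outputs = question_generator(f"generate question: {claim}", max_length=64)
    return outputs[0]["generated_text"]

def search(query: str, endpoint: str = "https://example.org/api/search") -> dict:
    # Placeholder endpoint, not the real covidex API.
    response = requests.get(endpoint, params={"query": query})
    response.raise_for_status()
    return response.json()
```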

Datasets for Stance Detection

For stance detection, here are some relevant datasets:

Models

The SOTA for sequence pair classification is Transformer-based models, so we can make use of PyTorch here via the huggingface transformers library.
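A minimal sketch of sequence pair classification with transformers, using an off-the-shelf NLI checkpoint (roberta-large-mnli) as a stand-in for a stance detection model actually fine-tuned on claim/evidence pairs; the claim and evidence strings are made-up examples.

```python
# Sketch of sequence-pair (stance-like) classification. roberta-large-mnli
# is an off-the-shelf NLI model standing in for a properly fine-tuned
# stance detection model; its labels are contradiction/neutral/entailment.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "roberta-large-mnli"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

claim = "Hydroxychloroquine is a proven cure for COVID-19."
evidence = "The trial found no significant benefit of hydroxychloroquine over placebo."

# NLI convention is (premise, hypothesis), so the evidence passage goes first.
inputs = tokenizer(evidence, claim, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
print(model.config.id2label[logits.argmax(dim=-1).item()])
```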

Considerations

What we know about COVID-19 is constantly changing. We need to take into account that many studies don't make definitive claims, and that current knowledge is always subject to change. This also means that when we fact check a claim, our model should consider the publication date of any relevant information when making a decision.
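One hedged way this could be handled is to exponentially down-weight evidence by its age when aggregating stances; the half-life and threshold below are arbitrary assumptions for illustration.

```python
# Sketch of recency-weighted stance aggregation. Half-life and threshold
# values are arbitrary assumptions, not tuned choices.
from datetime import date

def recency_weight(published: date, today: date, half_life_days: float = 90.0) -> float:
    """Exponentially decay the weight of a document as it ages."""
    age_days = max((today - published).days, 0)
    return 0.5 ** (age_days / half_life_days)

def weighted_verdict(stances, today=None):
    """stances: list of (stance, published_date), stance in {"agree", "disagree"}."""
    today = today or date.today()
    score = 0.0
    for stance, published in stances:
        weight = recency_weight(published, today)
        score += weight if stance == "agree" else -weight
    if abs(score) < 0.5:  # arbitrary cutoff for "undetermined"
        return "undetermined"
    return "likely true" if score > 0 else "likely false"
```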

jaymody commented 4 years ago

Tasks

Assuming we take the IR + stance detection approach, here is a possible list of stories (leaving these here instead of making them issues, since we should first clear up the actual goals/feasibility of the project before moving forward; plus, we may want to put this in a separate repo):

For first milestone

Low Priority

Misc

jaymody commented 4 years ago

Position the product as an automated fact-checking tool rather than a search engine for statements (i.e. a social media corona fact checking dashboard).

Possible features:

Milestones

  1. input text --> output json (see the sketch after this list)
  2. website search box
  3. static, predone
  4. dynamically pull tweets/social media data for the dashboard
  5. turn into a chrome extension
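
As a starting point for milestone 1, a possible shape for the JSON output might look like the sketch below; the field names and verdict values are assumptions, not a settled schema.

```python
# Hypothetical milestone 1 interface: text in, JSON out. Field names and
# verdict values are assumptions, not a settled schema.
import json

def check_claim(text: str) -> str:
    result = {
        "claim": text,
        "verdict": "undetermined",  # e.g. "likely true" / "likely false" / "undetermined"
        "evidence": [],             # e.g. {"title": ..., "url": ..., "stance": ..., "published": ...}
    }
    return json.dumps(result, indent=2)

print(check_claim("Chloroquine cures COVID-19."))
```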