DebanjanaKar / Covid19_FakeNews_Detection

IBM Hackathon

TathyaCov : Detecting Fake Tweets in the times of COVID 19

DEMO VIDEO: https://youtu.be/pdWoBxBu9-k

This repository contains the implementation of the paper "No Rumours Please! A Multi-Indic-Lingual Approach for Covid Fake-Tweet Detection", which was accepted at GHCI 2020 in the original research track. The system classifies, in real time, whether a tweet contains a verifiable claim, and has been trained specifically to detect COVID-19-related fake news. We use AI-based techniques to process the tweet text and combine it with user features to classify each tweet as either REAL or FAKE. We handle tweets in three languages: English, Hindi and Bengali.

[Figure: flowchart]

Structure:

Each folder comes with a detailed README on how to run its scripts.

The following sections give a brief overview of the dataset and the methods used in our work.

Dataset:

We create the Indic-covidemic tweet dataset and use it for training and testing. We take the English tweets from the Infodemic dataset and scrape COVID-19-related Bengali and Hindi tweets from Twitter. Fresh annotations were added to create the larger Indic dataset for this task. The scraping and parsing tools built for this purpose may be helpful for mining further Indic data. We have published our annotated dataset for research purposes; it can be found here.
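When mining tweets in several languages, it can help to sort the scraped text by script before annotation. The snippet below is a minimal sketch of that idea, not the scraping or parsing tooling shipped in this repository: `guess_tweet_language` is a hypothetical helper that counts characters in the Devanagari and Bengali Unicode blocks and treats ASCII letters as English. It looks only at the script, so it cannot separate Hindi from other languages written in Devanagari.

```python
from collections import Counter

# Unicode blocks of the two Indic scripts covered by the dataset.
DEVANAGARI = range(0x0900, 0x0980)  # script used for Hindi
BENGALI = range(0x0980, 0x0A00)     # script used for Bengali


def guess_tweet_language(text: str) -> str:
    """Guess 'hi', 'bn' or 'en' from the dominant script in the text.

    Hypothetical helper for illustration only; returns 'unknown' when the
    text contains no letters from any of the three scripts.
    """
    counts = Counter()
    for ch in text:
        code = ord(ch)
        if code in DEVANAGARI:
            counts["hi"] += 1
        elif code in BENGALI:
            counts["bn"] += 1
        elif ch.isascii() and ch.isalpha():
            counts["en"] += 1
    if not counts:
        return "unknown"
    # The most frequent script decides the label.
    return counts.most_common(1)[0][0]


if __name__ == "__main__":
    for sample in ["Stay home, stay safe", "कोरोना वायरस", "করোনা ভাইরাস"]:
        print(sample, "->", guess_tweet_language(sample))
```

Tweets that mix scripts (for example an English hashtag inside a Hindi tweet) are assigned to whichever script contributes the most characters.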

Method:

We experimented with two models for tweet classification. In the first setting, a mono-lingual model handles English tweets. We then extend the concept by replacing the classifier with a multi-lingual one, which currently handles tweets in English, Hindi and Bengali. The essence of our approach lies in the features used for classification, the different classifiers, and the adaptations made to them for identifying fake tweets.
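One way to picture the move from a mono-lingual to a multi-lingual setting is to keep the feature-building step fixed and make the text encoder pluggable. The sketch below illustrates that design under stated assumptions: `toy_encoder`, `build_features` and the two example user features are hypothetical names invented for illustration, and the toy encoder only stands in for whatever language model actually encodes the tweet text.

```python
from typing import Callable, List, Sequence

# Any function that maps tweet text to a fixed-length numeric vector.
Encoder = Callable[[str], List[float]]


def toy_encoder(text: str) -> List[float]:
    """Stand-in for a real text encoder (hypothetical).

    Returns two crude statistics: word count and number of links.
    """
    words = text.split()
    return [float(len(words)), float(sum(w.startswith("http") for w in words))]


def build_features(text: str, user_features: Sequence[float], encoder: Encoder) -> List[float]:
    """Concatenate the text representation with the user features.

    Swapping `encoder` (mono-lingual vs multi-lingual) leaves the rest of
    the pipeline unchanged.
    """
    return encoder(text) + list(user_features)


if __name__ == "__main__":
    # Hypothetical user features: follower count and a verified flag.
    vector = build_features("breaking news http://x.y", [120.0, 1.0], toy_encoder)
    print(vector)
```

The resulting vector can be passed to any downstream classifier; which classifier and which features work best is an empirical question addressed by the experiments in the paper.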

The architecture of the classifier is shown below.

[Figure: mono_ar]

We use various textual and user-related features for the classification task, as follows:

<p align="center">
  <img width="450" alt="mono_features" src="https://github.com/DebanjanaKar/Covid19_FakeNews_Detection/blob/master/images/correlation.png">
</p>

The correlation plot shows that a subset of the user features and tweet features can be helpful. We experimented with different classifiers; their results are shown below.
<p align="center">
   <img width="350" alt="mono_result" src="https://github.com/DebanjanaKar/Covid19_FakeNews_Detection/blob/master/images/mono_results.png">
    <img width="350" alt="multi_result" src="https://github.com/DebanjanaKar/Covid19_FakeNews_Detection/blob/master/images/multi_results.png">
</p>
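The correlation-based screening of features mentioned above can be sketched in a few lines. This is an illustration only: it assumes Pearson correlation against a 0/1 label, the `threshold` value is arbitrary, and the column names and numbers are made-up toy data, not values from the dataset or the plot.

```python
import math
from typing import Dict, Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient; returns 0.0 for a constant input."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0
    return cov / (spread_x * spread_y)


def select_features(columns: Dict[str, Sequence[float]],
                    labels: Sequence[float],
                    threshold: float = 0.3) -> Dict[str, float]:
    """Keep features whose absolute correlation with the label meets the threshold."""
    scores = {name: pearson(values, labels) for name, values in columns.items()}
    return {name: r for name, r in scores.items() if abs(r) >= threshold}


if __name__ == "__main__":
    # Toy data for illustration only (1 = FAKE, 0 = REAL).
    labels = [1, 1, 0, 0]
    columns = {
        "feature_a": [1, 1, 0, 0],  # moves with the label
        "feature_b": [5, 5, 5, 5],  # constant, carries no signal
        "feature_c": [0, 0, 1, 1],  # moves against the label
    }
    print(select_features(columns, labels))
```

Correlation with the label is only a first filter; features that look weak on their own can still help a classifier in combination with others.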

Graphical User Interface (GUI):

We design a simple static HTML page that takes a tweet ID or URL as user input and detects whether the tweet is real or fake. Although our monolingual English classifier gave the best performance, even beating the state of the art (SOTA), we choose the multi-lingual classifier for its wider applicability. Some snapshots of our demo are shown below:

[Screenshots: gui_hindi, gui_bengali, gui_english]
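Because the page accepts either a tweet ID or a tweet URL, the input has to be normalised to an ID before the tweet can be looked up. The following is a minimal sketch of that step, assuming the usual `twitter.com/<user>/status/<id>` URL layout; `extract_tweet_id` is a hypothetical helper and is not taken from this repository's code.

```python
import re
from typing import Optional
from urllib.parse import urlparse

# Hosts accepted as tweet URLs in this sketch (an assumption).
TWITTER_HOSTS = {"twitter.com", "www.twitter.com", "mobile.twitter.com"}


def extract_tweet_id(user_input: str) -> Optional[str]:
    """Return the numeric tweet ID from a bare ID or a tweet URL, else None."""
    value = user_input.strip()
    if re.fullmatch(r"[0-9]+", value):
        return value  # already a bare ID
    try:
        # Tolerate URLs pasted without a scheme.
        parsed = urlparse(value if "://" in value else "https://" + value)
    except ValueError:
        return None
    host = (parsed.hostname or "").lower()
    if host not in TWITTER_HOSTS:
        return None
    match = re.search(r"/status(?:es)?/([0-9]+)", parsed.path)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(extract_tweet_id("https://twitter.com/someuser/status/1316000000000000000?s=20"))
    print(extract_tweet_id("1316000000000000000"))
```

Returning `None` for anything that does not look like a tweet lets the page show an error message instead of sending bad input to the classifier.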

FLASK API:

The GUI is hosted on an IBM server (http://pca.sl.cloud9.ibm.com:1999/), which is accessible within the IBM domain.
process.py is a working script that hosts the GUI on localhost. It can easily be modified to host the demo on any other server.
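One possible way to make the host and port configurable when moving the demo to another server is to read them from environment variables. The sketch below is an assumption about how this could be done, not a description of what process.py does: `resolve_bind_address` and the `DEMO_HOST` / `DEMO_PORT` variable names are hypothetical. The defaults mirror Flask's development server, which listens on 127.0.0.1:5000.

```python
import os
from typing import Mapping, Optional, Tuple


def resolve_bind_address(env: Optional[Mapping[str, str]] = None,
                         default_host: str = "127.0.0.1",
                         default_port: int = 5000) -> Tuple[str, int]:
    """Read the bind address from DEMO_HOST / DEMO_PORT (hypothetical names)."""
    env = os.environ if env is None else env
    host = env.get("DEMO_HOST", default_host)
    port = int(env.get("DEMO_PORT", default_port))
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    return host, port


if __name__ == "__main__":
    host, port = resolve_bind_address()
    # In a Flask script the values would be passed on as:
    #     app.run(host=host, port=port)
    print(f"would serve on http://{host}:{port}/")
```

Setting `DEMO_HOST=0.0.0.0` makes the server reachable from other machines, so it should only be used on a network where that is intended.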

Citation:

If you find our work useful, please cite it as:

@misc{kar2020rumours,
      title={No Rumours Please! A Multi-Indic-Lingual Approach for COVID Fake-Tweet Detection}, 
      author={Debanjana Kar and Mohit Bhardwaj and Suranjana Samanta and Amar Prakash Azad},
      year={2020},
      eprint={2010.06906},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}