KaiDMML / FakeNewsNet

This is a dataset for fake news detection research

FakeNewsNet

We will never ask for money to share the datasets. If someone claims to have all the raw data and asks for payment, please be careful.

We released a tool, FakeNewsTracker, for collecting, analyzing, and visualizing fake news and its dissemination on social media. Check it out!

The latest dataset paper, with a detailed analysis of the dataset, can be found at FakeNewsNet.

Please use the current, up-to-date version of the dataset.

The previous version of the dataset is available in the branch named old-version of this repository.

Overview

The complete dataset cannot be distributed because of Twitter privacy policies and news publisher copyrights. Social engagements and user information are not disclosed because of Twitter policy. This code repository can be used to download news articles from the publishers' websites and the relevant social media data from Twitter.

The minimal version of the latest dataset provided in this repo (located in the dataset folder) includes the following files:

Each of the above CSV files is comma-separated and has the following columns:
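As a rough illustration of reading one of these CSV files, the sketch below relies only on the tweet_ids attribute that is referenced later in this document. The function name read_news_csv is hypothetical, and the assumption that tweet_ids holds whitespace-separated IDs should be checked against the actual files.

```python
import csv
from pathlib import Path

# The tweet_ids field can be very long, so raise the csv field size limit.
csv.field_size_limit(10**9)


def read_news_csv(csv_path):
    """Return one dict per CSV row, with tweet_ids split into a list of strings."""
    rows = []
    with Path(csv_path).open(newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            # Assumption: tweet_ids is a whitespace-separated string of IDs.
            raw_ids = row.get("tweet_ids") or ""
            row["tweet_ids"] = raw_ids.split()
            rows.append(row)
    return rows
```

Every other column is returned unchanged, so the sketch does not depend on the exact column list.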

Installation

Requirements:

The data download scripts are written in Python and require Python 3.6+ to run.

Twitter API keys are used for collecting data from Twitter. Use the following link to get Twitter API keys:
https://developer.twitter.com/en/docs/basics/authentication/guides/access-tokens.html

The script reads the keys from the tweet_keys_file.json file located in the code/resources folder, so the API keys need to be updated in that file. Provide the keys as an array of JSON objects with the attributes app_key, app_secret, oauth_token, and oauth_token_secret, as shown in the sample file.

Install all the libraries in requirements.txt using the following command:
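A minimal sketch of producing a file in that shape is shown below. The helper names build_key_entries and write_keys_file are hypothetical and not part of the repository; only the four attribute names come from the description above, and the sample file in the repository remains the authoritative reference for the format.

```python
import json
from pathlib import Path

REQUIRED_ATTRIBUTES = ("app_key", "app_secret", "oauth_token", "oauth_token_secret")


def build_key_entries(credentials):
    """Validate credential dicts and return them as a list ready for JSON."""
    entries = []
    for index, credential in enumerate(credentials):
        missing = [name for name in REQUIRED_ATTRIBUTES if not credential.get(name)]
        if missing:
            raise ValueError(f"entry {index} is missing: {', '.join(missing)}")
        # Keep only the four documented attributes, in a fixed order.
        entries.append({name: credential[name] for name in REQUIRED_ATTRIBUTES})
    return entries


def write_keys_file(credentials, path):
    """Write the validated entries as a JSON array to the given path."""
    text = json.dumps(build_key_entries(credentials), indent=2)
    Path(path).write_text(text, encoding="utf-8")
```

Calling write_keys_file with a list of credential dicts and the path code/resources/tweet_keys_file.json would write one JSON object per set of keys.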

pip install -r requirements.txt

Configuration:

FakeNewsNet contains 2 datasets collected using ground truths from Politifact and Gossipcop.

The config.json file can be used to configure and collect only certain parts of the dataset. The following attributes can be configured:

Running Code

To collect the dataset quickly, the code makes use of process parallelism. To synchronize Twitter key limitations across multiple Python processes, a lightweight Flask application is used as a key management server. Execute the following command inside the code folder:

nohup python -m resource_server.app &> keys_server.out&

The above command will start the Flask server on port 5000 by default.

Complete the configuration before proceeding to the next step!

Execute the following command to start data collection:
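Before starting the data collection, it can help to confirm that something is listening on that port. The sketch below only tests whether a TCP connection succeeds; it makes no assumption about the server's routes, and the function name is_port_open is hypothetical.

```python
import socket


def is_port_open(host="127.0.0.1", port=5000, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

If is_port_open() returns False, the keys_server.out file is the first place to look for startup errors.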

nohup python main.py &> data_collection.out&

Logs are written in the same folder to a file named data_collection_<timestamp>.log and can be used for debugging purposes.

The dataset will be downloaded to the directory provided in config.json, and progress can be monitored in the data_collection.out file.

Dataset Structure

The downloaded dataset will have the following folder structure:

├── gossipcop
│   ├── fake
│   │   ├── gossipcop-1
│   │   │   ├── news content.json
│   │   │   ├── tweets
│   │   │   │   ├── 886941526458347521.json
│   │   │   │   ├── 887096424105627648.json
│   │   │   │   └── ....
│   │   │   └── retweets
│   │   │       ├── 886941526458347521.json
│   │   │       ├── 887096424105627648.json
│   │   │       └── ....
│   │   └── ....
│   └── real
│       ├── gossipcop-1
│       │   ├── news content.json
│       │   ├── tweets
│       │   └── retweets
│       └── ....
├── politifact
│   ├── fake
│   │   ├── politifact-1
│   │   │   ├── news content.json
│   │   │   ├── tweets
│   │   │   └── retweets
│   │   └── ....
│   └── real
│       ├── politifact-2
│       │   ├── news content.json
│       │   ├── tweets
│       │   └── retweets
│       └── ....
├── user_profiles
│   ├── 374136824.json
│   ├── 937649414600101889.json
│   └── ....
├── user_timeline_tweets
│   ├── 374136824.json
│   ├── 937649414600101889.json
│   └── ....
├── user_followers
│   ├── 374136824.json
│   ├── 937649414600101889.json
│   └── ....
└── user_following
    ├── 374136824.json
    ├── 937649414600101889.json
    └── ....
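
The layout above can be traversed with a few lines of Python. The following is a sketch under the assumption that the download directory matches the tree exactly; the function names are hypothetical, and the folder names are taken from the structure shown.

```python
from pathlib import Path

SOURCES = ("gossipcop", "politifact")
LABELS = ("fake", "real")


def count_json_files(folder):
    """Count the .json files directly inside a folder (0 if it does not exist)."""
    return sum(1 for _ in folder.glob("*.json")) if folder.is_dir() else 0


def count_news_items(dataset_root):
    """Count news folders and downloaded tweet files per (source, label) pair."""
    root = Path(dataset_root)
    summary = {}
    for source in SOURCES:
        for label in LABELS:
            label_dir = root / source / label
            if not label_dir.is_dir():
                continue  # this part of the dataset was not collected
            news_dirs = [path for path in label_dir.iterdir() if path.is_dir()]
            tweets = sum(count_json_files(news_dir / "tweets") for news_dir in news_dirs)
            summary[(source, label)] = {"news": len(news_dirs), "tweets": tweets}
    return summary
```

A summary like this gives a quick check on how much of the dataset has been collected so far.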

News Content

news content.json: This file contains the meta information of the news article, collected using the provided news source URL. It is a JSON object with attributes including:

Social Context

tweets folder: This folder contains all tweets related to the news sample, that is, the tweet objects of all the tweet IDs provided in the tweet_ids attribute of the dataset CSV. All the files in this folder are named <tweet_id>.json. Each <tweet_id>.json file is a JSON file in the format described at https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/tweet-object.html.

retweets folder: This folder contains the retweets of all tweets that share a particular news article. Files are named <tweet_id>.json, and each contains an array of the retweets of that tweet. Each object in the retweet array has the format described at https://developer.twitter.com/en/docs/tweets/post-and-engage/api-reference/get-statuses-retweets-id.

user_profiles folder: This folder contains the profiles of all users posting tweets related to any of the news articles. The same folder is used for both data sources (Politifact and GossipCop). It contains files named <user_id>.json, each in the JSON format described at https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/user-object.html.

user_timeline_tweets folder: This folder contains files representing the tweet timelines of users posting tweets related to fake and real news. All files in the folder are named <user_id>.json and hold a JSON array of up to 200 recent tweets of the user. The files follow the format described at https://developer.twitter.com/en/docs/tweets/timelines/api-reference/get-statuses-user_timeline.html.

user_followers folder: This folder contains the follower IDs of all users posting tweets related to any of the news articles. The same folder is used for both data sources (Politifact and GossipCop). It contains files named <user_id>.json, each holding JSON data with user_id and followers attributes.

user_following folder: This folder contains the following IDs of all users posting tweets related to any of the news articles. The same folder is used for both data sources (Politifact and GossipCop). It contains files named <user_id>.json, each holding JSON data with user_id and following attributes.
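
As a sketch of how these folders might be read, the helpers below load the tweets and retweets of one news sample and the follower and following lists of one user. The function names are hypothetical; the folder names and the followers and following attributes come from the descriptions above, and the content of each file is returned as parsed JSON without further assumptions about its fields.

```python
import json
from pathlib import Path


def load_json_folder(folder):
    """Map each file stem (the ID in <id>.json) to its parsed JSON content."""
    folder = Path(folder)
    if not folder.is_dir():
        return {}
    return {
        path.stem: json.loads(path.read_text(encoding="utf-8"))
        for path in sorted(folder.glob("*.json"))
    }


def load_news_social_context(news_dir):
    """Load the tweets and retweets folders of a single news sample."""
    news_dir = Path(news_dir)
    return {
        "tweets": load_json_folder(news_dir / "tweets"),
        "retweets": load_json_folder(news_dir / "retweets"),
    }


def load_user_links(dataset_root, user_id):
    """Return (followers, following) for a user; missing files give empty lists."""
    root = Path(dataset_root)
    links = []
    for folder, attribute in (("user_followers", "followers"),
                              ("user_following", "following")):
        path = root / folder / f"{user_id}.json"
        if path.is_file():
            data = json.loads(path.read_text(encoding="utf-8"))
            links.append(data.get(attribute, []))
        else:
            links.append([])
    return tuple(links)
```

Keys in the returned dictionaries are strings (the file name without .json), which avoids precision issues with large tweet IDs.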

References

If you use this dataset, please cite the following papers:

@article{shu2018fakenewsnet,
  title={FakeNewsNet: A Data Repository with News Content, Social Context and Dynamic Information for Studying Fake News on Social Media},
  author={Shu, Kai and  Mahudeswaran, Deepak and Wang, Suhang and Lee, Dongwon and Liu, Huan},
  journal={arXiv preprint arXiv:1809.01286},
  year={2018}
}
@article{shu2017fake,
  title={Fake News Detection on Social Media: A Data Mining Perspective},
  author={Shu, Kai and Sliva, Amy and Wang, Suhang and Tang, Jiliang and Liu, Huan},
  journal={ACM SIGKDD Explorations Newsletter},
  volume={19},
  number={1},
  pages={22--36},
  year={2017},
  publisher={ACM}
}
@article{shu2017exploiting,
  title={Exploiting Tri-Relationship for Fake News Detection},
  author={Shu, Kai and Wang, Suhang and Liu, Huan},
  journal={arXiv preprint arXiv:1712.07709},
  year={2017}
}

(C) 2019 Arizona Board of Regents on Behalf of ASU