This project constitutes the web scraping component of ClaimsKG. It crawls fact-checking sites (mostly drawn from https://www.poynter.org/international-fact-checking-network-fact-checkers-code-principles, which lists reliable fact-checking sites) and generates a CSV file with a dump of the extracted information.
This project is a fork of https://github.com/vwoloszyn/fake_news_extractor that has been refactored and repurposed for the specific needs of ClaimsKG. Most of the original extractors for English-language fact-checking sites have been reimplemented under a new architecture and are the only functional ones (see the list below). Although the original extractors for Portuguese- and German-language sites are still present, they have not yet been integrated; please refer to the original implementation if you need to use them.
See the ClaimsKG dataset website for statistics: https://data.gesis.org/claimskg/site
See an overview with example files on the wiki: https://github.com/claimskg/claimskg-extractor/wiki
This version of the extractor does not annotate descriptions and claims with entities on its own; entity annotations are added to the CSV files in a subsequent step with TagMe (see the tagme fork in the ClaimsKG project group).
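The annotation step itself is handled by that fork; purely as an illustration of what it does, here is a minimal sketch of calling the public TagMe REST endpoint directly with the requests package (the token value is a placeholder you would need to obtain from d4science.org; this sketch is not the project's actual annotation code):

# Illustrative sketch only: the actual pipeline uses the tagme fork mentioned above.
import requests

TAGME_ENDPOINT = "https://tagme.d4science.org/tagme/tag"
TOKEN = "YOUR-D4SCIENCE-TOKEN"  # placeholder: obtain a gcube token from d4science.org

def annotate(text, lang="en"):
    # Ask TagMe to spot entity mentions in the text and link them to Wikipedia.
    response = requests.get(
        TAGME_ENDPOINT,
        params={"gcube-token": TOKEN, "text": text, "lang": lang},
    )
    response.raise_for_status()
    return response.json().get("annotations", [])

# Each annotation carries the matched span ("spot"), the linked Wikipedia
# page ("title") and a confidence score ("rho").
for annotation in annotate("The Eiffel Tower is in Paris."):
    print(annotation.get("spot"), "->", annotation.get("title"), annotation.get("rho"))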
Given the varied rating schemes used by the fact-checking websites, whose individual labels are often hard to apply or interpret objectively, we apply a simple normalized rating scheme consisting of four basic categories that can be mapped in a consensual way to all existing rating schemes: TRUE, FALSE, MIXTURE, OTHER. Full correspondence tables are provided here: https://goo.gl/Ykus98
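As a toy illustration of what such a normalization amounts to (the label-to-category pairs below are examples only; the correspondence tables linked above are the authoritative source):

# Toy illustration of rating normalization; see the correspondence tables
# linked above for the actual, authoritative mappings.
NORMALIZED_RATINGS = {
    "true": "TRUE",
    "mostly true": "MIXTURE",
    "half true": "MIXTURE",
    "mostly false": "MIXTURE",
    "false": "FALSE",
    "pants on fire!": "FALSE",
}

def normalize_rating(original_label):
    # Labels without a consensual mapping fall back to OTHER.
    return NORMALIZED_RATINGS.get(original_label.strip().lower(), "OTHER")

print(normalize_rating("Mostly True"))  # MIXTURE
print(normalize_rating("Unproven"))     # OTHER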
This reimplementation runs on Python 3.5+. Redis is used to cache HTTP queries, both so that extractions can resume quickly after a failure and to speed up iterative development of new extractors; please make sure a Redis instance (with default parameters) is running on the machine that runs the extractor (a sketch of the caching pattern follows the install command below). Package dependencies for pip are listed in the "requirements.txt" file; run the following command to install them:
pip install -r requirements.txt
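For reference, the caching pattern boils down to keying cached responses by URL; below is a minimal sketch using the redis and requests packages (the key layout and the absence of an expiry are assumptions for illustration, not necessarily what the extractor itself does):

# Illustrative sketch of URL-keyed HTTP caching with Redis; the key layout is
# an assumption, not necessarily the extractor's actual scheme.
import redis
import requests

cache = redis.Redis()  # default parameters: localhost:6379, database 0

def cached_get(url):
    # Serve the page body from Redis if this URL was already fetched.
    cached = cache.get(url)
    if cached is not None:
        return cached.decode("utf-8")
    body = requests.get(url).text
    cache.set(url, body)  # no expiry, so interrupted extractions can resume later
    return body

html = cached_get("https://fullfact.org/")  # a second call is served from the cache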
Claims are exported to a CSV file named "output_got.csv". Example invocations:

Show the available command-line options:

python Exporter.py -h

Extract claims only from the listed websites:

python Exporter.py --website fullfact,snopes

Limit the number of claims to extract:

python Exporter.py --maxclaims 30
If you wish to remove the cache entries for a particular site, you can use the following command, where SITENAME should be replaced with the site's name as listed above:
redis-cli --raw keys "http://*SITENAME*" | xargs redis-cli --raw del
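Note that KEYS scans the entire keyspace and blocks Redis while doing so; on a large cache, the incremental SCAN-based variant below does the same job without blocking (SITENAME is replaced as above):

redis-cli --scan --pattern "http://*SITENAME*" | xargs redis-cli del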
Some of the extractor implementations build on the following contributed projects:
https://github.com/eleutheromastrophimatique/ClaimsExtractor-/tree/master/claim_extractor/extractors
https://gitlab.info-ufr.univ-montp2.fr/e20160008449/TER-M1_FullFact/tree/master (fullfact)
https://github.com/ImaneLamriou/TER/tree/master/extractors
https://github.com/massykezzoul/claims-checking/tree/master/src/websites