daqcri / DAFNA-EA

Truth Discovery Models
10 stars 3 forks source link

DAFNA-EA

DAFNA-EA is a java library of truth discovery methods from the literature to evaluate the veracity of data claimed by multiple online sources.

The methods that have been implemented for the comparative study are the following: More detail can be found here.

Datasets

Real-world data sets are available here.

A dataset generator for truth discovery scenario can be donwladed here and description of the parameters are given here with a full documentation.

Citations

To cite DAFNA-EA in publications use:

For LaTeX users:

author    = {Dalia Attia Waguih and Laure Berti{-}Equille},
title     = {Truth Discovery Algorithms: An Experimental Evaluation},
journal   = {CoRR},
volume    = {abs/1409.6428},
year      = {2014},
url       = {http://arxiv.org/abs/1409.6428}}

To cite ensembling of truth discovery methods

For LaTeX users:

author    = {Laure Berti{-}Equille},
title     = {Data veracity estimation with ensembling truth discovery methods},
booktitle = {2015 {IEEE} International Conference on Big Data, Big Data 2015, Santa Clara, CA, USA, October 29 - November 1, 2015},
pages     = {2628--2636},
year      = {2015}}

For a survey:

For LaTeX users:

author    = {Laure Berti{-}Equille and Javier Borge{-}Holthoefer},
title     = {Veracity of Data: From Truth Discovery Computation Algorithms to Models of Misinformation Dynamics},
series    = {Synthesis Lectures on Data Management},
publisher = {Morgan {\&} Claypool Publishers},
year      = {2015}}

Tutorials

Two tutorials surveying truth discovery methods and the topic of data veracity are available here.

API

We have releasee an API so that users can test the truth discovery methods on their own. Documentation of the API is here

Demos

You can try the demos:

Build Instructions fro DAFNA-EA

Make sure you have installed Java 7 and Maven on your computer first. Before the first build you need to prepare some libraries in your local repository:

mvn org.apache.maven.plugins:maven-install-plugin:2.3.1:install-file -Dfile=lib/independance-vldb2010-solomon-cleaner.jar \
-DgroupId=com.att.research -DartifactId=solomon.cleaner -Dversion=0.0.1 -Dpackaging=jar -DlocalRepositoryPath=my-repo

mvn org.apache.maven.plugins:maven-install-plugin:2.3.1:install-file -Dfile=lib/simmetrics_jar_v1_6_2_d07_02_07.jar \
-DgroupId=uk.ac.shef.wit -DartifactId=simmetrics -Dversion=1.6.2 -Dpackaging=jar -DlocalRepositoryPath=my-repo

Building for AllegatorTrack

To build a jar containing all the algorithms ready to consume for AllegatorTrack UI and API:

mvn clean # cleans previously created jar
mvn package # builds everything

or simply

mvn clean package

This will build a jar located in target folder. Just copy it to AllegatorTrack:

cp target/DAFNA-EA-1.0-jar-with-dependencies.jar <AllegatorTrack-root>/vendor

Building from eclipse

Import the project as a Maven project then build it normally (recommended). Alternatively, import as Java project and set the classpath manually to include all dependancies listed in pom.xml.

Command line invocation

java -jar <JAR_PATH> <ALGORITHM_NAME> <DATASETS_CLAIMS_DIR> <DATASETS_GROUND_DIR> <OUTPUT_DIR> <ALGORITHM_PARAMS>

Where <JAR_PATH> points to the generated jar file in the build section. <ALGORITHM_NAME> stands for algorithm name, which can be one of the following: Cosine, 2-Estimates, 3-Estimates, Depen, Accu, AccuSim, AccuNoDep, TruthFinder, SimpleLCA, GuessLCA, MLE or LTM. <DATASETS_CLAIMS_DIR>, <DATASETS_GROUND_DIR> and <OUTPUT_DIR> point to directories where CSV claim, ground files and the directory where all output files should be generated, respectively.

<ALGORITHM_PARAMS> is a white-space separated values and are dependant on the algorithm selected. In all cases, general parameters come first followed by specific parameters. Details of parameters for each algorithms can be found here.

There are 3 possible patterns for the <ALGORITHM_PARAMS>:

Normal algorithm

4 General then specific parameters.

Combiner invocation

4 General parameters are all set to 0, followed by the number of algorithms to be combined. Next comes a number of file paths pointing to claim results generated from the corresponding algorithm before calling the combiner. Example:

java -jar <JAR_PATH> <ALGORITHM_NAME> <DATASETS_CLAIMS_DIR> <DATASETS_GROUND_DIR> <OUTPUT_DIR> 0 0 0 0 3 results1.csv results2.csv results3.csv

Allegator invocation

5 extra parameters are added at the end:

java -jar <JAR_PATH> <ALGORITHM_NAME> <DATASETS_CLAIMS_DIR> <DATASETS_GROUND_DIR> <OUTPUT_DIR> <ALGORITHM_PARAMS> <RUN_ID> <CLAIM_ID> <CLAIM_RESULTS_FILE> <SOURCE_TRUSTWORTHINESS_FILE> Allegate

Where <ALGORITHM_PARAMS> is the same as in normal invocation, <RUN_ID> and <CLAIM_ID> denote the run id and the claim id being allegated, respectively. These can be anything and are only used for convenience to generate meaningful file names in the output. <CLAIM_RESULTS_FILE> and <SOURCE_TRUSTWORTHINESS_FILE> point to results generatd by the run being allegated. Allegate should be put as is.