The String Similarity on Flink project from the Big Data Praktikum @ Uni Leipzig, SS 2016
Required parameters:
--process default
--inputCsv path/to/concept_attribute.csv
Description: imports concept_attribute.csv, keeps only the label attributes, maps each row to (entity id, value) and prints the result
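In plain Java (without the Flink API), the default process can be sketched roughly as below. The column order (entity id, property name, property value, property type) is taken from the concept_attributes.csv description in this README; filtering on the literal property name "label" and the sample data are illustrative assumptions, not the project's actual code:

```java
import java.util.List;
import java.util.stream.Collectors;

public class LabelExtraction {
    // Each row: entity id, property name, property value, property type.
    // Keep only "label" attributes and map each row to (entity id, value).
    static List<String[]> extractLabels(List<String[]> rows) {
        return rows.stream()
                .filter(r -> "label".equals(r[1]))   // keep label attributes only
                .map(r -> new String[]{r[0], r[2]})  // (entity id, property value)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String[]> rows = List.of(
                new String[]{"1", "label", "Leipzig", "string"},
                new String[]{"1", "population", "600000", "int"});
        for (String[] pair : extractLabels(rows)) {
            System.out.println(pair[0] + " -> " + pair[1]); // prints: 1 -> Leipzig
        }
    }
}
```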
Required parameters:
--process createCompareCsv
--inputCsv path/to/concept_attribute.csv
--outputCsv path/to/output.csv
--removeBrackets [true|false]
Description: creates the compare CSV (used as input for the calculateSimilarity step) from concept_attribute.csv. --removeBrackets controls whether bracketed parts of the attribute values are removed before the file is written.
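The exact behavior of --removeBrackets is not documented here; assuming it strips bracketed or parenthesized qualifiers from attribute values, a minimal sketch could look like this (method name and regex are illustrative):

```java
public class BracketRemoval {
    // Hypothetical reading of --removeBrackets: drop bracketed/parenthesized
    // qualifiers, e.g. "Leipzig (city)" becomes "Leipzig".
    static String removeBrackets(String value) {
        return value.replaceAll("\\s*[\\(\\[][^\\)\\]]*[\\)\\]]", "").trim();
    }

    public static void main(String[] args) {
        System.out.println(removeBrackets("Leipzig (city)")); // prints: Leipzig
        System.out.println(removeBrackets("Halle [Saale]"));  // prints: Halle
    }
}
```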
Required parameters:
--process calculateSimilarity
--inputCsv path/to/crossMerged.csv
--outputDir path/to/output/directory
--algorithms stringCompare,stringCompareNgram,flinkSortMerge,sortMerge,simmetrics
Selects which of the listed algorithms/techniques are used to calculate the string similarity. By default, all of them are executed.
Optional parameters:
--threshold X.XX
Only tuples with a dice similarity >= X.XX will be collected in the result dataset
--tokenizeDigits Y
Size of an n-gram. Y = 3 by default.
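As a rough illustration of how the n-gram size and --threshold interact, here is a self-contained dice-coefficient sketch over character trigrams. This is plain Java, not the project's actual implementation; the example strings are made up:

```java
import java.util.HashSet;
import java.util.Set;

public class DiceSimilarity {
    // Split a string into distinct character n-grams (n = 3 matches the default).
    static Set<String> ngrams(String s, int n) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + n <= s.length(); i++) {
            grams.add(s.substring(i, i + n));
        }
        return grams;
    }

    // Dice coefficient: 2 * |A intersect B| / (|A| + |B|).
    static double dice(String a, String b, int n) {
        Set<String> ga = ngrams(a, n);
        Set<String> gb = ngrams(b, n);
        if (ga.isEmpty() && gb.isEmpty()) return 1.0;
        Set<String> inter = new HashSet<>(ga);
        inter.retainAll(gb);
        return 2.0 * inter.size() / (ga.size() + gb.size());
    }

    public static void main(String[] args) {
        double sim = dice("Leipzig", "Leipzic", 3); // 4 shared of 5+5 trigrams -> 0.8
        System.out.println(sim);
        // With --threshold 0.5 this pair would be kept, since 0.8 >= 0.5.
    }
}
```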
Input CSV formats:
concept.csv columns: entity id, uri, source
concept_attributes.csv columns: entity id, property name, property value, property type
linksWithIDs.csv columns: source entity id, target entity id
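The three row layouts above could be modeled in plain Java as below; the comma delimiter, the record names, and the numeric id types are assumptions for illustration:

```java
public class CsvSchemas {
    // Holders mirroring the documented column layouts.
    record Concept(long id, String uri, String source) {}
    record ConceptAttribute(long id, String name, String value, String type) {}
    record Link(long sourceId, long targetId) {}

    // Parse one linksWithIDs.csv row: source entity id, target entity id.
    static Link parseLink(String line) {
        String[] f = line.split(",");
        return new Link(Long.parseLong(f[0].trim()), Long.parseLong(f[1].trim()));
    }

    public static void main(String[] args) {
        Link l = parseLink("1,2");
        System.out.println(l.sourceId() + " -> " + l.targetId()); // prints: 1 -> 2
    }
}
```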