Humorloos closed 2 years ago
Common simple matcher implementations built with the WInte.r framework:
LevenshteinEditDistance, TFIDFCosineSimilarity, VectorSpaceCosineSimilarity
VectorSpaceCosineSimilarity, TFIDFCosineSimilarity & TokenizingJaccardSimilarity
Iteratively, else: GeneralisedMaximumOfContainment, ComplexSetSimilarity & GeneralisedJaccard.
Netflix-IMDb based matchers:
VectorSpaceCosineSimilarity, TFIDFCosineSimilarity & TokenizingJaccardSimilarity
Iteratively, else: GeneralisedMaximumOfContainment, ComplexSetSimilarity & GeneralisedJaccard.
AbsoluteDifferenceSimilarity, [Unadjusted/]DeviationSimilarity and PercentageSimilarity.
Netflix-Streaming based matchers:
a. Vector/Cosine/TF-IDF similarity metric testing
b. PercentageSimilarity or DeviationSimilarity metric inclusion
Iteratively, else: GeneralisedMaximumOfContainment, ComplexSetSimilarity & GeneralisedJaccard.
Note: Development work is in progress for these matchers. Any new ideas for creating simple matchers for the datasets would be much appreciated. Thanks!
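For anyone who wants to sanity-check scores by hand, here is a minimal sketch of two of the title measures named above: a normalised Levenshtein similarity and a token-based Jaccard similarity. It is plain Python and does not use the framework's own classes, so the function names and the normalisation (distance divided by the longer string's length) are assumptions, and the values may differ slightly from what the real matchers report.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Edit distance via dynamic programming (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Assumed normalisation: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein_distance(a, b) / longest


def token_jaccard_similarity(a: str, b: str) -> float:
    """Jaccard overlap of lowercased, whitespace-separated tokens."""
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


if __name__ == "__main__":
    # Made-up titles, purely for illustration.
    print(levenshtein_similarity("The Matrix", "The Matrx"))
    print(token_jaccard_similarity("The Matrix", "the matrix reloaded"))
```

The character-based measure is forgiving of typos, while the token-based one is forgiving of word order, which is one reason to try both on the same title pairs.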
Update: The approach has changed. We now use TitleMatcher with Levenshtein distance, together with TitleBlocker, to create the gold standard. Manual preparation of the gold standard from this matcher's output is currently in progress.
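To illustrate what a title-based blocker buys us, here is a rough sketch of blocking in plain Python. The blocking key (the first two alphanumeric characters of the lowercased title) is a hypothetical choice for illustration only; the key used by TitleBlocker may well be different. The record ids and titles are made up.

```python
from collections import defaultdict


def title_blocking_key(title: str) -> str:
    """Hypothetical key: first two alphanumeric characters, lowercased."""
    cleaned = "".join(ch for ch in title.lower() if ch.isalnum())
    return cleaned[:2]


def candidate_pairs(left, right, key_fn):
    """Yield (left_id, right_id) only for records sharing a blocking key."""
    blocks = defaultdict(list)
    for record_id, title in right:
        blocks[key_fn(title)].append(record_id)
    for record_id, title in left:
        for other_id in blocks.get(key_fn(title), []):
            yield record_id, other_id


# Made-up sample records as (id, title) tuples.
netflix = [("netflix_1", "The Matrix"), ("netflix_2", "Inception")]
streaming = [
    ("streaming_1", "The Matrix Reloaded"),
    ("streaming_2", "Interstellar"),
    ("streaming_3", "Thor"),
]

pairs = list(candidate_pairs(netflix, streaming, title_blocking_key))

if __name__ == "__main__":
    # Only pairs inside the same block are compared by the matcher.
    print(len(pairs), "candidate pairs instead of", len(netflix) * len(streaming))
```

The trade-off is the usual one: a coarser key keeps more true matches in the candidate set but leaves more pairs to compare.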
Note: Linear-combination-based matchers with year blockers turned out to be too good, which makes FP and FN pairs much harder to obtain from them.
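For context, a linear-combination matcher behind a year blocker might look roughly like the sketch below. Everything specific here is an assumption: the two component measures (a character-level ratio from Python's `difflib` as a stand-in for an edit-distance similarity, plus a token Jaccard), the weights 0.6 and 0.4, and the record layout with `title` and `year` fields.

```python
from difflib import SequenceMatcher


def char_similarity(a: str, b: str) -> float:
    """Character-level ratio from difflib (stand-in, not Levenshtein)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def token_similarity(a: str, b: str) -> float:
    """Jaccard overlap of lowercased, whitespace-separated tokens."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 1.0


def score_pair(left, right, char_weight=0.6, token_weight=0.4):
    """Weighted sum of both measures; None if the year blocker rejects the pair."""
    if left["year"] != right["year"]:
        return None  # year blocker: pair is never compared
    return (char_weight * char_similarity(left["title"], right["title"])
            + token_weight * token_similarity(left["title"], right["title"]))


if __name__ == "__main__":
    # Made-up records, purely for illustration.
    a = {"title": "Heat", "year": 1995}
    b = {"title": "Heat", "year": 1995}
    c = {"title": "Heat", "year": 1986}
    print(score_pair(a, b), score_pair(a, c))
```

Because the blocker already discards same-title pairs from different years, the remaining pairs tend to score either very high or very low, which fits the observation that few borderline FP/FN pairs are left.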
Hi @subashp93, I need your help constructing the gold standard for the streaming-netflix dataset pair. Use the file gold_standard_base.csv below as the reference for constructing the other two gold standard files. Please pick your entries after the 200k mark in the base.csv file.
gold_standard_netflix_streaming_input.csv gold_standard_netflix_streaming_reference.csv gold_standard_base.csv
Handy commands for peeking into the XML file data from the terminal while preparing the gold standard:
grep -rnw 'streaming.xml' -e 'streaming_10499'
sed -n '472145,+10p' streaming.xml
grep -rnw 'netflix.xml' -e 'netflix_7141'
sed -n '490720,+10p' netflix.xml
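If the `sed` address form `N,+10p` is not available on your machine (it is a GNU extension), the same lookup can be sketched in Python: find the line containing a record id as a whole word, then return it with the following lines. The function name `peek` and the sample XML lines are made up for illustration; the real element layout of the files is not assumed here.

```python
import re


def peek(lines, needle, context=10):
    """Return (line_number, [matching line + next `context` lines]) per hit.

    The lookarounds mimic grep -w: the id must not be preceded or followed
    by another word character.
    """
    pattern = re.compile(r"(?<!\w)" + re.escape(needle) + r"(?!\w)")
    rows = [line.rstrip("\n") for line in lines]
    return [
        (index + 1, rows[index:index + context + 1])
        for index, row in enumerate(rows)
        if pattern.search(row)
    ]


# Made-up sample lines; with a real file you would pass an open file handle:
#     with open("streaming.xml", encoding="utf-8") as handle:
#         hits = peek(handle, "streaming_10499")
sample = [
    "<movie>",
    "<id>streaming_10499</id>",
    "<title>A</title>",
    "</movie>",
    "<id>streaming_104990</id>",
]

if __name__ == "__main__":
    for line_number, block in peek(sample, "streaming_10499", context=2):
        print(line_number)
        print("\n".join(block))
```

Note that this reads the whole file into memory, which should be fine for files of a few hundred thousand lines.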
Another base reference file for gold standard creation for the netflix-streaming datasets, with Levenshtein distance as the driving metric, supplemented with a year blocker.
has dependencies #30 #31
Implement very simple matcher based on
apply matchers to netflix - streaming and netflix - imdb pairs to retrieve similarities for all pairs of movies
provide lists of pairs sorted by similarity for preparation of gold standards
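The last two tasks (score all pairs, then provide a list sorted by similarity) could be sketched as follows. This is only an illustration under assumptions: the similarity is `difflib`'s ratio as a stand-in for whichever matcher is used, the CSV column names `left_id`, `right_id`, `similarity` are hypothetical, and the records are made up.

```python
import csv
import io
from difflib import SequenceMatcher
from itertools import product


def title_similarity(a: str, b: str) -> float:
    """Stand-in similarity; swap in the matcher's own measure."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def score_all_pairs(left, right, similarity=title_similarity):
    """Score every (left, right) pair and sort by similarity, highest first."""
    scored = [
        (left_id, right_id, similarity(left_title, right_title))
        for (left_id, left_title), (right_id, right_title) in product(left, right)
    ]
    scored.sort(key=lambda row: row[2], reverse=True)
    return scored


def write_pairs_csv(rows, handle):
    """Write the ranked pairs with a header row (hypothetical column names)."""
    writer = csv.writer(handle)
    writer.writerow(["left_id", "right_id", "similarity"])
    for left_id, right_id, score in rows:
        writer.writerow([left_id, right_id, f"{score:.4f}"])


# Made-up sample records as (id, title) tuples.
netflix = [("netflix_1", "The Matrix"), ("netflix_2", "Inception")]
streaming = [("streaming_1", "The Matrix"), ("streaming_2", "Interstellar")]

ranked = score_all_pairs(netflix, streaming)
buffer = io.StringIO()
write_pairs_csv(ranked, buffer)

if __name__ == "__main__":
    print(buffer.getvalue())
```

Scoring all pairs is quadratic in the number of movies, so on the full datasets a blocker in front of this step is probably necessary.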
Deadline: 11-14