Implement very simple matchers for gold standard generation

Humorloos commented 3 years ago

has dependencies #30 #31

Implement a very simple matcher based on:
- title, e.g. using edit distance
- some other attribute that is also present in all 3 datasets

Apply the matchers to the netflix - streaming and netflix - imdb pairs to compute similarities for all pairs of movies.
Provide lists of pairs sorted by similarity for preparing the gold standards.
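
A minimal sketch of the title-based matcher described above, assuming each dataset has already been loaded into a plain `{id: title}` dictionary. The function names and the normalisation by the longer title are illustrative choices, not part of the project code.

```python
from itertools import product


def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def title_similarity(a: str, b: str) -> float:
    """Normalise the distance to [0, 1]; 1.0 means identical titles."""
    a, b = a.strip().lower(), b.strip().lower()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def ranked_pairs(left: dict, right: dict) -> list:
    """Score every (left, right) title pair and sort by similarity, best first."""
    scored = [(l_id, r_id, title_similarity(l_title, r_title))
              for (l_id, l_title), (r_id, r_title)
              in product(left.items(), right.items())]
    return sorted(scored, key=lambda row: row[2], reverse=True)
```

Comparing all pairs is quadratic in the dataset sizes, so on the full datasets this would probably need a blocker in front of it.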

Deadline: 11-14

ashishrana160796 commented 3 years ago

Common simple matcher implementations built with the WInte.r framework:

Matchers for title: LevenshteinEditDistance, TFIDFCosineSimilarity, VectorSpaceCosineSimilarity.
Matchers for the genre and director list attributes: VectorSpaceCosineSimilarity, TFIDFCosineSimilarity and TokenizingJaccardSimilarity, tried iteratively; otherwise GeneralisedMaximumOfContainment, ComplexSetSimilarity and GeneralisedJaccard.
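
For illustration, a token-based Jaccard similarity on list-valued attributes such as genre could look roughly like the sketch below. This is a plain-Python approximation of the idea, not the framework's TokenizingJaccardSimilarity; the tokenisation on commas and whitespace and the handling of empty values are assumptions.

```python
def tokenize(value: str) -> set:
    """Lower-case the value and split it on commas and whitespace."""
    return set(value.lower().replace(",", " ").split())


def jaccard(a: str, b: str) -> float:
    """Size of the token intersection divided by the size of the token union."""
    tokens_a, tokens_b = tokenize(a), tokenize(b)
    if not tokens_a and not tokens_b:
        return 0.0  # assumption: two empty values give no evidence of a match
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)
```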

ashishrana160796 commented 3 years ago

Netflix-IMDb based matchers:

Matchers for the actor names, production company and writer list attributes: VectorSpaceCosineSimilarity, TFIDFCosineSimilarity and TokenizingJaccardSimilarity, tried iteratively; otherwise GeneralisedMaximumOfContainment, ComplexSetSimilarity and GeneralisedJaccard.
Numeric matchers for the budget attribute: AbsoluteDifferenceSimilarity, [Unadjusted/]DeviationSimilarity and PercentageSimilarity.
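
A rough sketch of how two of the numeric measures for budget might behave. The function names echo the measures listed above, but the formulas and the `max_diff` / `max_percentage` parameters are assumptions made for illustration and may differ from the framework's own definitions.

```python
def absolute_difference_similarity(a: float, b: float, max_diff: float) -> float:
    """1.0 for equal values, falling linearly to 0.0 at a difference of max_diff."""
    diff = abs(a - b)
    return 0.0 if diff >= max_diff else 1.0 - diff / max_diff


def percentage_similarity(a: float, b: float, max_percentage: float) -> float:
    """Like the above, but the difference is taken relative to the larger value."""
    largest = max(abs(a), abs(b))
    if largest == 0:
        return 1.0  # both values are zero, so they are identical
    pct = abs(a - b) / largest
    return 0.0 if pct >= max_percentage else 1.0 - pct / max_percentage
```

A relative measure is probably the better fit for budgets, since absolute differences scale with the size of the production.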

Netflix-Streaming based matchers:

Matchers based on a linear combination of (director, language), (director, language, genre) and (director, language, duration):
a. Vector/Cosine/TF-IDF similarity metric testing
b. PercentageSimilarity or DeviationSimilarity metric inclusion
Tried iteratively; otherwise GeneralisedMaximumOfContainment, ComplexSetSimilarity and GeneralisedJaccard.
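
The linear combination itself can be sketched as a weighted average of per-attribute similarities. The attribute names come from the combinations listed above; the weights in the usage note and the choice to skip missing attributes are hypothetical.

```python
def linear_combination(scores: dict, weights: dict) -> float:
    """Weighted average of per-attribute similarities, skipping missing ones."""
    used = {name: w for name, w in weights.items()
            if scores.get(name) is not None}
    total = sum(used.values())
    if total == 0:
        return 0.0  # no usable attribute, so no evidence of a match
    return sum(scores[name] * w for name, w in used.items()) / total
```

For example, `linear_combination({"director": 1.0, "language": 1.0, "duration": 0.5}, {"director": 0.5, "language": 0.2, "duration": 0.3})` weights director most heavily; renormalising by the weights actually used keeps the score in [0, 1] when an attribute is missing.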

Note: Development work is in progress for these matchers. Any new ideas for creating simple matchers for the datasets will be really appreciated. Thanks

ashishrana160796 commented 2 years ago

Update on the approach: used TitleMatcher with Levenshtein distance, together with TitleBlocker, to create the gold standard. Manual preparation from this matcher's output is currently in progress.

Note: Linear-combination-based matchers with year blockers turned out to be too good, so FP and FN pairs are much harder to obtain from them.

ashishrana160796 commented 2 years ago

Hi @subashp93, I need your help in constructing the gold standard for the streaming-netflix dataset pair. Use the file gold_standard_base.csv below as a reference for constructing the other two gold standard files. Please pick your entries after the 200k mark in the base.csv file.

gold_standard_netflix_streaming_input.csv gold_standard_netflix_streaming_reference.csv gold_standard_base.csv

ashishrana160796 commented 2 years ago

Handy commands for peeking into the XML file data from the terminal while preparing the gold standard:

grep -rnw 'streaming.xml' -e 'streaming_10499'
sed -n '472145,+10p' streaming.xml

grep -rnw 'netflix.xml' -e 'netflix_7141'
sed -n '490720,+10p' netflix.xml
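
If a scriptable version is more convenient, the same lookup (find the line holding an id, then show the following lines) can be sketched in Python. The helper name `peek` is made up for this example, and it works purely line by line without parsing the XML.

```python
import re
from itertools import islice


def peek(path: str, record_id: str, context: int = 10) -> list:
    """Return the first line containing record_id as a whole word,
    followed by the next `context` lines of the file."""
    # Word boundaries mimic grep -w, so 'streaming_10499' does not
    # match inside 'streaming_104990'.
    pattern = re.compile(r"(?<!\w)" + re.escape(record_id) + r"(?!\w)")
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if pattern.search(line):
                following = islice(handle, context)
                return [line.rstrip("\n")] + [l.rstrip("\n") for l in following]
    return []
```

Usage would be along the lines of `print("\n".join(peek("streaming.xml", "streaming_10499")))`.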

ashishrana160796 commented 2 years ago

Another base reference file for gold standard creation for the netflix-streaming datasets, with Levenshtein distance as the driving metric, supplemented with a year blocker.

gold_standard_year_blocker_base.csv
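
The idea behind a year blocker can be sketched as follows: only movies sharing the same release year are compared, which cuts the number of candidate pairs before the title metric is applied. The record layout with `id` and `year` keys is an assumption for this example, not the project's actual data model.

```python
from collections import defaultdict


def block_by_year(records: list) -> dict:
    """Group records by their release year (the blocking key)."""
    blocks = defaultdict(list)
    for record in records:
        blocks[record["year"]].append(record)
    return blocks


def candidate_pairs(left: list, right: list):
    """Yield only the (left id, right id) pairs that share a release year."""
    right_blocks = block_by_year(right)
    for l_rec in left:
        for r_rec in right_blocks.get(l_rec["year"], []):
            yield l_rec["id"], r_rec["id"]
```

Exact-year blocking will miss true matches whose years differ between the datasets, so it trades some recall for speed.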

Humorloos / IE683

Implement very simple matchers for gold standard generation #33