Humorloos / IE683

Implement very simple matchers for gold standard generation #33

Closed Humorloos closed 2 years ago

Humorloos commented 3 years ago

has dependencies #30 #31

Deadline: 11-14

ashishrana160796 commented 3 years ago

Common simple matcher implementations built with the WInte.r framework:

ashishrana160796 commented 3 years ago

Netflix-IMDb based matchers:

Netflix-Streaming based matchers:

Note: Development work on these matchers is in progress. Any new ideas for creating simple matchers for the datasets would be really appreciated. Thanks!

ashishrana160796 commented 2 years ago

Update: The approach has changed. I used the TitleMatcher with Levenshtein distance, together with the TitleBlocker, to create the gold standard. Manual preparation based on this matcher's output is currently in progress.

Note: Linear-combination-based matchers with year blockers turned out to be too good, so FP and FN pairs are quite hard to find for them.
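
For anyone who wants to sanity-check title scores outside the framework, here is a minimal sketch of a Levenshtein-based title similarity. It is plain Python, not the WInte.r TitleMatcher itself; the function names, the lowercasing step, and the normalisation by the longer title are assumptions made for illustration. Blocking is left out on purpose.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion, and substitution."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string, keep the rows short
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def title_similarity(t1: str, t2: str) -> float:
    """Similarity in [0, 1]; 1.0 means identical after trimming and lowercasing.

    Assumption: normalise by the length of the longer title.
    """
    t1, t2 = t1.strip().lower(), t2.strip().lower()
    longest = max(len(t1), len(t2))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(t1, t2) / longest
```

Pairs whose score lands close to whatever threshold is chosen are the ones most worth checking by hand, since that is where a matcher is most likely to produce false positives and false negatives.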

ashishrana160796 commented 2 years ago

Hi @subashp93, I need your help in constructing the gold standard for the streaming-netflix dataset pair. Use the file gold_standard_base.csv below for constructing the other two gold standard files for reference. Please pick your entries after the 200k mark in the base.csv file.

gold_standard_netflix_streaming_input.csv gold_standard_netflix_streaming_reference.csv gold_standard_base.csv
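
One way to jump straight to the entries after the 200k mark without opening the whole file in an editor is sketched below. It assumes that "200k mark" means the first 200,000 data rows and that the base file has a header row; both are guesses, so adjust `mark` and `has_header` as needed. The function name and parameters are hypothetical.

```python
import csv
from itertools import islice


def rows_after_mark(csv_file, mark=200_000, limit=10, has_header=True):
    """Return (header, rows): up to `limit` data rows after the first `mark` data rows.

    `csv_file` is any open text file or file-like object.
    """
    reader = csv.reader(csv_file)
    # Assumption: the first line is a header; pass has_header=False otherwise.
    header = next(reader, None) if has_header else None
    # islice skips `mark` rows lazily, so the file is streamed, not loaded.
    return header, list(islice(reader, mark, mark + limit))
```

Typical use would be to open gold_standard_base.csv with `open(path, newline="", encoding="utf-8")` and pass the handle in; the encoding is also an assumption.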

ashishrana160796 commented 2 years ago

Handy commands for peeking into the XML file data from the terminal while preparing the gold standard:

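# Find the line number of a record ID, then print that line and the 10 lines after it
# (the 'N,+10p' address form is a GNU sed extension).
grep -rnw 'streaming.xml' -e 'streaming_10499'
sed -n '472145,+10p' streaming.xml

grep -rnw 'netflix.xml' -e 'netflix_7141'
sed -n '490720,+10p' netflix.xml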
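
If GNU sed is not available, or the two steps get tedious, the same lookup can be sketched in Python. This is a hypothetical helper, not part of the project: it does a whole-word search for the ID (similar in spirit to `grep -w`) and returns the matching line number together with that line and the following `context` lines.

```python
import re
from itertools import islice


def peek(path, needle, context=10, encoding="utf-8"):
    """Return a list of (line_number, lines) for each whole-word hit of `needle`.

    `lines` holds the matching line plus up to `context` lines after it.
    Lines consumed as context are not searched again, so a second hit inside
    that window is shown as context rather than reported separately.
    """
    # Whole-word match: no word character directly before or after the ID.
    pattern = re.compile(r"(?<!\w)" + re.escape(needle) + r"(?!\w)")
    hits = []
    with open(path, encoding=encoding) as handle:
        numbered = enumerate(handle, start=1)
        for number, line in numbered:
            if pattern.search(line):
                block = [line.rstrip("\n")]
                block += [nxt.rstrip("\n") for _, nxt in islice(numbered, context)]
                hits.append((number, block))
    return hits
```

For example, `peek("netflix.xml", "netflix_7141")` would cover both commands above in one call; the UTF-8 encoding is an assumption about the XML files.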

ashishrana160796 commented 2 years ago

Here is another base reference file for gold standard creation for the netflix-streaming datasets, with Levenshtein distance as the driving metric, supplemented with a year blocker.

gold_standard_year_blocker_base.csv
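
To illustrate what the year blocker contributes, here is a rough sketch of year-based blocking in plain Python. It is not the WInte.r blocker; the record layout (dicts with a `year` field) and the function name are assumptions. Only records that share a release year become candidate pairs, so the title comparison runs on far fewer pairs than the full cross product.

```python
from collections import defaultdict


def year_blocked_pairs(left, right, year_key="year"):
    """Return candidate (left_record, right_record) pairs that share a year.

    Records are plain dicts (hypothetical layout). Records without a year
    are never paired, which is one possible policy among several.
    """
    by_year = defaultdict(list)
    for record in right:
        year = record.get(year_key)
        if year is not None:
            by_year[year].append(record)

    pairs = []
    for record in left:
        year = record.get(year_key)
        if year is None:
            continue
        # Only records in the same year block are compared later on.
        pairs.extend((record, other) for other in by_year.get(year, []))
    return pairs
```

A strict same-year block drops true matches whose year differs between the two sources, so it trades some recall for a much smaller candidate set.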