Open taohu88 opened 5 years ago
This would offer us the ability to calculate the similarity between strings/phrases. String similarity has been a requested feature for a while.
This also helps light up ranking without requiring the user to pre-calculate the (query vs. document) features. Normally in learning-to-rank, you have three styles of features: (1) query-level features (e.g. frequency of the query, number of words in the query), (2) document-level features (e.g. page length, domain popularity), and (3) query-dependent features (e.g. number of words common to the doc and the query, word-embedding distance).
This transform helps calculate #3.
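To make the three styles concrete, here is a minimal sketch in Python; the function names and the feature choices are illustrative assumptions, not part of any library:

```python
# Hypothetical sketch of the three learning-to-rank feature styles.
# All names here are made up for illustration.

def query_features(query):
    # (1) Query-level: depends only on the query.
    return {"num_query_words": len(query.split())}

def doc_features(doc):
    # (2) Document-level: depends only on the document.
    return {"page_length": len(doc)}

def query_doc_features(query, doc):
    # (3) Query-dependent: depends on the (query, document) pair.
    q = set(query.lower().split())
    d = set(doc.lower().split())
    return {"num_common_words": len(q & d)}

print(query_doc_features("deep learning tutorial",
                         "A tutorial on deep neural networks"))
# "deep" and "tutorial" are shared, so num_common_words is 2.
```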
A VectorDistanceTransform can be used in a few ways:
In a ranking pipeline, this would look like the following (command-line syntax shown for terseness):
CV data=MyRankingDataset.tsv loader=TextLoader{sep=, col=Label:R4:0 col=GroupId:TX:1
col=Query:TX:2 col=Doc:TX:3 col=QueryLevelFeatures:R4:4-53 col=DocLevelFeatures:R4:54-103 header=+} prexf=HashTransform{col=GroupId}
# Calculate Unigrams+Bigrams
xf=TextTransform{col=QueryUnigramBigram:Query wordExtractor=NGramHashExtractorTransform{ngram=2 bits=18 all=+} charExtractor={}} xf=TextTransform{col=DocUnigramBigram:Doc wordExtractor=NGramHashExtractorTransform{ngram=2 bits=18 all=+} charExtractor={}}
# Calculate trichargrams
xf=TextTransform{col=QueryTrichar:Query wordExtractor={} charExtractor=NGramHashExtractorTransform{ngram=3 bits=15 all=-}}
xf=TextTransform{col=DocTrichar:Doc wordExtractor={} charExtractor=NGramHashExtractorTransform{ngram=3 bits=15 all=-}}
# Calculate Word Embeddings
xf=TextTransform{col=QueryTokens:Query tokens=+ wordExtractor={} charExtractor={}}
xf=TextTransform{col=DocTokens:Doc tokens=+ wordExtractor={} charExtractor={}}
xf=WordEmbeddingsTransform{col=QueryWordEmbeddings:QueryTokens_TransformedText col=DocWordEmbeddings:DocTokens_TransformedText}
# Calculate distance (**new VectorDistanceTransform**)
xf=VectorDistanceTransform{col=DistUnigramBigram:QueryUnigramBigram,DocUnigramBigram}
xf=VectorDistanceTransform{col=DistTrichar:QueryTrichar,DocTrichar}
xf=VectorDistanceTransform{col=DistWordEmbeddings:QueryWordEmbeddings,DocWordEmbeddings}
# Concatenate together for final Feature vector
xf=Concat{col=Features:DistUnigramBigram,DistTrichar,DistWordEmbeddings,QueryLevelFeatures,DocLevelFeatures}
tr=FastTreeRanking{l2=0 m=5 initwts=0.1} sf=D:\resume_model\1.model.txt out=D:\resume_model\1.model.zip
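The three VectorDistanceTransform steps above each collapse a (query vector, document vector) pair into a distance feature. A minimal sketch of the kind of metrics such a transform might offer, assuming cosine and Euclidean (L2) distance over dense vectors such as averaged word embeddings; the function names are assumptions, not the actual transform's API:

```python
import math

def cosine_distance(a, b):
    # 1 - cosine similarity; 1.0 for a zero vector to avoid division by zero.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb) if na and nb else 1.0

def l2_distance(a, b):
    # Euclidean distance between two equal-length vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Made-up query/document embedding vectors for illustration.
query_vec = [0.2, 0.8, 0.0, 0.4]
doc_vec   = [0.1, 0.9, 0.3, 0.5]
print(cosine_distance(query_vec, doc_vec))
print(l2_distance(query_vec, doc_vec))
```

Each call yields one scalar, so the final Concat step only carries the distances, not the underlying high-dimensional vectors.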
Issue
What did you do? When using SSWE word embeddings, I'd like VectorDistanceTransform to compute a similarity score between two sentiment vectors. In my use case, I don't need to keep the two sentiment vectors beyond the similarity score.
What did you expect? I expect VectorDistanceTransform to compute the similarity score.
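The requested behavior can be sketched as reducing the two sentiment vectors to a single cosine-similarity score and discarding the vectors; the vector values below are made-up numbers for illustration:

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity in [-1, 1]; 0.0 if either vector is all zeros.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# Two hypothetical SSWE sentiment vectors for a query/doc pair.
sent_query = [0.31, -0.12, 0.77]
sent_doc   = [0.28, -0.05, 0.81]

# Only this scalar needs to survive into the feature vector;
# the sentiment vectors themselves can be dropped.
score = cosine_similarity(sent_query, sent_doc)
print(score)
```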