dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License

Please add vector distance transform #2126

Open taohu88 opened 5 years ago

taohu88 commented 5 years ago

Issue

justinormont commented 5 years ago

This would let us calculate the similarity between strings/phrases. String similarity has been a requested feature for a while.

This also helps light up ranking without requiring the user to pre-calculate the (query vs. document) features. Normally in learning-to-rank, you have three styles of features: (1) query-level features (e.g. frequency of query, number of words in query), (2) document-level features (e.g. page length, domain popularity), and (3) query-dependent features (e.g. number of common words in doc and query, word embedding distance).

This transform helps calculate #3.
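As a rough illustration of what a query-dependent feature looks like, here is a minimal Python sketch that computes a distance between a query vector and a document vector. The issue does not say which metric the transform would use, so cosine distance is an assumption here, and `cosine_distance` is a hypothetical helper, not an ML.NET API.

```python
import math
from typing import Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine distance (1 - cosine similarity) between two equal-length vectors.

    Assumption: cosine is used as the metric; the proposed transform could
    just as well expose L1, L2, or other distances.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Convention chosen for this sketch: an all-zero vector is treated
        # as maximally distant from everything.
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


# One distance value per (query, document) row becomes one feature column.
query_vec = [1.0, 0.0, 2.0]
doc_vec = [0.5, 0.0, 1.0]
dist = cosine_distance(query_vec, doc_vec)
```

The resulting scalar plays the role of a single query-dependent feature (style 3) for that row.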

A VectorDistanceTransform can be used in a few ways:

In a ranking pipeline, this would look like the following (command-line syntax, for semi-terseness):

CV data=MyRankingDataset.tsv loader=TextLoader{sep=, col=Label:R4:0 col=GroupId:TX:1 
 col=Query:TX:2 col=Doc:TX:3 col=QueryLevelFeatures:R4:4-53 col=DocLevelFeatures:R4:54-103 header=+} prexf=HashTransform{col=GroupId}

# Calculate Unigrams+Bigrams
xf=TextTransform{col=QueryUnigramBigram:Query wordExtractor=NGramHashExtractorTransform{ngram=2 bits=18 all=+} charExtractor={}}
xf=TextTransform{col=DocUnigramBigram:Doc wordExtractor=NGramHashExtractorTransform{ngram=2 bits=18 all=+} charExtractor={}}

# Calculate trichargrams
xf=TextTransform{col=QueryTrichar:Query wordExtractor={} charExtractor=NGramHashExtractorTransform{ngram=3 bits=15 all=-}}
xf=TextTransform{col=DocTrichar:Doc wordExtractor={} charExtractor=NGramHashExtractorTransform{ngram=3 bits=15 all=-}}

# Calculate Word Embeddings
xf=TextTransform{col=QueryTokens:Query tokens=+ wordExtractor={} charExtractor={}}
xf=TextTransform{col=DocTokens:Doc tokens=+ wordExtractor={} charExtractor={}}
xf=WordEmbeddingsTransform{col=QueryWordEmbeddings:QueryTokens_TransformedText col=DocWordEmbeddings:DocTokens_TransformedText}

# Calculate distance (**new VectorDistanceTransform**)
xf=VectorDistanceTransform{col=DistUnigramBigram:QueryUnigramBigram,DocUnigramBigram}
xf=VectorDistanceTransform{col=DistTrichar:QueryTrichar,DocTrichar}
xf=VectorDistanceTransform{col=DistWordEmbeddings:QueryWordEmbeddings,DocWordEmbeddings}

# Concatenate together for final Feature vector
xf=Concat{col=Features:DistUnigramBigram,DistTrichar,DistWordEmbeddings,QueryLevelFeatures,DocLevelFeatures}

tr=FastTreeRanking{l2=0 m=5 initwts=0.1} sf=D:\resume_model\1.model.txt out=D:\resume_model\1.model.zip
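To make the data flow of the pipeline above concrete, here is a small Python sketch of one branch of it for a single row: hash character trigrams of the query and document into buckets, take the distance between the two bags, and concatenate that distance with the precomputed feature blocks. This is only an approximation under stated assumptions: `zlib.crc32` stands in for whatever hash the n-gram extractor really uses, cosine distance is assumed as the metric, and all function names are hypothetical.

```python
import math
import zlib
from typing import Dict, List, Sequence


def hashed_ngrams(text: str, n: int = 3, bits: int = 15) -> Dict[int, float]:
    """Sparse bag of character n-grams hashed into 2**bits buckets.

    Assumption: crc32 is a stand-in hash chosen for determinism in this sketch.
    """
    size = 1 << bits
    counts: Dict[int, float] = {}
    for i in range(len(text) - n + 1):
        bucket = zlib.crc32(text[i:i + n].encode("utf-8")) % size
        counts[bucket] = counts.get(bucket, 0.0) + 1.0
    return counts


def sparse_cosine_distance(a: Dict[int, float], b: Dict[int, float]) -> float:
    """Cosine distance between two sparse vectors stored as {index: value}."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # convention for this sketch: empty bag is maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def build_features(query: str, doc: str,
                   query_level: Sequence[float],
                   doc_level: Sequence[float]) -> List[float]:
    """Mirror of the DistTrichar + Concat steps for one (query, doc) row."""
    dist_trichar = sparse_cosine_distance(hashed_ngrams(query),
                                          hashed_ngrams(doc))
    return [dist_trichar] + list(query_level) + list(doc_level)


features = build_features("vector distance", "vector distance transform",
                          query_level=[0.2, 3.0], doc_level=[120.0])
```

The unigram/bigram and word-embedding branches would follow the same shape, each contributing one more distance value ahead of the query-level and document-level features.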