Closed GenweiWu closed 6 years ago
o19s/elasticsearch-learning-to-rank · GitHub
elasticsearch-6.1.2.zip
kibana-6.1.2-windows-x86_64.zip
x-pack-6.1.2.zip
ltr-1.0.0-es6.1.2.zip
x-pack
X-Pack is an Elasticsearch extension that bundles security, alerting, monitoring, graph, and reporting features into one easy-to-install package. Although X-Pack is designed to work seamlessly, you can easily enable or disable individual features. It is paid software.
Installation steps: Installing X-Pack in Elasticsearch | Elasticsearch Reference [6.1] | Elastic
Manual installation:
D:\2222\smartSearch\elasticsearch-6.1.2\bin>elasticsearch-plugin install file:///E:/software/elastic/6.1.2/x-pack-6.1.2.zip
After installing x-pack, access is restricted and a username and password are required.
Default username: elastic
bin/x-pack/setup-passwords interactive
Use this to set new passwords for x-pack.
Reference: installing x-pack for Elasticsearch
ltr
o19s/elasticsearch-learning-to-rank · GitHub
D:\2222\smartSearch\elasticsearch-6.1.2\bin>elasticsearch-plugin install file:///E:/software/elastic/6.1.2/ltr-1.0.0-es6.1.2.zip
D:\2222\smartSearch\elasticsearch-6.1.2\bin>elasticsearch-plugin list
ltr
x-pack
TMDB Data and Ranklib Jar
$ python prepare.py
GET http://es-learn-to-rank.labs.o19s.com/tmdb.json
GET http://es-learn-to-rank.labs.o19s.com/RankLib-2.8.jar
python indexMlTmdb.py
On Windows this raises an error:
'gbk' codec can't encode character '\u0153'
Solution:
The limitation of print() is Python's default encoding: since the system is Windows 7, Python's default output encoding is not 'utf-8', so changing stdout's encoding to 'utf-8' fixes it. Reference: 解决python3 UnicodeEncodeError: 'gbk' codec can't encode character '\xXX'
import io
import sys
import urllib.request

sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf8')  # change stdout's default encoding
res = urllib.request.urlopen('http://www.baidu.com')
htmlBytes = res.read()
print(htmlBytes.decode('utf-8'))
1. Problem
plugin:elasticsearch Authentication Exception
Solution:
In your Kibana installation, in /config/kibana.yml you have to set the username and password that Kibana should use to access Elasticsearch.
2. Problem
action [ltr:featurestore/data] is unauthorized for user [elastic]
Solution:
Set xpack.security.enabled: false in kibana.yml and elasticsearch.yml.
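For reference, disabling security means adding the same line to both config files (a minimal fragment; paths assume the default 6.1.2 layout):

```yaml
# elasticsearch-6.1.2/config/elasticsearch.yml and kibana-6.1.2/config/kibana.yml
xpack.security.enabled: false
```

Restart both Elasticsearch and Kibana afterwards for the change to take effect.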
The program below fails with: Unsupported major.minor version 52.0. JAVA_HOME clearly points to a 1.8 JDK, yet running java -version from Python reports 1.6.
test.py
import os

cmd = "java -version"  # reports 1.6
os.popen(cmd).read()
cmd = "echo %java_home%"  # points to the 1.8 JDK
# os.popen(cmd).read()
os.popen('java -jar RankLib-2.8.jar -ranker 0 -train sample_judgments_wfeatures.txt -save model.txt -frate 1.0').read()
Solution: in Control Panel > Programs > Java there was indeed a JDK 1.6 installed; uninstalling that JDK 1.6 fixes it.
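To catch this kind of mismatch early, a script can check the version string before shelling out to RankLib. A minimal sketch (parse_java_version and check_java are hypothetical helpers, not part of the demo; note that java writes its version banner to stderr, not stdout):

```python
import re
import subprocess

def parse_java_version(version_output):
    # 'java version "1.8.0_201"' -> (1, 8); None if no version found
    m = re.search(r'version "(\d+)\.(\d+)', version_output)
    return (int(m.group(1)), int(m.group(2))) if m else None

def check_java():
    # java -version prints to stderr, which is why os.popen() hides it
    out = subprocess.run(['java', '-version'],
                         capture_output=True, text=True).stderr
    return parse_java_version(out)
```

RankLib 2.8 is compiled for Java 8 (class file "major version 52"), so anything below (1, 8) should abort before training.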
At this point a username and password are required, so install the x-pack plugin first:
python prepare_xpack.py <xpack admin username>
Personally, I think we can actually do without xpack.
Request tests using Kibana's Dev Tools:
GET /tmdb/_search
GET /tmdb/movie/_search
GET /tmdb/_mapping
GET /tmdb/movie/_mapping
GET /tmdb/movie/_search
{
"_source": [
"title",
"overview"
],
"query": {
"multi_match": {
"query": "iron man",
"fields": [
"title",
"overview"
]
}
},
"rescore": {
"query": {
"rescore_query": {
"sltr": {
"params": {
"keywords": "iron man"
},
"model": "test_6"
}
}
}
}
}
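The rescore clause above can also be built programmatically. A sketch (sltr_rescore is my own helper name, not from the demo; window_size is an optional rescore parameter I've added for illustration):

```python
def sltr_rescore(keywords, model, window_size=1000):
    # Wrap an sltr query in a rescore clause, matching the Dev Tools request above
    return {
        "window_size": window_size,
        "query": {
            "rescore_query": {
                "sltr": {
                    "params": {"keywords": keywords},
                    "model": model
                }
            }
        }
    }
```

The returned dict can be dropped into the search body as the "rescore" value.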
GET /_ltr/_featureset/movie_features
GET /tmdb/_search
{
"size": 100,
"query": {
"bool": {
"must": [
{
"terms": {
"_id": [
"7555",
"1370",
"1369",
"1368",
"136278",
"102947",
"13969",
"61645",
"14423",
"54156"
]
}
}
],
"should": [
{
"sltr": {
"params": {
"keywords": "rambo"
},
"_name": "logged_featureset",
"featureset": "movie_features"
}
}
]
}
},
"ext": {
"ltr_log": {
"log_specs": {
"name": "main",
"named_query": "logged_featureset",
"missing_as_zero": true
}
}
}
}
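The ext block that triggers feature logging can likewise be generated. A sketch (ltr_log_ext is my own helper name; the keys match the request above):

```python
def ltr_log_ext(named_query="logged_featureset", name="main"):
    # Tell the LTR plugin to log feature values for the named sltr query,
    # treating missing features as zero
    return {
        "ltr_log": {
            "log_specs": {
                "name": name,
                "named_query": named_query,
                "missing_as_zero": True
            }
        }
    }
```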
python prepare_xpack.py <xpack admin username>
POSTs some xpack user/role information (details omitted)
python prepare.py
As the name suggests, this downloads two files: tmdb.json and RankLib.jar
python indexMlTmdb.py
Delete the tmdb index and recreate it:
index="tmdb"
es.indices.delete(index, ignore=[400, 404])
es.indices.create(index, body=settings)
POST the data into the tmdb index:
movieDict = json.loads(open('tmdb.json').read())
addCmd = {"_index": index,
"_type": "movie",
"_id": id,
"_source": movie}
yield addCmd
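The generator feeds the elasticsearch bulk helper one action per movie. A self-contained sketch of the same shape (bulk_movie_actions is my own name, not from indexMlTmdb.py):

```python
def bulk_movie_actions(movie_dict, index="tmdb"):
    # One bulk-index action per movie, mirroring the addCmd dict above
    for movie_id, movie in movie_dict.items():
        yield {"_index": index,
               "_type": "movie",
               "_id": movie_id,
               "_source": movie}

actions = list(bulk_movie_actions({"7555": {"title": "Rambo"}}))
```

In the real script the generator is passed to elasticsearch.helpers.bulk together with the client.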
You can verify with
GET /tmdb/_search
or GET /tmdb/movie/_search
Since every document's _type is movie, the two requests are equivalent.
python train.py
DELETE /_ltr
PUT /_ltr
because the featureset field in query 5 refers to the feature set created by:
POST /_ltr/_featureset/movie_features
{
  "featureset": {
    "features": [
      {
        "params": ["keywords"],
        "name": "1",
        "template": {
          "match": { "title": "{{keywords}}" }
        }
      },
      {
        "params": ["keywords"],
        "name": "2",
        "template": {
          "match": { "overview": "{{keywords}}" }
        }
      }
    ],
    "name": "movie_features"
  }
}
step 1
# qid:1: rambo
# qid:2: rocky
# qid:3: bullwinkle
yields
{1: 'rambo', 2: 'rocky', 3: 'bullwinkle'}
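Step 1 amounts to scanning the comment headers for qid-to-keywords pairs. A sketch of that parse (regex and function name are mine; judgments.py in the demo does the real work):

```python
import re

def parse_query_headers(lines):
    # '# qid:1: rambo' -> {1: 'rambo'}
    queries = {}
    for line in lines:
        m = re.match(r'#\s*qid:(\d+):\s*(.+)', line)
        if m:
            queries[int(m.group(1))] = m.group(2).strip()
    return queries
```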
step 2
4 qid:1 # 7555 Rambo
yields
docId = '7555', grade = 4, keywords = 'rambo', qid = 1
and the judgments are finally grouped by qid.
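Step 2 parses each judgment line into grade, qid, and docId, then groups by qid. A minimal sketch of that logic (my own simplified version of what judgments.py does):

```python
import re
from collections import defaultdict

def parse_judgment_line(line):
    # '4 qid:1 # 7555  Rambo' -> (4, 1, '7555')
    m = re.match(r'(\d+)\s+qid:(\d+)\s+#\s+(\S+)', line)
    return int(m.group(1)), int(m.group(2)), m.group(3)

def group_by_qid(lines):
    # Collect (grade, docId) pairs per query id
    grouped = defaultdict(list)
    for line in lines:
        grade, qid, doc_id = parse_judgment_line(line)
        grouped[qid].append((grade, doc_id))
    return grouped
```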
step 3: each qid's documents are collected into one set and queried:
GET /tmdb/_search
{
"size": 100,
"query": {
"bool": {
"must": [
{
"terms": {
"_id": [
"7555",
"1370",
"1369",
"1368",
"136278",
"102947",
"13969",
"61645",
"14423",
"54156"
]
}
}
],
"should": [
{
"sltr": {
"params": {
"keywords": "rambo"
},
"_name": "logged_featureset",
"featureset": "movie_features"
}
}
]
}
},
"ext": {
"ltr_log": {
"log_specs": {
"name": "main",
"named_query": "logged_featureset",
"missing_as_zero": true
}
}
}
}
Then the feature values are computed and finally written to the file sample_judgments_wfeatures.txt:
docId = '7555'
features = [12.318446, 10.573845]
grade = 4
keywords = 'rambo'
qid = 1
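Writing a fleshed-out judgment line is then just string formatting; features are 1-indexed in the LibSVM-style file. A sketch (to_ranklib_line is my own name; the output shape follows the format described in the demo):

```python
def to_ranklib_line(grade, qid, features, doc_id, keywords):
    # e.g. '4 qid:1 1:12.318446 2:10.573845 # 7555 rambo'
    feats = ' '.join('%d:%s' % (i + 1, f) for i, f in enumerate(features))
    return '%d qid:%d %s # %s %s' % (grade, qid, feats, doc_id, keywords)
```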
The feature scores above were obtained with the same logging query as in step 3.
The tutorial followed:
Learning to Rank Demo
This demo uses data from TheMovieDB (TMDB) to demonstrate using Ranklib learning to rank models with Elasticsearch.
Install Dependencies and prep data...
This demo requires the elasticsearch and requests libraries, plus the elasticsearch_xpack library if xpack support is necessary.
An aside: X Pack
Using the LTR plugin with xpack requires configuring appropriate roles. These can be set up automatically by prepare_xpack.py, which takes a username and will prompt for a password. After this is run, settings.cfg must be edited to uncomment the ESUser and ESPassword properties.
Download the TMDB Data & Ranklib Jar
The first time you run this demo, fetch RankLib.jar (used to train model) and tmdb.json (the dataset used)
Start Elasticsearch/install plugin
Start a supported version of Elasticsearch and follow the instructions to install the learning to rank plugin.
Index to Elasticsearch
This script will create a 'tmdb' index with default/simple mappings. You can edit this file to play with mappings.
Onto the machine learning...
TLDR
If you're actually going to build a learning to rank system, read past this section. But to sum up, the full Movie demo can be run by
Then you can search using
and search results can be printed to the console.
More on how all this actually works below:
Create and upload features (loadFeatures.py)
A "feature" in ES LTR corresponds to an Elasticsearch query. The score yielded by the query is used to train and evaluate the model. For example, if you feel that a TF*IDF title score corresponds to higher relevance, then that's a feature you'd want to train on! Other features might include how old a movie is, the number of keywords in a query, or whatever else you suspect might correlate to your user's sense of relevance.
If you examine loadFeatures.py you'll see how we create features. We first initialize the default feature store (PUT /_ltr). We create a feature set (POST /_ltr/_featureset/movie_features). Now we have a place to create features for both logging & use by our models!
In the demo, features 1...n json are mustache templates that correspond to the features. In this case, the features are identified by ordinal (feature 1 is in 1.json). They are uploaded to Elasticsearch Learning to Rank with these ordinals as the feature name. In eachFeature, you'll see a loop where we access each mustache template on the file system and return a JSON body for adding the feature to Elasticsearch.
For traditional Ranklib models, the ordinal is the only way features are identified. Other models use feature names, which makes developing, logging, and managing features more maintainable.
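The feature templates are mustache, so rendering one against params is a simple substitution. A naive sketch for illustration only (the real rendering happens server-side inside the plugin; render_feature is my own helper):

```python
import json

def render_feature(template, params):
    # Replace {{name}} placeholders in a feature template with param values
    body = json.dumps(template)
    for key, value in params.items():
        body = body.replace('{{%s}}' % key, value)
    return json.loads(body)
```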
Gather Judgments (sample_judgments.txt)
The first part of the training data is the judgment list. We've provided one in sample_judgments.txt.
What's a judgment list? A judgment list tells us how relevant a document is for a search query. In other words, a three-tuple of grade, document, and query keywords.
Quality comes in the form of grades. For example if movie "First Blood" is considered extremely relevant for the query Rambo, we give it a grade of 4 ('exactly relevant'). The movie Bambi would receive a '0'. Instead of the notional CSV format above, Ranklib and other learning to rank systems use a format from LibSVM, shown below:
You'll notice we bastardize this syntax to add comments identifying the keywords associated with each query id, and append metadata to each line. Code provided in judgments.py handles this syntax.
Log features (collectFeatures.py)
You saw above how we created features, the next step is to log features for each judgment 3-tuple. This code is in collectFeatures.py. Logging features can be done in several different contexts. Of course, in a production system, you may wish to log features as users search. In other contexts, you may have a hand-created judgment list (as we do) and wish to simply ask Elasticsearch Learning to Rank for feature values for query/document pairs.
In collectFeatures.py, you'll see an sltr query is included. This query points to a featureSet, not a model, so it does not influence the score. We filter down to the needed document ids for each keyword and allow this sltr query to run.
You'll also notice an ext component in the request. This search extension is part of the Elasticsearch Learning to Rank plugin and allows you to configure feature logging. You'll notice it refers to the query name of sltr, allowing it to pluck out the sltr query and perform the logging associated with the feature set.
Once features are gathered, the judgment list is fleshed out with feature values, the ordinals below corresponding to the features in our 1..n.json files.
Train (train.py and RankLib.jar)
With training data in place, it's time to ask RankLib to train a model and output it to a text file. RankLib supports linear models, ListNet, and several tree-based models such as LambdaMART. In train.py you'll notice how RankLib is called with command line arguments. Models test_N are created in our feature store for each type of RankLib model. In the saveModel function, you can see how the model is uploaded to our "movie_features" feature set.
Search using the model (search.py)
See what sort of search results you get! In search.py you'll see we execute the sltr query referring to a test_N model in the rescore phase. By default test_6 is used (corresponding to LambdaMART), but you can change the model used at the command line.
Search with default LambdaMART:
Try a different model: