elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

MLT bug when source disabled? #2914

Closed rlvoyer closed 9 years ago

rlvoyer commented 11 years ago

I'm trying to use the more_like_this handler in almost the exact same way it's used in the documentation here:

http://www.elasticsearch.org/guide/reference/api/more-like-this/

curl -XGET "http://localhost:9200/foo/document/1008534/_mlt?mlt_fields=cs,ks,tpcs&min_doc_freq=2"

{"error":"ElasticSearchException[No fields found to fetch the 'likeText' from]","status":500}

I'm guessing this bug stems from the fact that source is disabled, but I'm not really sure. If it is the case that source is required for MLT, you should document that fact.

s1monw commented 11 years ago

I think you either need the source or the field needs to be stored or you need to store term vectors for the field. But I agree we should document that!

thanks for raising this... what is your mapping for those fields?

rlvoyer commented 11 years ago

{
    "document": {
        "_source" : {
            "enabled" : false
        },
        "term_vector": "yes",
        "dynamic": false,
        "properties": {
            "_id": {
                "type": "long", 
                "index": "not_analyzed"
            },
            "cs": {
                "type": "string", 
                "analyzer": "keyword",
                "store": "no"
            }, 
            "ks": {
                "type": "string", 
                "analyzer": "keyword", 
                "store": "no"
            },
            "tpcs": {
                "type": "string",
                "analyzer": "keyword", 
                "store": "no"
            }
        }
    }
}

s1monw commented 11 years ago

Ah, I see. You should put term_vector next to store for each field you want to store term vectors for. Can you try that?

like this:

{
  "type" : "string",
  "store" : "no",
  "term_vector" : "yes"
}

simon

s1monw commented 11 years ago

I pushed a fix to the documentation: https://github.com/elasticsearch/elasticsearch.github.com/commit/25614ced9513e24dc3ad99b976b00e8c384ff9f2

rlvoyer commented 11 years ago

Thanks -- I'll make that fix. What is the effect (if any) of enabling term_vector storage at the top-level as I have done here?

s1monw commented 11 years ago

hmm it seems that this only works if it's stored or you enabled source. we should be able to support this if TV are stored for the fields as well... reopening

rlvoyer commented 11 years ago

Hey @s1monw -- have you had an opportunity to look into this issue?

kimchy commented 11 years ago

I am not a fan of supporting it for term vectors and no store, because then we need to get that info (the TV) from the document on the specific shard and then send it to all the shards to do the MLT based on it. Just store the source and do MLT based on that. You can also, btw, always use the MLT query as part of a search request and provide the text there externally.
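A minimal sketch of that last suggestion, assuming the `more_like_this` query of that era with its `fields` and `like_text` parameters (later releases renamed some of these). The helper name `build_mlt_search`, the placeholder text, and the threshold values are illustrative assumptions, not something posted in this thread; the block only builds and prints the request body and does not contact a server.

```python
import json


def build_mlt_search(like_text, fields, min_term_freq=1, min_doc_freq=2):
    """Build a search body that runs more_like_this on text supplied by the caller.

    Hypothetical helper: because the text is passed in directly, the reference
    document does not have to be fetched from _source or from stored fields.
    """
    return {
        "query": {
            "more_like_this": {
                "fields": list(fields),          # fields to match against
                "like_text": like_text,          # externally provided text
                "min_term_freq": min_term_freq,  # illustrative threshold
                "min_doc_freq": min_doc_freq,    # same value as in the original request
            }
        }
    }


# Field names are taken from the mapping posted earlier in the thread.
body = build_mlt_search("text of the reference document", ["cs", "ks", "tpcs"])
print(json.dumps(body, indent=2))
```

The printed body could then be sent to the `_search` endpoint of the index, for example with `curl -XPOST "http://localhost:9200/foo/document/_search" -d @body.json`, assuming the index and type names from the original report.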

rlvoyer commented 11 years ago

@kimchy can you explain how storing the source alleviates the problem of distributing the term vector to all the shards for the MLT computation?

kimchy commented 11 years ago

cause with the source text to do MLT by, you don't need the term vectors.

s1monw commented 11 years ago

I agree this seems odd... isn't the TV just a different representation of a field?

rlvoyer commented 11 years ago

@kimchy @s1monw so why store the term vectors at all? (I was only storing them because of the following doc: http://www.elasticsearch.org/guide/reference/api/more-like-this/) If MLT doesn't need them when it has the source text, does it then recompute term vectors given the source text?

s1monw commented 11 years ago

I agree this should also work on TV though. yet at this point it doesn't so you might want to get rid of TV if you don't need them.

rlvoyer commented 11 years ago

@kimchy @s1monw I'd like to try to write a plugin similar to more-like-this that does exactly what I want. Can you suggest any plugins that access term vectors that I might use as references? Any tips / documentation are much appreciated.

s1monw commented 11 years ago

hey, we just added TermVector support lately. this issue is on our list to make use of the feature. Can you wait for it?

rlvoyer commented 11 years ago

@s1monw Unfortunately, my company has a rapidly narrowing window for determining whether elasticsearch is right for the problem we're trying to solve. Given that the current built-in functionality doesn't seem to handle our use-case, a plugin seems like our only option in the short-term.

Signum commented 10 years ago

Excuse me, but I'm currently trying to use the MLT feature. I read http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-more-like-this.html#search-more-like-this and either my English is completely bad or I have not the remotest idea what it is supposed to mean:

"Note: In order to use the mlt feature a mlt_field needs to be either be stored, store term_vector or source needs to be enabled."

What is "stored"? Which "source"? I've been searching the internet for two hours now and can't find any example of how to use MLT successfully. And to be honest, this issue report doesn't help me either. Could anyone shed some light on it and fix the documentation, please?

s1monw commented 10 years ago

In Elasticsearch you can either store the entire document (the JSON you send to ES when you index), a.k.a. the source, or you can mark a field as stored : true, in which case we only store the value of that particular field. By default the source is stored (or enabled), but you can also disable it via the mapping. The term_vectors don't work yet with MLT, hence this issue.

hope that helps
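To make the two working options concrete, here is a rough sketch that builds both mapping variants as plain dictionaries. It assumes the old-style mapping syntax used earlier in this thread (`"type": "string"`, `"store": "yes"`/`"no"`) and reuses the field names from that mapping; the helper names `field_mapping` and `build_mapping` are made up for illustration.

```python
import json

# Field names from the mapping posted earlier in the thread.
MLT_FIELDS = ["cs", "ks", "tpcs"]


def field_mapping(store):
    """Mapping for one keyword-analyzed string field, stored or not."""
    return {
        "type": "string",
        "analyzer": "keyword",
        "store": "yes" if store else "no",
    }


def build_mapping(source_enabled, store_fields):
    """Hypothetical helper that assembles a mapping for the 'document' type."""
    return {
        "document": {
            "_source": {"enabled": source_enabled},
            "properties": {name: field_mapping(store_fields) for name in MLT_FIELDS},
        }
    }


# Option A: keep the source enabled (the default); fields need not be stored.
option_a = build_mapping(source_enabled=True, store_fields=False)

# Option B: disable the source but store each field used for MLT.
option_b = build_mapping(source_enabled=False, store_fields=True)

print(json.dumps(option_a, indent=2))
print(json.dumps(option_b, indent=2))
```

Per the comments above, either variant should give `_mlt` something to read the text from, whereas storing only term vectors did not at the time of this discussion.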

Signum commented 10 years ago

@s1monw Thanks for the reply. So to rephrase: any field I'm using as "mlt_fields=..." needs to

Okay. In my case the documents contain two fields. Example:

{
_index: "debshots",
_type: "jdbc",
_id: "396",
_version: 35,
exists: true,
_source: {
    description: "Alarm Clock for GTK Environments",
    name: "alarm-clock"
    }
}

But when I'm GETting http://localhost:9200/debshots/jdbc/396/_mlt Elasticsearch returns zero results:

{
took: 3,
timed_out: false,
_shards: {
    total: 1,
    successful: 1,
    failed: 0
    },
hits: {
    total: 0,
    max_score: null,
    hits: [ ]
    }
}

There are many other documents with a description like "Alarm curl plugin for uWSGI" so I had assumed that at least the "Alarm" is a term that makes it "more-like-that"-style.

I'd welcome a hint what is going wrong here. Thanks.

And I would also welcome a rewrite of that quoted phrase in the documentation because it's wrong english and hard to understand. (I still didn't.)

s1monw commented 10 years ago

Can you please take this to the mailing list? This tracker is only for development issues.

thanks

Signum commented 10 years ago

@s1monw Will do. Please still consider rewriting this sentence in the documentation to make it understandable.

alexksikes commented 9 years ago

This issue is now outdated, closing.