Feature Request: pre-select terms in TermVector request

mhoffman commented 11 years ago

Dear All

I have a feature request regarding the TermVector API. I was really happy to see this commit, which covers functionality I had previously written, in fledgling form, as a plugin. Thanks @brwe! Would it be possible, though, to submit a list of terms and have the TermVector returned only for those? My hope is that this would be considerably faster than the full request. I have a use case where I know the terms for which I need the information before making the request.

Some pointers on how to do it myself would be appreciated, too, though I am afraid my solution won't be as efficient.

Best and many, many thanks for the great work. Max.

brwe commented 11 years ago

Thanks! I am glad to hear that the term vector api is useful. I want to add that @bleskes deserves at least half the credit!

There is another issue related to gathering statistics on terms (#3920). Is this similar to what you need? Could you describe your use case?

mhoffman commented 11 years ago

Thanks for the quick response. So a big shoutout to @bleskes, too: Thank you! No, in my case I am actually after the precise offsets, but typically only for a few terms (say 2-5). To describe this in general terms: I am doing distance-based scoring for things in the text. Can I send you a prototype privately?

brwe commented 11 years ago

If by privately you mean that you want to push a branch to your repo without a pull request and then discuss it here - sure! Just link the commit or branch here.

mhoffman commented 11 years ago

Thanks for taking the time. I have tried to do that in a low-brow way and the result is at User/Project@SHA: mhoffman/elasticsearch@5ed795c9306ced7da937c55fef5f8c90683266c2 .

Now, I have two problems so far:

1. There is some speed-up compared to a full termvector, but it is less than I expected.
2. If a selected term is not in the document, Elasticsearch returns a 500 with an EOFException, which is probably not desirable.

Any suggestions?

brwe commented 11 years ago

Thanks for the commit! I assumed you wanted to discuss e7c1e9e980da986, since 5ed795c9306ced7da does some percolator things, right? Let me know if I am mistaken.

About the speed-up: I could imagine the speed-up is not as great as expected because all term info is loaded from disk no matter how many terms you actually return. Disk IO probably influences the performance most. I still think the change is useful for large documents with many terms when only a few are requested. What kind of documents did you use for measuring the speed-up?

About the EOF: We should decide if a requested term should be returned with frequency 0 if it is not in the doc, or not return missing terms at all. I prepared a commit fixing the EOF for the former and added a comment for the latter, just so that you know where the changes have to be made here: 27bc8fd813ae8d342527c4f7d4d36e5b4bcaaddf

What do you think: Return the term with frequency 0, or not return it at all if it is not in the document?
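To make the two options concrete, here is a minimal sketch of both behaviours as a plain filter over an already-loaded term vector. It is not the code from the linked commits: the dictionary layout, the function name `select_terms`, and the `include_missing` flag are all assumptions made for illustration.

```python
from typing import Dict, Iterable

# Toy stand-in for one document's term vector: term -> statistics.
# This layout is an assumption for illustration, not the API's response format.
TermVector = Dict[str, dict]


def select_terms(term_vector: TermVector, requested: Iterable[str],
                 include_missing: bool) -> TermVector:
    """Keep only the requested terms from a full term vector."""
    selected = {}
    for term in requested:
        if term in term_vector:
            selected[term] = term_vector[term]
        elif include_missing:
            # Option 1: report the missing term explicitly with frequency 0.
            selected[term] = {"term_freq": 0, "offsets": []}
        # Option 2 (include_missing=False): leave the missing term out.
    return selected


doc_vector = {
    "this": {"term_freq": 2, "offsets": [0, 41]},
    "system": {"term_freq": 1, "offsets": [17]},
}
with_zero = select_terms(doc_vector, ["this", "absent"], include_missing=True)
without = select_terms(doc_vector, ["this", "absent"], include_missing=False)
```

With option 1 the caller can tell "term not in document" apart from "term not requested" without extra bookkeeping; with option 2 the response stays smaller.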

mhoffman commented 11 years ago

Thanks for the corrections! Sorry about the confusion with the commit link. I must have gotten something wrong with linking into user-specific commits. I am also a bit confused about how the 5ed795c commit you link to came to be there. I definitely didn't intend to touch any percolator stuff. Is it a coincidence that the first 7 characters match?

Anyhow, you seem to have gotten the right commit, and in my use case both solutions work equally well. Though one could argue that, given that disk IO will dominate the runtime of this request anyway, returning a 0 would simply make the result a bit more self-explanatory.

Thanks again.

brwe commented 11 years ago

Would you like to make a pull request for your changes?

mhoffman commented 11 years ago

Sorry for the delay. Why don't you go ahead, since you already have the latest version in your repo?


brwe commented 11 years ago

ok, I'll do that.

brwe commented 10 years ago

About the speedup: With the current implementation, we load the whole term vector for a document. This makes sense if you need all the terms or do not know in advance which term is requested, but it also makes the request slow. For pre-selected terms, the current implementation (see commits above) also loads the whole term vector and only keeps the terms that we are interested in. I am now wondering if this makes sense at all for pre-selected terms. We do not need the full term vector and could also get all the information from the DocsEnum of the terms - this might be quicker than what we do right now. Summoning @s1monw here because I am unsure if it really is quicker in this case.
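The difference between the two access paths can be illustrated with toy data structures. This is only a sketch: plain Python dictionaries stand in for the per-document term vector and for the per-term postings, none of it is Lucene's actual API, and it says nothing about which path is faster in practice.

```python
from typing import Dict, List, Sequence

# Per-document term vectors: doc_id -> term -> start offsets (toy data).
term_vectors: Dict[int, Dict[str, List[int]]] = {
    1: {"this": [0, 41], "system": [17], "works": [24]},
    2: {"system": [3]},
}

# Per-term postings built from the same data: term -> doc_id -> start offsets.
postings: Dict[str, Dict[int, List[int]]] = {}
for doc_id, vector in term_vectors.items():
    for term, offsets in vector.items():
        postings.setdefault(term, {})[doc_id] = offsets


def offsets_via_term_vector(doc_id: int, terms: Sequence[str]) -> Dict[str, List[int]]:
    """Load the document's full term vector, then discard unwanted terms."""
    full = term_vectors[doc_id]  # every term of the document is touched here
    return {t: full[t] for t in terms if t in full}


def offsets_via_postings(doc_id: int, terms: Sequence[str]) -> Dict[str, List[int]]:
    """Look up only the requested terms and pick out the one document."""
    result = {}
    for t in terms:
        docs = postings.get(t, {})
        if doc_id in docs:
            result[t] = docs[doc_id]
    return result


a = offsets_via_term_vector(1, ["this", "absent"])
b = offsets_via_postings(1, ["this", "absent"])
```

Both functions return the same offsets; they differ only in how much data has to be visited to get there, which is the trade-off in question.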

Also, please take a look at pr #4161. It implements access to term vectors in a script. I implemented it wrong there as well (loading term vectors and getting the information from there instead of from the standard DocsEnum), but I can fix that. If the script term vector access gives you all you need, then maybe you do not even need the pre-selected term vectors in the _termvector api anymore?

Also, sorry for the delay.

brwe commented 10 years ago

I just pushed the script api for term statistics (pr #4161). Could you check if this allows you to do all you need?

mhoffman commented 10 years ago

Hi

Sorry for the slow reply. The function_score query seems to work as the docs promise (at least using 'mvel' and 'native'). Though my actual problem, which made me look for term vectors etc., is still too complex for this query type: I would need to access the term vector of the parent of the object that I am scoring with the script. I guess this notion was expressed before in #1071 and #761, but it is apparently easier said than done.

I think this could still be a nice feature, and since parent and child objects follow the same routing, it shouldn't be all too complicated. Any comments?


brwe commented 10 years ago

That would call for a different issue.

Just to be sure: Do you need both parent and child statistics in the same script? Can you elaborate a little on what exactly you need, maybe with a small example?

mhoffman commented 10 years ago

Just played with the script_score feature and I think this allows a really nice development workflow. Sweet!

My parent/child use case is the following (I hope this makes sense). In my application I store my data with parents and children, and each parent has many children (~100). The children are very small in size (4 fields, short strings each) but the parent has one large field (~1e6 characters). So from an index size point of view it makes sense to keep the big field with the parent once instead of with each of the many children.

Then the 'functions' part of the 'function_score' request could look like below, and it is run on the children doc_type but the '_parent' refers to the respective parent. I think this could make sense since (if I understand correctly) parent and child are always routed to the same node. I would like to have a look into this, but it might take me a couple of days.

    'functions' : [{
        'script_score': {
            "params" : {
                'pos': 520,
                'terms': ['this','system'],
            },
            'script': """
                score = 0;
                for (term: terms) {
                    offsets = _parent['body'].get(term, _OFFSETS | _CACHE);
                    for (offset: offsets){
                        score = score + exp(-(pos - offset.startOffset)*2 \ (-.0001));
                    }
                }
                score
            """,
        }
    }],
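The scoring idea in the script above can be tried outside Elasticsearch with plain Python. This is a sketch under stated assumptions: the exponent in the script did not survive formatting, so the Gaussian-style decay `exp(-decay * distance**2)` and the default `decay` of 1e-4 are guesses at the intent, and the `offsets_by_term` dictionary is a hypothetical stand-in for the parent's term offsets.

```python
import math
from typing import Dict, List, Sequence


def proximity_score(offsets_by_term: Dict[str, List[int]],
                    terms: Sequence[str],
                    pos: int,
                    decay: float = 1e-4) -> float:
    """Sum a distance-based weight over every occurrence of every term.

    Occurrences close to `pos` contribute almost 1, distant ones almost 0.
    The decay shape is an assumption, not taken from the original script.
    """
    score = 0.0
    for term in terms:
        # Terms that do not occur in the document contribute nothing.
        for start_offset in offsets_by_term.get(term, []):
            distance = pos - start_offset
            score += math.exp(-decay * distance ** 2)
    return score


# Hypothetical start offsets of two terms in the parent's 'body' field.
parent_offsets = {"this": [520], "system": [420]}
score = proximity_score(parent_offsets, ["this", "system"], pos=520)
```

Here "this" sits exactly at `pos` and contributes 1, while "system" is 100 characters away and contributes `exp(-1)`.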


brwe commented 10 years ago

Hi,

sorry for the late reply. Can you check if the has_parent query together with the function score would solve the problem?
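A request body combining the two might be shaped roughly like this. It is only a sketch: the parent type name `parent_doc`, the `body` field, and the script placeholder are hypothetical, and the name of the has_parent scoring option (`score_type` here) is an assumption that differs between Elasticsearch versions.

```python
import json
from typing import List


def build_has_parent_request(parent_type: str, terms: List[str], script: str) -> dict:
    """Score parents with a script and hand that score to their children."""
    return {
        "query": {
            "has_parent": {
                "parent_type": parent_type,
                # Assumed option name; check the docs for your version.
                "score_type": "score",
                "query": {
                    "function_score": {
                        # Only parents containing one of the terms are scored.
                        "query": {"terms": {"body": terms}},
                        "functions": [{
                            "script_score": {
                                "params": {"terms": terms},
                                "script": script,
                            }
                        }],
                    }
                },
            }
        }
    }


request = build_has_parent_request(
    parent_type="parent_doc",          # hypothetical type name
    terms=["this", "system"],
    script="<scoring script here>",    # placeholder, not a working script
)
payload = json.dumps(request)
```

In this arrangement the script runs against the parent document only, so values that differ per child (such as `pos` in the earlier example) are not available to it.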

Using several documents within a script for computing the score is not supported yet and would need a new issue. Let me know if I can close this one. Thanks!

mhoffman commented 10 years ago

Looks good to me.

elastic / elasticsearch

Feature Request: pre-select terms in TermVector request #3924