Closed dohykim closed 7 years ago
Can you provide an example of what is not in the order you expect and why, and preferably what you would expect it to be?
I use Elasticsearch to provide full-text search on office documents. Users upload their office documents (Excel, PDF, etc.) and search with multiple keywords combined with the AND operator. I expected fragments containing more distinct highlighted terms to get a better score, i.e. a fragment-scoring version of query coordination.
The problem arises when executing a multi-keyword search that combines a high-TF term with a low-TF term. For example, I search the linked file with the keywords "test acceptable".
I tested on ES 1.5.2, but I see that the use of SimpleFragListBuilder has not changed in the current version.
Query :
POST english/typeMain/_search
{
  "size": 1,
  "fields": [ "highlight", "fileName" ],
  "query": {
    "match": {
      "contents": {
        "query": "test acceptable",
        "operator": "and"
      }
    }
  },
  "highlight": {
    "order": "score",
    "fields": {
      "contents": {
        "fragment_size": 128,
        "number_of_fragments": 1000
      }
    }
  }
}
Result, original ordering:
"hits": { "total": 16, "max_score": 0.3831197, "hits": [ { "_index": "english", "_type": "typeMain", "_id": "12419", "_score": 0.3831197, "highlight": { "contents": [ "OPIc rubrics and test items, OPIc criteria, OPIc test preparation, OPIc tests in use, and English speaking tests. It was found that", "Bradshaw, J. “Test-takers reactions to a placement test.” Language Testing, 7 (1990): 13-30. Brown, A. “The role of test-taker feedback", "from the test institution. (Chung-Ang University) Key Words: ACTFL proficiency guidelines, oral proficiency testing, test-takers ", "of Foreign Languages Testing Committee - Korea (ATCK). It is now considered by test administrators and test takers that a good OPIc", "overviews the OPIc testing procedure, he or she performs tasks in an actual test. During this process, the test-taker can listen to", "about the test (Kenyon & Malabonga, 2001). Last, the computer technology facilitates test results and ratings. Once OPIc test-taking", "considered under real test conditions (Bachman, 1990). Test validation had been left to professionals concerned with test development and",
"grown regarding how examinees perceive tests, including whether a given test is acceptable to its users or not (Davies et al., 1999)",
"several considerations as perceived by test-takers. First, test-takers perceptions include test validity. It has been suggested that", "difficulty are now accepted by test-developers when undertaking task revision (Alderson, 1998). Second, test-takers reactions are", "Examinees perceptions may assist in test revision and overall acceptability as well as test validity from their perspectives. III", "response time. In the following parts, test-takers perceptions on OPIc test format, preparation and test results in use will be presented", "differences between test-takers attitudes with regard to test preparation and the intent of OPIc. Table 7. OPIc Test Preparation Statement", "8 40.\u001f\u001fAn OPIc test can be a substitute for \u001fthe TOEIC test. 3.29 1.06 47.6 22.0 4.3 English Speaking Tests Table 9 shows general", "discussed OPIc test-takers perceptions regarding ACTFL and ACTFL proficiency guidelines, OPIc test format, test preparation, and", "Korea, English speaking tests are frequently used, although test-specific strategies are used mainly to raise test scores rather than", " Hence there is a major gap between test preparation and the intent of the test itself. Test-takers should be provided with sufficient", "develop proficiency testing materials, to use oral proficiency test scores appropriately, and to teach to the test. As noted in Table",
Then I switched SimpleFragListBuilder to WeightedFragListBuilder and rebuilt (org.elasticsearch.search.highlight.FastVectorHighlighter.java):
` if (field.fieldOptions().numberOfFragments() == 0) {
    fragListBuilder = new SingleFragListBuilder();
    if (!forceSource && mapper.fieldType().stored()) {
        fragmentsBuilder = new SimpleFragmentsBuilder(mapper, field.fieldOptions().preTags(), field.fieldOptions().postTags(), boundaryScanner);
    } else {
        fragmentsBuilder = new SourceSimpleFragmentsBuilder(mapper, context, hitContext, field.fieldOptions().preTags(), field.fieldOptions().postTags(), boundaryScanner);
    }
} else {
    // Changed: WeightedFragListBuilder is used here instead of SimpleFragListBuilder.
    fragListBuilder = field.fieldOptions().fragmentOffset() == -1 ? new WeightedFragListBuilder() : new WeightedFragListBuilder(field.fieldOptions().fragmentOffset());
    if (field.fieldOptions().scoreOrdered()) {
        if (!forceSource && mapper.fieldType().stored()) {
            fragmentsBuilder = new ScoreOrderFragmentsBuilder(field.fieldOptions().preTags(), field.fieldOptions().postTags(), boundaryScanner);
        } else {
            fragmentsBuilder = new SourceScoreOrderFragmentsBuilder(mapper, context, hitContext, field.fieldOptions().preTags(), field.fieldOptions().postTags(), boundaryScanner);
        }
    } else {
        if (!forceSource && mapper.fieldType().stored()) {
            fragmentsBuilder = new SimpleFragmentsBuilder(mapper, field.fieldOptions().preTags(), field.fieldOptions().postTags(), boundaryScanner);
        } else {
            fragmentsBuilder = new SourceSimpleFragmentsBuilder(mapper, context, hitContext, field.fieldOptions().preTags(), field.fieldOptions().postTags(), boundaryScanner);
        }
    }
} `
Result : "hits": { "total": 16, "max_score": 0.3831197, "hits": [ { "_index": "english", "_type": "typeMain", "_id": "12419", "_score": 0.3831197, "highlight": { "contents": [ "grown regarding how examinees perceive tests, including whether a given test is acceptable to its users or not (Davies et al., 1999)", "difficulty are now accepted by test-developers when undertaking task revision (Alderson, 1998). Second, test-takers reactions are", "Examinees perceptions may assist in test revision and overall acceptability as well as test validity from their perspectives. III", "OPIc rubrics and test items, OPIc criteria, OPIc test preparation, OPIc tests in use, and English speaking tests. It was found that", "Bradshaw, J. “Test-takers reactions to a placement test.” Language Testing, 7 (1990): 13-30. Brown, A. “The role of test-taker feedback", "from the test institution. (Chung-Ang University) Key Words: ACTFL proficiency guidelines, oral proficiency testing, test-takers ", "of Foreign Languages Testing Committee - Korea (ATCK). It is now considered by test administrators and test takers that a good OPIc", "overviews the OPIc testing procedure, he or she performs tasks in an actual test. During this process, the test-taker can listen to", "about the test (Kenyon & Malabonga, 2001). Last, the computer technology facilitates test results and ratings. Once OPIc test-taking", "considered under real test conditions (Bachman, 1990). Test validation had been left to professionals concerned with test development and", "several considerations as perceived by test-takers. First, test-takers perceptions include test validity. It has been suggested that", "response time. In the following parts, test-takers perceptions on OPIc test format, preparation and test results in use will be presented", "differences between test-takers attitudes with regard to test preparation and the intent of OPIc. Table 7. OPIc Test Preparation Statement", "8 40.\u001f\u001fAn OPIc test can be a substitute for \u001fthe TOEIC test. 3.29 1.06 47.6 22.0 4.3 English Speaking Tests Table 9 shows general", "discussed OPIc test-takers perceptions regarding ACTFL and ACTFL proficiency guidelines, OPIc test format, test preparation, and", "Korea, English speaking tests are frequently used, although test-specific strategies are used mainly to raise test scores rather than", " Hence there is a major gap between test preparation and the intent of the test itself. Test-takers should be provided with sufficient", "develop proficiency testing materials, to use oral proficiency test scores appropriately, and to teach to the test. As noted in Table", "informed of appropriate and ethical test content and test procedures. This study has shown how Korean test-takers perceive ACTFL and OPIc", "feedback in the test development process: test-takers reactions to a tape-mediated test of proficiency in spoken Japanese.” Language", "preparation and test results currently in use. The source of data for this research is a questionnaire filled in by test-takers. Results", "commercialized tests such as Oral Proficiency Interview-computer (OPIc), TOEIC Speaking and TOEFL-IBT. As in other tests of speaking", "speaking assessment, the number of OPIc test-takers has been increasing every year since the test was first administrated in 2007 by American",
Tested on 6.0 with unified highlighting: the unified highlighter does not return first the passage that matches the most distinct terms, including the rarest (highest-IDF) term:
GET t/_search
{
"size": 1,
"_source": false,
"query": {
"match": {
"text": {
"query": "test acceptable",
"operator": "and"
}
}
},
"highlight": {
"type": "unified",
"order": "score",
"fields": {
"text": {
"fragment_size": 128,
"number_of_fragments": 1000
}
}
}
}
Returns:
When using the plain mode of the unified highlighter, the IDF of the terms is missing. When using the unified_postings mode of the highlighter, the IDF is taken into account and the snippet with the term "acceptable" is ranked first. I'll open a ticket in Lucene since it should be feasible to retrieve the IDF of the terms even if we're using the plain highlighter (as long as the field is indexed).
And just to be clear, the unified highlighter treats each document as a corpus and each passage as a document. The IDF in this context is derived from the number of times the term appears in the document, and the TF is the number of times the term appears inside the passage that we're trying to score.
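The passage scoring described above can be sketched as follows. This is a simplified BM25-style illustration, not Lucene's actual code: the class name, the constants K1, B and PIVOT, and the term counts in main are all assumptions chosen for the example.

```java
import java.util.Map;

public class PassageScoreSketch {
    // Assumed BM25-style constants; not taken from this thread.
    static final double K1 = 1.2, B = 0.75, PIVOT = 87.0;

    // IDF-like weight: the document is treated as a corpus of passages,
    // so a term that occurs rarely in the document gets a larger weight.
    static double weight(int contentLength, int totalTermFreq) {
        double numPassages = 1 + contentLength / PIVOT; // rough passage count
        return (K1 + 1) * Math.log(1 + (numPassages + 0.5) / (totalTermFreq + 0.5));
    }

    // TF part: saturating function of the term frequency inside one passage,
    // normalized by the passage length.
    static double tf(int freq, int passageLength) {
        double norm = K1 * ((1 - B) + B * (passageLength / PIVOT));
        return freq / (freq + norm);
    }

    // passageTf: term -> occurrences in the passage
    // docTf:     term -> occurrences in the whole document
    static double score(Map<String, Integer> passageTf, Map<String, Integer> docTf,
                        int passageLength, int contentLength) {
        double total = 0;
        for (Map.Entry<String, Integer> e : passageTf.entrySet()) {
            total += weight(contentLength, docTf.get(e.getKey()))
                   * tf(e.getValue(), passageLength);
        }
        return total;
    }

    public static void main(String[] args) {
        // Hypothetical counts: "test" is frequent in the document, "acceptable" is rare.
        Map<String, Integer> docTf = Map.of("test", 200, "acceptable", 1);
        double both = score(Map.of("test", 1, "acceptable", 1), docTf, 128, 40000);
        double onlyTest = score(Map.of("test", 3), docTf, 128, 40000);
        System.out.println(both > onlyTest); // prints "true"
    }
}
```

Under these assumptions, a passage that contains the rare term once outranks a passage that repeats the frequent term three times, which is the ordering asked for in this issue.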
I just learned that there is a new unified highlighter. What I used is fvh, and I understand fvh to be one mode of the unified highlighter.
I want to say one more thing about the fvh highlighter. It concerns the Lucene method [getBestFragments], which ES calls in FastVectorHighlighter.java at lines 143 and 146. This method produces the highlight result for a single field, and it seems to load the document's term vector from the segment files on storage each time. The term vector is therefore loaded [number of highlight fields] times, which is very slow when the request has many highlight fields, and even worse when the document is large. It does not even matter whether the field matched (I guess the fetch phase is independent of the query phase). I think this is a special use case: I set many highlight fields because our business service has to support a multilingual environment.
I suffered a performance issue because of this. I have suggested this before, but I was told to "go to Lucene". I think this issue also concerns ES. If you improve the highlighters this time, this inefficiency should be taken into account. I solved the issue in my service by loading the term vector once in the ES code and changing Lucene's [getBestFragments] method to receive the term vector object, so there is just one term vector load per document. I also created an issue in the Lucene JIRA: https://issues.apache.org/jira/browse/LUCENE-7397?filter=-2
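The "one term vector load per document" idea can be illustrated with a small cache. This is only a sketch under stated assumptions: PerDocCache, the loader function, and the field names are hypothetical and stand in for the real term vector lookup; it does not use the Lucene or ES APIs.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.IntFunction;

public class TermVectorCacheSketch {
    /** Wraps a (hypothetical) per-document loader so each document is loaded at most once. */
    static final class PerDocCache<V> {
        private final IntFunction<V> loader;
        private final Map<Integer, V> cache = new HashMap<>();
        int loads = 0; // counts real loads, for illustration only

        PerDocCache(IntFunction<V> loader) {
            this.loader = loader;
        }

        V get(int docId) {
            // The loader runs only on the first request for a given docId.
            return cache.computeIfAbsent(docId, id -> {
                loads++;
                return loader.apply(id);
            });
        }
    }

    public static void main(String[] args) {
        // Stand-in for an expensive term vector read from storage.
        PerDocCache<String> vectors = new PerDocCache<>(docId -> "term vectors of doc " + docId);

        // Hypothetical highlight fields for a multilingual index.
        for (String field : List.of("contents_en", "contents_ko", "contents_ja")) {
            String termVectors = vectors.get(12419);
            // A per-field highlighter would consume termVectors for this field here.
        }
        System.out.println(vectors.loads); // prints "1": one load for three fields
    }
}
```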
Closing this, the unified highlighter returns the expected snippet.
When highlighting with "order" : "score", the returned fragments do not match what I expected. In recent Lucene versions, WeightedFieldFragList was added and it seems to work better. I think WeightedFieldFragList has a better chance of returning fragments with varied terms: it considers distinct term weights and a terms-per-fragment norm. Reordering the fragments in my application is hard and inefficient because of the limit on the number of fragments.
In package org.elasticsearch.search.highlight, lines 117 to 122, could WeightedFieldFragList be used rather than SimpleFieldFragList? Or could ES have an additional highlight order option that uses WeightedFieldFragList?
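The difference between the two scoring styles can be sketched as follows. This is an illustration of the idea (distinct term weights plus a terms-per-fragment norm), not the Lucene implementation: the class, both methods, and the term weights are assumptions made for the example.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class FragmentScoreSketch {
    // Simple style: every matched occurrence contributes the same boost,
    // so repeating a common term keeps raising the score.
    static double simpleScore(List<String> hits) {
        return hits.size() * 1.0;
    }

    // Weighted style: each distinct term contributes its weight once, then the
    // sum is scaled by length * (1 / sqrt(length)) = sqrt(length), where length
    // is the number of matched terms in the fragment.
    static double weightedScore(List<String> hits, Map<String, Double> termWeight) {
        if (hits.isEmpty()) {
            return 0.0;
        }
        Set<String> seen = new HashSet<>();
        double total = 0.0;
        for (String term : hits) {
            if (seen.add(term)) {
                total += termWeight.get(term);
            }
        }
        return total * Math.sqrt(hits.size());
    }

    public static void main(String[] args) {
        // Hypothetical weights: "test" is common, "acceptable" is rare.
        Map<String, Double> w = Map.of("test", 0.1, "acceptable", 2.0);
        List<String> repeated = List.of("test", "test", "test");
        List<String> distinct = List.of("test", "acceptable");

        System.out.println(simpleScore(repeated) > simpleScore(distinct));         // true
        System.out.println(weightedScore(distinct, w) > weightedScore(repeated, w)); // true
    }
}
```

With these assumed weights, the simple style prefers the fragment that repeats "test", while the weighted style prefers the fragment that contains both query terms.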