blevesearch / bleve

A modern text/numeric/geo-spatial/vector indexing library for go
Apache License 2.0

"slice bounds out of range error" when using cjk analyzer and highlight #178

Closed wangbin closed 9 years ago

wangbin commented 9 years ago

Hi, I'm learning bleve; the following is a small program modified from the wiki:

package main

import (
    "fmt"

    "github.com/blevesearch/bleve"
)

func main() {
    // open a new index
    mapping := bleve.NewIndexMapping()
    mapping.DefaultAnalyzer = "cjk"
    index, err := bleve.New("example.bleve", mapping)
    if err != nil {
        fmt.Println(err)
        return
    }

    data := struct {
        Name string
    }{
        Name: "交换机",
    }

    // index some data
    if err := index.Index("id", data); err != nil {
        fmt.Println(err)
        return
    }

    // search for some text
    query := bleve.NewMatchQuery("交换机")
    search := bleve.NewSearchRequest(query)
    search.Highlight = bleve.NewHighlight()
    searchResults, err := index.Search(search)
    if err != nil {
        fmt.Println(err)
        return
    }
    fmt.Println(searchResults)
}

When I run the code, I get a panic:

panic: runtime error: slice bounds out of range

goroutine 1 [running]:
github.com/blevesearch/bleve/search/highlight/fragmenters/simple.(*Fragmenter).Fragment(0xc20800adc0, 0xc022a5, 0x9, 0x7ffffe1f, 0xc20800be50, 0x2, 0x2, 0x0, 0x0, 0x0)
        /Users/wangbin/mygo/src/github.com/blevesearch/bleve/search/highlight/fragmenters/simple/fragmenter_simple.go:82 +0x8f0
github.com/blevesearch/bleve/search/highlight/highlighters/simple.(*Highlighter).BestFragmentsInField(0xc20803b260, 0xc20802b200, 0xc20802b2c0, 0x4922e0, 0x4, 0x1, 0x0, 0x0, 0x0)
        /Users/wangbin/mygo/src/github.com/blevesearch/bleve/search/highlight/highlighters/simple/highlighter_simple.go:84 +0x403
github.com/blevesearch/bleve.(*indexImpl).Search(0xc2080116c0, 0xc20802eff0, 0x0, 0x0, 0x0)
        /Users/wangbin/mygo/src/github.com/blevesearch/bleve/index_impl.go:447 +0x13a0
main.main()
        /Users/wangbin/tmp/bl/bl2.go:32 +0x34a

goroutine 5 [chan receive]:
github.com/blevesearch/bleve/index/upside_down.AnalysisWorker(0xc208030120)
        /Users/wangbin/mygo/src/github.com/blevesearch/bleve/index/upside_down/analysis_pool.go:40 +0x73
created by github.com/blevesearch/bleve/index/upside_down.NewAnalysisQueue
        /Users/wangbin/mygo/src/github.com/blevesearch/bleve/index/upside_down/analysis_pool.go:32 +0x66

goroutine 6 [chan receive]:
github.com/blevesearch/bleve/index/upside_down.AnalysisWorker(0xc208030120)
        /Users/wangbin/mygo/src/github.com/blevesearch/bleve/index/upside_down/analysis_pool.go:40 +0x73
created by github.com/blevesearch/bleve/index/upside_down.NewAnalysisQueue
        /Users/wangbin/mygo/src/github.com/blevesearch/bleve/index/upside_down/analysis_pool.go:32 +0x66

goroutine 7 [chan receive]:
github.com/blevesearch/bleve/index/upside_down.AnalysisWorker(0xc208030120)
        /Users/wangbin/mygo/src/github.com/blevesearch/bleve/index/upside_down/analysis_pool.go:40 +0x73
created by github.com/blevesearch/bleve/index/upside_down.NewAnalysisQueue
        /Users/wangbin/mygo/src/github.com/blevesearch/bleve/index/upside_down/analysis_pool.go:32 +0x66

goroutine 8 [chan receive]:
github.com/blevesearch/bleve/index/upside_down.AnalysisWorker(0xc208030120)
        /Users/wangbin/mygo/src/github.com/blevesearch/bleve/index/upside_down/analysis_pool.go:40 +0x73
created by github.com/blevesearch/bleve/index/upside_down.NewAnalysisQueue
        /Users/wangbin/mygo/src/github.com/blevesearch/bleve/index/upside_down/analysis_pool.go:32 +0x66
exit status 2

I did some research; in the Fragment function, I got two TermLocations:

Term: 交换, start: 65, end: 71
Term: 换机, start: 68, end: 74

The first term's End is larger than the second term's Start, which causes the problem at line 82. My guess is that the code should check whether maxbegin is bigger than start to avoid this problem.
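For illustration, here is a small standalone sketch (not bleve's actual fragmenter code, and using made-up byte offsets for a lone "交换机" field) showing why the CJK bigram tokens overlap, and how slicing from the previous token's End up to the next token's Start panics unless the lower bound is clamped:

package main

import "fmt"

// tokenLoc is a simplified stand-in for a highlighter term location.
type tokenLoc struct {
	Term       string
	Start, End int // byte offsets into the original text
}

func main() {
	text := "交换机" // 3 CJK runes, 9 bytes in UTF-8

	// A CJK bigram analyzer emits overlapping two-rune tokens:
	//   交换 covers bytes [0, 6)
	//   换机 covers bytes [3, 9)
	locs := []tokenLoc{
		{"交换", 0, 6},
		{"换机", 3, 9},
	}

	// Naive fragmenting that assumes tokens never overlap: it slices
	// the un-highlighted gap from the previous token's End up to the
	// next token's Start.
	prevEnd := 0
	for _, l := range locs {
		// Without this guard, l.Start (3) is less than prevEnd (6) for
		// the second token, and text[prevEnd:l.Start] panics with
		// "slice bounds out of range".
		if l.Start < prevEnd {
			l.Start = prevEnd // clamp so the slice stays valid
		}
		fmt.Printf("gap %q, highlighted %q\n", text[prevEnd:l.Start], l.Term)
		prevEnd = l.End
	}
}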

mschoch commented 9 years ago

Thanks for reporting this.

The problem is that the code is trying to avoid having two highlighted sections overlap. Unfortunately, when you use analyzers that compute n-grams, the tokens overlap by design.

I'm reworking the highlighters to still avoid overlapping fragments in general, but allow for them when the tokens themselves overlap.
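One way to read that, as a rough sketch rather than the actual bleve change: overlapping term locations can be merged into non-overlapping highlight ranges before any slicing happens, so n-gram tokens that share bytes end up in a single highlighted span:

package main

import (
	"fmt"
	"sort"
)

// span is a half-open byte range [Start, End) to be highlighted.
type span struct{ Start, End int }

// mergeOverlapping collapses overlapping or touching spans into
// non-overlapping ones, so later slicing never sees Start < prevEnd.
func mergeOverlapping(spans []span) []span {
	if len(spans) == 0 {
		return nil
	}
	sort.Slice(spans, func(i, j int) bool { return spans[i].Start < spans[j].Start })
	merged := []span{spans[0]}
	for _, s := range spans[1:] {
		last := &merged[len(merged)-1]
		if s.Start <= last.End {
			// Overlapping (or adjacent) with the previous span: extend it.
			if s.End > last.End {
				last.End = s.End
			}
		} else {
			merged = append(merged, s)
		}
	}
	return merged
}

func main() {
	// The two overlapping CJK bigrams from the report: 交换 [0,6) and 换机 [3,9).
	spans := []span{{0, 6}, {3, 9}}
	fmt.Println(mergeOverlapping(spans)) // [{0 9}] -- one highlight covering 交换机
}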