blevesearch / bleve

A modern text/numeric/geo-spatial/vector indexing library for go
Apache License 2.0

How to improve indexing speed? Batch mode never finishes! #831

Open hetao29 opened 6 years ago

hetao29 commented 6 years ago

Hi, I tested this on Linux and Mac. Indexing is very slow when using Index(docid, doc), and batch mode never finishes. The test source and data are in test.tar.gz:

package main

import (
    "encoding/json"
    "fmt"
    "github.com/blevesearch/bleve"
    "io/ioutil"
    "log"
    "strconv"
    "time"
)

type Data struct {
    Name string
    Id   int
    T    string
}
type Doc struct {
    Id  string      `json:"id"`
    Doc interface{} `json:"doc"`
}

func main() {
    log.SetFlags(log.LstdFlags | log.Lshortfile)
    mapping := bleve.NewIndexMapping()

    index, err := bleve.New("example.bleve", mapping)
    if err != nil {
        index, err = bleve.Open("example.bleve")
        if err != nil {
            fmt.Println(err)
            return
        }

    }
    data, err := ioutil.ReadFile("test.json.data")
    if err != nil {
        fmt.Println(err)
        return
    }
    var docs []Doc
    err = json.Unmarshal(data, &docs)
    if err != nil {
        fmt.Println(err)
        return
    }
    //use index
    /*
    var start = time.Now().Unix()
    log.Println("Start :" + strconv.FormatInt(start, 10))
    end := time.Now().Unix()
    log.Println("End1:" + strconv.FormatInt(end, 10) + ",duration:" + strconv.FormatInt(end-start, 10))
    for _, doc := range docs {
        index.Index(doc.Id, doc.Doc)
    }
    end = time.Now().Unix()
    log.Println("End1:" + strconv.FormatInt(end, 10) + ",duration:" + strconv.FormatInt(end-start, 10))
    */

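    // Use a batch. start and end are declared here because the block
    // above that would otherwise declare them is commented out.
    start := time.Now().Unix()
    log.Println("Start with batch:" + strconv.FormatInt(start, 10))
    batch := index.NewBatch()
    end := time.Now().Unix()
    log.Println("End2:" + strconv.FormatInt(end, 10) + ",duration:" + strconv.FormatInt(end-start, 10))
    for _, doc := range docs {
        // Queue the document in the batch; nothing is written to the index yet.
        if err := batch.Index(doc.Id, doc.Doc); err != nil {
            log.Println(err)
        }
    }
    end = time.Now().Unix()
    log.Println("End2:" + strconv.FormatInt(end, 10) + ",duration:" + strconv.FormatInt(end-start, 10))
    // Execute the whole batch against the index in one call.
    if err := index.Batch(batch); err != nil {
        log.Println(err)
    }
    end = time.Now().Unix()
    log.Println("End2:" + strconv.FormatInt(end, 10) + ",duration:" + strconv.FormatInt(end-start, 10))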
}

mschoch commented 6 years ago

It appears that your documents have many fields. In the one example I looked at closely it had over 100 fields. Indexing all of these fields will take additional time. If you don't need to search on all of these fields, it is recommended that you create a custom mapping, and only index the fields you plan to search.

Second, you appear to have many numeric fields. In bleve today numeric fields are very expensive to index, as they are optimized for later doing numeric range searches. But this optimization means that a numeric field can take up to 16x the space of a text field with a single term. This is something we hope to improve in the future, but for now it means you have to be very selective about including numeric fields. Having lots of numeric fields means the index will be quite large (and consequently slow).

Finally, boltdb is the default storage for bleve because it is easy to use and to get started with. But it is not the best choice for indexing performance. Choosing one of the alternate key/value stores can offer significantly better indexing performance (usually with some trade-off on search performance).

One more thing, using batches is recommended, but I would be careful about doing the entire workload in a single batch. Typically choosing a batch size of say 100 or 1000 documents works best to efficiently get work done and make incremental progress.
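
The chunking itself can be sketched with the standard library alone. indexInChunks and its flush callback are hypothetical names. With bleve, the callback would create a batch with index.NewBatch(), call batch.Index for each document in the chunk, and then call index.Batch(batch), as in the program above:

```go
package main

import (
	"errors"
	"fmt"
)

// Doc mirrors the Doc struct from the program in the question.
type Doc struct {
	Id  string
	Doc interface{}
}

// indexInChunks calls flush once per consecutive chunk of at most size
// documents, stopping at the first error. Each call to flush is meant to
// build and execute one index batch.
func indexInChunks(docs []Doc, size int, flush func(chunk []Doc) error) error {
	if size <= 0 {
		return errors.New("chunk size must be positive")
	}
	for start := 0; start < len(docs); start += size {
		end := start + size
		if end > len(docs) {
			end = len(docs)
		}
		if err := flush(docs[start:end]); err != nil {
			return fmt.Errorf("chunk starting at document %d: %w", start, err)
		}
	}
	return nil
}
```

Because every chunk is committed separately, a run that is interrupted keeps the chunks already written, and logging after each flush shows whether indexing is progressing or stalled.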

hetao29 commented 6 years ago

Thank you very much. I'll try it later.

gadelkareem commented 6 years ago

Any docs on the storage types?

MakDon commented 2 years ago

In bleve today numeric fields are very expensive to index, as they are optimized for later doing numeric range searches.

Hi @mschoch, thanks for your suggestion. As numeric fields are optimized for later doing numeric range searches, if I would never do range searches on the numeric field, is there a way to skip these optimizations for range searches to improve indexing speed? Thanks very much.

iredmail commented 2 years ago

As numeric fields are optimized for later doing numeric range searches, if I would never do range searches on the numeric field, is there a way to skip these optimizations for range searches to improve indexing speed?

How about creating your own mapping (instead of using the default one) and defining the field as a string?
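
One related approach, sketched here with the standard library only, is to convert the numeric values to strings before the document is handed to the index, so they arrive as text. This assumes the documents were decoded with encoding/json into interface{} values, as in the program above, where JSON numbers become float64. stringifyNumbers is a hypothetical helper, and whether a text field mapping on its own is applied to JSON numbers is worth verifying against the bleve version in use:

```go
package main

import "strconv"

// stringifyNumbers returns a copy of v in which every float64 (the type
// encoding/json produces for JSON numbers) is replaced by its decimal
// string form. Nested maps and slices are converted recursively; all other
// values are returned unchanged.
func stringifyNumbers(v interface{}) interface{} {
	switch t := v.(type) {
	case float64:
		// -1 precision yields the shortest representation that round-trips.
		return strconv.FormatFloat(t, 'f', -1, 64)
	case map[string]interface{}:
		out := make(map[string]interface{}, len(t))
		for k, val := range t {
			out[k] = stringifyNumbers(val)
		}
		return out
	case []interface{}:
		out := make([]interface{}, len(t))
		for i, val := range t {
			out[i] = stringifyNumbers(val)
		}
		return out
	default:
		return v
	}
}
```

In the original loop this would read index.Index(doc.Id, stringifyNumbers(doc.Doc)). The trade-off is that numeric range queries on those fields no longer work; the values can only be matched as text.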

MakDon commented 2 years ago

How about creating your own mapping (instead of using the default one) and defining the field as a string?

That seems like a good idea. I'll give it a try. Thanks very much.