NeowayLabs / neosearch

Full Text Search Library

OOM crash when indexing #37

Open i4ki opened 9 years ago

i4ki commented 9 years ago

The current implementation of the MergeSet operation has a known bug that causes an out-of-memory crash.

https://github.com/NeowayLabs/neosearch/blob/master/lib/neosearch/store/store_leveldb.go#L143

The LevelDB driver throws an OOM when trying to GET a key whose value is bigger than the available memory. It's a very common problem because NeoSearch doesn't support any stopword analyzer yet, so the reverse index grows very fast for very frequent words like "a", "the", "has", "that", etc.

One idea to solve the problem is to split the keys into fixed-size segments (e.g. 1GB), but then we have to discuss how we will manage those segments.

Any ideas?

rzanato commented 9 years ago

The current implementation requires all keys to be loaded into one contiguous memory array. This simple approach is acceptable but not desirable. Splitting the key is the only solution I can think of at the moment.

Any method to split keys will create an unbalanced distribution, requiring some kind of database maintenance or expensive delete/insert operations.

The simplest implementation is a sequence of segments with a maximum size, like 1GB. When reading, if one segment has exactly the maximum size, the next segment must also be read, using a sequential key reference name.

An interface should be created to abstract the key-splitting approach, allowing future improvements.
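For illustration, a minimal sketch of what such an abstraction could look like (all names here are hypothetical, just to make the idea concrete):

type SegmentedStore interface {
    // MergeSet appends id to the value stored under key, creating a new
    // segment when the current one reaches the configured maximum size.
    MergeSet(key []byte, id uint64) error

    // Get streams the value back segment by segment, so callers never
    // need the whole (possibly multi-GB) value in memory at once.
    Get(key []byte, visit func(segment []byte) error) error

    // Delete removes a single id from whatever segment currently holds it.
    Delete(key []byte, id uint64) error
}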

Besides this issue, the data is copied several times along the way: the Go LevelDB driver copies all data fetched from LevelDB with C.GoBytes; the LevelDB C API duplicates the same data from a std::string into a new char* with memcpy; the neosearch lib translates []byte to/from []uint64 to preserve the endianness; and MergeSet requires a new buffer of the same size. Where appropriate, consider fixing the endianness through a database configuration option and avoiding the duplication by implementing a C++ interface to LevelDB that uses Go reflection to refer to the LevelDB-allocated memory.
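As a side note, the []byte to []uint64 translation mentioned above amounts to one more full copy of the data. A rough sketch of what it boils down to (big-endian IDs are an assumption here; utils.BytesToUint64 presumably does the equivalent per 8-byte chunk):

// import "encoding/binary"

// Sketch of the []byte -> []uint64 translation: every GET pays for one more
// full copy of the data just to fix the endianness.
func bytesToUint64Slice(b []byte) []uint64 {
    out := make([]uint64, 0, len(b)/8)
    for i := 0; i+8 <= len(b); i += 8 {
        out = append(out, binary.BigEndian.Uint64(b[i:i+8]))
    }
    return out
}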

A tokenization interface could be provided to the analyzer, with support for stopwords. Sample sources:

i4ki commented 9 years ago

Yeah, agreed!

Key segmentation solves the OOM problem in the current implementations for leveldb, rocksdb and others, but it will increase the complexity of indexing documents and impact query performance depending on the implementation choice.

Besides key segmentation and analyzers (stopwords), we need support for data sharding. These three improvements will considerably reduce our problem with large chunks of data in memory.

@rzanato In some implementations we can avoid database maintenance on indexing/deleting. One way to achieve that is to not keep the segments strictly fixed-size and accept gaps in some of them.

Take a look at the implementation idea below:

Maximum sized segments (data unordered)

Fast indexing, slow deletes, query time directly proportional to the number of segments in the worst case (cache miss)

We can index keys with sequential indices using the format <word>.<segment-idx>, but keeping the values unordered, like below:

key=<word>.1, value=<maximum of 1GB of ids>
key=<word>.2, value=<maximum of 1GB of ids>
...
key=<word>.__next, value=<next-empty-segment-id>

This way, inserts of new IDs always occur in the last <segment-idx>, which is very fast, but since the values are "unordered" a delete will require a full scan over all segments. For queries we'll need to iterate over the entire dataset (all segments), so in the worst case we need N logical seeks on disk (or cache hits) instead of 1, where N is the number of segments. I don't think this is a problem, but we need to benchmark some use cases, because 1 (one) seek followed by a very large read into memory can be painful when cache misses are frequent. Every operation will need the __next value, which can be kept in memory or retrieved every time (we can think of a better alternative).

Deletes will either leave the segments with gaps (the way I'd prefer) or involve more complex management, like merging keys or moving data from the last segment into segments that had data deleted, etc.
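To make the scheme concrete, a rough sketch of the insert path (everything here — Store, getSegmentIndex, setSegmentIndex, uint64ToBytes — is a hypothetical helper, not existing code):

// import "fmt"

// Hypothetical sketch of MergeSet over maximum-sized, unordered segments.
// segmentSize is the configured maximum segment size in bytes.
func mergeSetSegmented(store Store, word string, id uint64, segmentSize int) error {
    // <word>.__next keeps the index of the segment currently being filled.
    idx, err := getSegmentIndex(store, word)
    if err != nil {
        return err
    }

    segKey := fmt.Sprintf("%s.%d", word, idx)
    value, err := store.Get([]byte(segKey))
    if err != nil {
        return err
    }

    if len(value)+8 > segmentSize {
        // The open segment is full: advance __next and start a new one.
        idx++
        if err := setSegmentIndex(store, word, idx); err != nil {
            return err
        }
        segKey = fmt.Sprintf("%s.%d", word, idx)
        value = nil
    }

    // Inserts always land in the last segment; the value stays unordered.
    value = append(value, uint64ToBytes(id)...)
    return store.Set([]byte(segKey), value)
}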

What do you think?

ppizarro commented 9 years ago

The problem we would like to solve here at Neoway: most of the time we need to find documents by an exact value in a field. Most of the fields are of type integer, boolean, date, double and string (not analyzed). We also need to calculate a huge amount of aggregations. I don't know if a reverse index is the best solution for this. How can we search by a range of dates using a reverse index? I think we must design how we will solve this first. Maybe storing the data unordered is not the best strategy for this, although it could be good for full-text search.

i4ki commented 9 years ago

I think you misunderstood the "data unordered" part. The keys are always lexicographically ordered and suited for range queries over all data types. In the case of numbers (integers, floats, and so on) and dates (integers), we index the keys in big-endian format; see below:

Indexing the values 1, 2 and 1000, we'll obtain the following keys, lexicographically ordered:

1    -> 00 00 00 00 00 00 00 01
2    -> 00 00 00 00 00 00 00 02
1000 -> 00 00 00 00 00 00 03 e8
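For reference, a minimal sketch of the encoding step (presumably what utils.IntegerToBytes does), using encoding/binary:

// import "encoding/binary"

// Encode the value as 8 big-endian bytes so that the byte-wise
// (lexicographic) key order matches the numeric order.
func integerKey(v uint64) []byte {
    var b [8]byte
    binary.BigEndian.PutUint64(b[:], v)
    return b[:] // e.g. 1000 -> 00 00 00 00 00 00 03 e8
}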

And we can use the leveldb/rocksdb Iterator to get all of the document values for the key range 1-1000. The code will be something like this:

func (i *Index) rangeInteger(field string, from int, to int) ([]uint64, error) {
    var docIDs []uint64

    storekv, err := i.engine.GetStore(i.Name, utils.FieldNorm(field)+"_int.idx")
    if err != nil {
        return nil, err
    }

    it := storekv.GetIterator()
    defer it.Close()

    fromBytes := utils.IntegerToBytes(from)

    // Seek to the first key >= from and walk the keys in lexicographic
    // (== numeric, thanks to big endian) order.
    for it.Seek(fromBytes); it.Valid(); it.Next() {
        var ids []uint64
        cur := utils.BytesToInteger(it.Key())

        // Stop once we leave the requested range.
        if cur < from || cur >= to {
            break
        }

        dataBytes := it.Value()
        if len(dataBytes) == 0 {
            continue
        }

        // The value is a packed list of 8-byte document IDs.
        for i := 0; i < len(dataBytes); i += 8 {
            v := utils.BytesToUint64(dataBytes[i : i+8])
            ids = append(ids, v)
        }

        if len(docIDs) == 0 {
            docIDs = ids
            continue
        }

        for _, id := range ids {
            docIDs = utils.UniqueUint64Add(docIDs, id)
        }
    }

    if err := it.GetError(); err != nil {
        return nil, err
    }

    return docIDs, nil
}

The code above isn't tested, but it should work because I made some experiments with leveldb before deciding to store integers in big endian. Of course we can store integers in little-endian format too, but to use the Iterator we'd need to write a comparator function to replace the leveldb default (lexicographic). The example implementation above doesn't use any "hacks" for performance; I think it will be easy to improve in the future. See the Lucene approaches below:

http://chase-seibert.github.io/blog/2009/09/11/numeric-range-searches-in-lucene.html
http://pt.slideshare.net/VadimKirilchuk/numeric-rangequeries

The current implementation in Lucene is called TrieRangeQuery and is based on the following paper: http://core.ac.uk/download/pdf/11761587.pdf

But I agree with you: I'm not sure a reverse index is the solution we seek, and that's exactly why I created issue #36 =D

ppizarro commented 9 years ago

OK, I get it. I confused "data unordered" with the order of the keys. Silly me! I'm going to read the articles you sent me.

i4ki commented 9 years ago

No... That was my fault! I should have made it clearer.

=)

ppizarro commented 9 years ago

@tiago4orion Key segmentation solves the OOM problem in the current implementations for leveldb, rocksdb and others, but it will increase the complexity of indexing documents and impact query performance depending on the implementation choice.

Another idea would be to append the document ID to the key.

Using word and document ID as Key

Instead of having a single key for each word with the list of document IDs associated with it:

key=<word>, value=[ID1, ID5, ID7]

We could have a key for each documentID:

key=<word>.<ID1>, value=1
key=<word>.<ID5>, value=1
...
key=<word>.<documentIDn>, value=1

So we'll use the storage engine (LSM, B-tree, fractal tree, ...) to order the IDs, to add new ones and to remove them.

For queries, we'll need to iterate through the keys to get the IDs, but only as many times as needed to answer the query, since the keys/IDs are already ordered.

Fast indexing, fast deletes and query time directly proportional to the number of keys in cache (stable)
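A sketch of how a query could walk those keys (the iterator interface here is hypothetical; the exact store API doesn't matter for the idea):

// PrefixIterator is a hypothetical ordered key/value iterator positioned on
// all keys with a given prefix (e.g. "<word>.").
type PrefixIterator interface {
    Valid() bool
    Next()
    Key() []byte
}

// Collect up to limit document IDs for a word. The storage engine already
// keeps the "<word>.<docID>" keys ordered, so no in-memory sort is needed
// and we can stop as soon as the query is satisfied.
func docIDsForWord(it PrefixIterator, prefix string, limit int) []uint64 {
    var ids []uint64
    for ; it.Valid() && len(ids) < limit; it.Next() {
        // The document ID is the key suffix after "<word>." (assumed to be
        // 8 big-endian bytes, decoded by a hypothetical bytesToUint64).
        ids = append(ids, bytesToUint64(it.Key()[len(prefix):]))
    }
    return ids
}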

What do you think?

i4ki commented 9 years ago

@ppizarro I don't have enough experience with LSM-trees to judge which alternative is better, so I propose we implement both of them and benchmark performance for some use cases.

=)

But... thinking about your solution yesterday, my current guess is that it's not a good approach, given my current (beginner) knowledge of LSM trees. It looks too optimistic for my taste, but we need to implement and benchmark some use cases.

Doing a simple analysis, I think we can get quadratic time for some queries (the worst case) using this approach. The MemTable (or L0) data structure is a skip list, which means that with your solution the cost to get all of the document IDs (size represented as "n") for a given query (if the entire data fits in memory and is already cached) will be O(n * log n) on average and n * O(n) = O(n²) in the worst case... With cache misses it becomes necessary to scan every subtree on disk, so this is a very optimistic analysis that ignores any disk seeks.

Average: O(n * log n)
Worst: O(n²)

So we'll have something between O(n * log n) and quadratic time.

Using the other idea, segmentation of values, and keeping the reverse index, we need O(k * log n) to get the entire ID dataset unordered, where k is the number of segments, plus something like O(n * log n) for sorting (e.g. mergesort) in the average case, and O(k * n) + O(n * log n) in the worst case.

Average: O(k * log n) + O(n * log n) OR O((k+n) * log n)
Worst: O(k * n) + O(n * log n)

So this way we can completely avoid quadratic-time queries in n (the index size).

Keeping in mind that:

k = number_of_segments = 1 + (n/segment_size)

So, to index 100,000,000 documents (100,000,000 IDs × 8 bytes ≈ 763 MiB) with a segment size of 1GB we will need only 1 segment, with a segment size of 512MB only 2 segments, and so on; with a segment size of 1MB, 763 segments will be necessary.

segment size 1GB = O((n+1) * log n)
segment size 512MB = O((n+2) * log n)

So, of course, k grows much slower than n.

In addition, we can use any sorting algorithm optimized for positive integers (maybe radix sort), but I think merge sort fits perfectly because of the nature of our already-segmented data. We can implement a specialized mergesort that sorts the segments in parallel as soon as they are obtained by Store.Get(), merging the sorted sets at the end.
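For the merge step, a sketch of combining two already-sorted segments; a parallel version would just sort segments concurrently and fold the results the same way:

// mergeSorted merges two ascending []uint64 slices in O(len(a)+len(b)),
// which is why merge sort fits the already-segmented data so well.
func mergeSorted(a, b []uint64) []uint64 {
    out := make([]uint64, 0, len(a)+len(b))
    i, j := 0, 0
    for i < len(a) && j < len(b) {
        if a[i] <= b[j] {
            out = append(out, a[i])
            i++
        } else {
            out = append(out, b[j])
            j++
        }
    }
    out = append(out, a[i:]...)
    return append(out, b[j:]...)
}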

But I understand the other benefits of your idea, because that way we can store document metadata (term frequencies, highlight positions in the source document, and so on). So I encourage us to implement both of them and analyse the benchmarks and tradeoffs of each implementation.

Next weekend I'll have time and maybe I'll start working on it.

Thanks @richard-ps for pointing out a mistake related to exponential vs. quadratic complexity in my answer.

ppizarro commented 9 years ago

Great analysis. I strongly agree with you.

Thinking about the solution that puts the document ID in the key, maybe we could swap the LSM (leveldb) for a B+Tree (boltdb). A lookup in a B+Tree is O(log n) both on average and in the worst case, so a simple query will be n * O(log n).

Average: O(n * log n)
Worst: O(n * log n)

Then we'll have O(n * log n) time and no more quadratic time.

Another point I would like to discuss is the "merge" of two or more results.

For example, to solve the query age == 5 AND name == John, we'll have to perform the intersection of the list of document IDs with age == 5 and the list of document IDs with name == John.

We won't be able to load these two lists into memory to sort them and then perform the merge. So I think we'll have to get the IDs one by one and verify that each exists in the other list.

For the solution using the document ID in the key, we iterate through each list once.

Average: O(n * log n) + O(n * log n) = O(2 * n * log n)
Worst: O(n * log n) + O(n * log n) = O(2 * n * log n)

So we would still have O(n * log n) time.

For the solution using segments, we iterate one list once, but the other we need to iterate n times in the worst case.

Average: O((k+n) * log n) + n * O((k+n) * log n) => O(n * log n) + O(n² * log n)
Worst: O(k * n) + O(n * log n) + n * (O(k * n) + O(n * log n)) => O(n²) + O(n² * log n)

So we'll have quadratic time to merge the results.

Can we do this merge better?

i4ki commented 9 years ago

I'm taking a look at BoltDB right now; apparently it supports a lot of interesting features that are useful for neosearch. I'm starting to develop boltdb support in the project to try these new ideas, because leveldb/rocksdb will not perform well this way (sequential access by key).

BoltDB provides high performance for sequential reads and writes (which is good for the case where the document ID is in the key), but poor performance for random writes. I don't know yet about random reads. It uses mmap to manage the database file, so performance is very dependent on page-cache (4KB) operations.

I'm skeptical only because of the bleve documentation at the link below: http://www.blevesearch.com/docs/Building/

Bleve supports using LevelDB as a KVStore implementation. At this time, this is the fastest KVStore implementation available for bleve. So if maximum performance is your goal, consider building with support for LevelDB.

And according to what you told me before, bleve uses this same approach for storing keys, but boltdb does not perform as well (as stated in the link above). It's a bit strange; maybe it's an implementation issue.

About the "merge" or "intersection" of lists, I didn't understand what you said. The current "intersection" algorithm implemented in neosearch (see the link below) is called Zipper and have complexity of O(n + m) where n and m is the size of each list. https://github.com/NeowayLabs/neosearch/blob/master/lib/neosearch/search/search.go#L75

It's the most straightforward implementation; tons of other algorithms exist [1][2][3][4][5] that we could use, but the complexity of the intersection is independent of the storage implementation (LSM, B+tree, etc.). The idea of keeping the indices (values) unordered was a response to @rzanato's warning about the database maintenance needed for deletes/inserts on ordered segments. But if we conclude that indexing time and maintenance are not critical, we can store the values sorted (that's the way it's done today).

The point where I think we shine is in using the reverse index as an array of sorted integers. This very simple approach gives us a human-friendly database. The bleve approach makes manual interaction with tools like nscli and nsdump impossible.

But... we need to solve the problem. I'll create a pull request today with a boltdb backend implementation. After that, we can start benchmarking things with a B+tree behind the scenes for comparison.

[1] http://algo2.iti.kit.edu/documents/Algorithmentechnik/alenex07_2.pdf
[2] http://www.vldb.org/pvldb/2/vldb09-pvldb37.pdf
[3] http://research.microsoft.com/pubs/173795/vldb11intersection.pdf
[4] http://www.cse.buffalo.edu/tech-reports/2007-11.pdf
[5] https://highlyscalable.wordpress.com/2012/06/05/fast-intersection-sorted-lists-sse/

ppizarro commented 9 years ago

About the "merge" or "intersection" of lists the algorithm is not the problem. The current "intersection" implementation (https://github.com/NeowayLabs/neosearch/blob/master/lib/neosearch/search/search.go#L75) is exactly the problem. How will you call the "and" function if you can't have the lists loaded in RAM?

I understand that you would have to implement the Zipper algorithm iterating over the keys directly in the store, not in RAM.

So I think that using "document ID in the key" the complexity (to perform the merge) will be exponential, and using "segments of unordered IDs" the complexity will be quadratic, because the segments don't keep the document IDs ordered. That's what I tried to explain previously, but I guess I wasn't clear.

i4ki commented 9 years ago

OK, now I understand what you mean.

"How will you call the "and" function if you can't have the lists loaded in RAM?" If we have segmented data for the values, then I can apply the "and" function incrementally between each segment's IDs (at most segment_size of them) and the result list, as long as the result list is smaller than the available memory.

I think that talking with code we can understand each other a lot better :)

func Search(index *index.Index, dsl map[string]string, limit uint) ([]string, uint64, error) {
    var resultDocIDs []uint64

    for field, value := range dsl {
        // index.FilterSegments returns an iterator used to sequentially get
        // the ids of each segment (already sorted); First/Next return the
        // ids of the current/next segment
        docIterator, err := index.FilterSegments([]byte(field), []byte(value))
        if err != nil {
            return nil, 0, err
        }

        // segmentIDs is an []uint64 with length less than (or equal to) segment_size
        for segmentIDs := docIterator.First(); docIterator.Valid(); segmentIDs = docIterator.Next() {
            if len(resultDocIDs) == 0 {
                resultDocIDs = segmentIDs
                continue
            }

            // this can lead to out of memory eventually
            resultDocIDs = and(resultDocIDs, segmentIDs)
        }
    }

    results, err := index.GetDocs(resultDocIDs, limit)
    return results, uint64(len(resultDocIDs)), err
}

This is the fastest way I can think of, because we work directly with the array of bytes.

The other way, using the store (leveldb or bolt) to get every id needed for the comparison, besides being very slow, still has the same problem of the result list of ids being bigger than the available RAM, doesn't it?

ppizarro commented 9 years ago

That's it - quadratic complexity!

k (# segments) * (sort + merge)

O(merge) = m (# of document IDs in the segment)

For big sizes, k ~= n (total number of document IDs).


ppizarro commented 9 years ago

Maybe linear complexity!

O(merge) and O(sort) could be considered constant for big n, since their input size is always m.

ppizarro commented 9 years ago

To resolve the query "field1" == "result1" AND "field2" == "result2", we first need to get the list of document IDs for each field.

list1 = list of document IDs where field1 == result1
list2 = list of document IDs where field2 == result2

Each list will instead be a list of segments:

list1 = list of segments for field1
list2 = list of segments for field2

Now we need to perform the merge of these two lists.

In your code above you have only one list of segments. Is that right? I don't understand this :(

i4ki commented 9 years ago

Sorry for the delay.

Thanks for the detailed look. The code I posted before is wrong =( but the idea is essentially the same.

The correct pseudocode (I think) is below:

func Search(index *index.Index, dsl map[string]string, limit uint) ([]string, uint64, error) {
    var resultDocIDs []uint64

    for field, value := range dsl {
        // index.FilterSegments returns an iterator used to sequentially get
        // the ids of each segment (already sorted); First/Next return the
        // ids of the current/next segment
        docIterator, err := index.FilterSegments([]byte(field), []byte(value))
        if err != nil {
            return nil, 0, err
        }

        // The first filter loads all of its segments into memory.
        if len(resultDocIDs) == 0 {
            resultDocIDs = getAllSegments(docIterator)
            continue
        }

        // segmentIDs is an []uint64 with length less than (or equal to) segment_size
        for segmentIDs := docIterator.First(); docIterator.Valid(); segmentIDs = docIterator.Next() {
            // this "and" can lead to out of memory eventually
            resultDocIDs = and(resultDocIDs, segmentIDs)
        }
    }

    results, err := index.GetDocs(resultDocIDs, limit)
    return results, uint64(len(resultDocIDs)), err
}

You can use it with the code below:

Search(index, map[string]string{
    "name":     "john",
    "lastname": "carter",
}, 10)

The getAllSegments function will iterate the docIterator and get all of the ids, but only for the first filter. The subsequent filters will apply the "and" function between resultDocIDs and each segment's data. So the implementation above has space complexity directly proportional to the number of integers returned by the first query (name = 'john').

This way it's impossible to get an OOM at indexing time (the original issue), but it's still possible to get an OOM in the call to getAllSegments (or something like it).
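For clarity, a sketch of what getAllSegments amounts to (SegmentIterator being the hypothetical iterator returned by index.FilterSegments in the pseudocode above):

// SegmentIterator is the hypothetical iterator used in the pseudocode above:
// First/Next return the (sorted) ids of the current/next segment.
type SegmentIterator interface {
    First() []uint64
    Next() []uint64
    Valid() bool
}

// getAllSegments drains the iterator and concatenates every segment.
// This is exactly the step that can still OOM when the first filter
// matches too many documents.
func getAllSegments(it SegmentIterator) []uint64 {
    var all []uint64
    for ids := it.First(); it.Valid(); ids = it.Next() {
        all = append(all, ids...)
    }
    return all
}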

We can avoid loading all of the first filter's IDs into memory by redesigning the way segments are stored, using fixed-size bitsets for the segment values. Then we can apply the "and" and "or" operations only between 2 (two) segments loaded in memory, appending the results to an output list (this output list can grow indefinitely too, but then the problem is in code neosearch controls, so we can avoid a crash). But this way we may have other tradeoffs that I'm not yet aware of...
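A sketch of what those fixed-size bitset segments could look like (sizes and names here are assumptions): each segment covers a fixed range of document IDs, so "and"/"or" only ever touch two equally-sized arrays at a time:

// segmentBits is the number of document IDs covered by one bitset segment
// (an assumption for this sketch: 1<<23 bits = 1 MiB per segment).
const segmentBits = 1 << 23

// BitsetSegment: bit i set means document (segmentBase + i) matches.
type BitsetSegment [segmentBits / 64]uint64

// And intersects two segments covering the same ID range, word by word.
// Memory use is bounded by two fixed-size segments, never the full index.
func (s *BitsetSegment) And(other *BitsetSegment) {
    for i := range s {
        s[i] &= other[i]
    }
}

// Or unions two segments covering the same ID range.
func (s *BitsetSegment) Or(other *BitsetSegment) {
    for i := range s {
        s[i] |= other[i]
    }
}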

If we use the leveldb (skiplist) or boltdb B+tree directly to execute the intersection function, can we avoid any kind of OOM?