cliqz-oss / keyvi

Keyvi - a key value index that powers Cliqz search engine. It is an in-memory FST-based data structure highly optimized for size and lookup performance.
https://cliqz.com
Apache License 2.0
179 stars 38 forks source link

bugfix: Sigsev when trying to merge .kv file with no data #224

Closed narekgharibyan closed 7 years ago

narekgharibyan commented 7 years ago

@hendrikmuhs Fixes #211

coveralls commented 7 years ago

Coverage Status

Coverage increased (+0.01%) to 86.709% when pulling a6b870e317e5bfc32493787fd39df82de7cdc510 on narekgharibyan:merge_crush_fix into 33f83e93969fe593bca09b202f727f88a81ab3f7 on cliqz-oss:master.

hendrikmuhs commented 7 years ago

I like the added Empty() interface.

The fix looks good, it's probably a good idea to prevent adding it early on. Still, that hides a bit the real problem which is the unchecked push here:

https://github.com/cliqz-oss/keyvi/blob/master/keyvi/src/cpp/dictionary/dictionary_merger.h#L153

As in https://github.com/cliqz-oss/keyvi/blob/master/keyvi/src/cpp/dictionary/dictionary_merger.h#L199

we should only push SegmentIterator if it isn't exhausted. Although it is not necessary with the fix, I suggest to either document it there, or fix it to be on the safe side.

narekgharibyan commented 7 years ago

Hmmm,

In the second case I think we don't have a problem at all: https://github.com/cliqz-oss/keyvi/blob/master/keyvi/src/cpp/dictionary/dictionary_merger.h#L199 The if check makes sure that we push only valid iterator.

For the first case, I assume we are guaranteed that our iterator is valid (non-empty automata can not return empty iterator). But we can add an assert here as an internal integrity check.

hendrikmuhs commented 7 years ago

Maybe, I described it wrong:

I am asking for doing at line 153 the same as at line 199:

SegmentIterator segment_it(e_it, i++);
if (segment_it) {
    pqueue.push(segment_it);
}
narekgharibyan commented 7 years ago

What I was suggesting is to have an assert check:

SegmentIterator segment_it(e_it, i++);
assert(segment_it);
pqueue.push(segment_it);

According to our semantics we can not have a case when it is false, yes ? (after empty check)

[Update] Also I'm not sure what we have to do with other fields (for instance inputFiles_) in case we skip some segment.

narekgharibyan commented 7 years ago

BTW, any news about your Skype ? :)

hendrikmuhs commented 7 years ago

I think an empty iterator is still a valid one, as a dictionary without a key is valid as well. Therefore I think assert isn't the right thing here. You are right that your empty check already prevents that case, so we are just talking about correctness/code sugar.

narekgharibyan commented 7 years ago

@hendrikmuhs please let me know what you think about the last commit.

Also I wanted to add a check, which prevents adding the same file multiple times:

    void Add(const std::string& filename) {

        if (std::count(inputFiles_.begin(), inputFiles_.end(), filename)) {
            throw std::invalid_argument("File is added already");
        }
     ...
    }

But realized that some of unit tests already do that, like here: https://github.com/cliqz-oss/keyvi/blob/master/keyvi/tests/cpp/dictionary/dictionary_merger_test.cpp#L171 Is it intentional ?

coveralls commented 7 years ago

Coverage Status

Coverage increased (+0.004%) to 86.702% when pulling 749c4dc6712388fd66d9fdab9ecb8453ca5db641 on narekgharibyan:merge_crush_fix into 33f83e93969fe593bca09b202f727f88a81ab3f7 on cliqz-oss:master.

hendrikmuhs commented 7 years ago

LGTM!, That moves error handling where it should be, meaning if you give it a non-existing or corrupt file it fails in Add() not in Merge(). Good idea!

As for the test: Not intentional, I think the idea was just to overwrite a key with a new value and then overwrite it back to the original state. So to fix it you can simply create another dictionary based on test_data.

hendrikmuhs commented 7 years ago

LGTM

coveralls commented 7 years ago

Coverage Status

Coverage increased (+0.01%) to 86.709% when pulling 193f04336e7f28ac0cff4dbbab812a4b740363f9 on narekgharibyan:merge_crush_fix into 33f83e93969fe593bca09b202f727f88a81ab3f7 on cliqz-oss:master.