Jiacheng-WU / lipp

Updatable Learned Index with Precise Positions

Segmentation Fault when inserting keys #5

Closed zhczhong closed 3 years ago

zhczhong commented 3 years ago

Hi Jiacheng,

We want to test how different learned indexes in SOSD behave under distribution shift. But when I run lipp from the develop branch, it throws a segmentation fault, even though I have already removed duplicate keys. The code below can be used to reproduce the error, and the dataset can be fetched by running the script at https://github.com/learnedsystems/SOSD/blob/master/scripts/download.sh. Would you mind taking a look and fixing it? Besides, what is the estimated time for the range query interface?

#include <lipp.h>
#include <iostream>
#include <vector>
#include <cassert>
#include <cstdint>
#include <algorithm>
#include <fstream>
#include <string>

using namespace std;

template<class T>
bool load_binary_data(T data[], int length, const std::string &file_path) {
    std::ifstream is(file_path.c_str(), std::ios::binary | std::ios::in);
    if (!is.is_open()) {
        return false;
    }
    is.read(reinterpret_cast<char *>(data), std::streamsize(length * sizeof(T)));
    is.close();

    return true;
}

// Sort and deduplicate key1 and key2 in place, then drop from key2 any key that
// also appears in key1; the surviving keys are shuffled and size1/size2 updated.
template<typename T>
T *unique_data(T *key1, size_t &size1, T *key2, size_t &size2) {
    size_t ptr1 = 0;
    size_t ptr2 = 0;

    std::sort(key1, key1 + size1);
    size1 = std::unique(key1, key1 + size1) - key1;
    std::sort(key2, key2 + size2);
    size2 = std::unique(key2, key2 + size2) - key2;

    size_t result = 0;

    while (ptr1 < size1 && ptr2 < size2) {
        while (key1[ptr1] < key2[ptr2]) {
            ptr1++;
        }
        if (key1[ptr1] == key2[ptr2]) {
            ptr2++;
            continue;
        }
        key2[result++] = key2[ptr2++];
    }

    while (ptr2 < size2) {
        key2[result++] = key2[ptr2++];
    }

    size2 = result;
    std::random_shuffle(key2, key2 + size2);

    return &key2[result];
}

int main() {
    LIPP <uint64_t, uint64_t> lipp;
    size_t data_size = 200000000;
    size_t bulk_load_size = 100000000;
    size_t to_insert = data_size - bulk_load_size;

    // prepare data
    uint64_t *keys = new uint64_t[data_size];
    cout << "Loading data" << endl;
    load_binary_data(keys, bulk_load_size, "../fb_200M_uint64");
    load_binary_data(keys + bulk_load_size, to_insert, "../books_200M_uint64");
    cout << "Make data Unique" << endl;
    unique_data(keys, bulk_load_size, keys + bulk_load_size, to_insert);

    vector <pair<uint64_t, uint64_t>> data;
    cout << "bulk load size is " << bulk_load_size << endl;
    for (size_t i = 0; i < bulk_load_size; i++) {
        data.push_back({keys[i], keys[i]});
    }

    cout << "Begin to bulkload" << endl;

    // bulk load
    lipp.bulk_load(data.data(), bulk_load_size);

    cout << "Finish bulkload" << endl;
    cout << "Begin insert" << endl;
    for (int i = 0; i < to_insert; i++) {
        lipp.insert(keys[bulk_load_size + i], i);
    }

    cout << "Finish insert" << endl;

    return 0;
}

The gdb backtrace is as follows:

#0  0x0000000000411100 in LIPP<unsigned long, unsigned long, true>::insert_tree (this=0x7fffffff3140, _node=0xb5d098e0, key=@0x7fffc75b1a58: 772277387847372864, 
    value=@0x7fffffff3238: 73) at lipp/src/core/lipp.h:795
#1  0x000000000041015c in LIPP<unsigned long, unsigned long, true>::insert (this=0x7fffffff3140, key=@0x7fffc75b1a58: 772277387847372864, value=@0x7fffffff3238: 73)
    at lipp/src/core/lipp.h:110
#2  0x000000000040f55f in main () at lipp/src/examples/example_bulk_load.cpp:85
Jiacheng-WU commented 3 years ago

It might be a bug in your code. You have indeed removed the duplicate keys, but the variable "bulk_load_size" is changed by "unique_data" since it is passed by reference: after "std::unique" it becomes the position of the first duplicate, i.e. the deduplicated size. However, the keys to insert were placed at "keys + bulk_load_size" using the previous value of bulk_load_size, not the changed one, right? Thus, when you run the insert loop for (int i = 0; i < to_insert; i++) { lipp.insert(keys[bulk_load_size + i], i); } you may find that you still insert duplicate keys into our index, which causes severe problems. You may need a separate variable so that the two meanings of bulk_load_size do not get confused. So you'd better check your code after encountering the fault! Please be careful when using C++! HAHA!
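For illustration, here is a minimal, hypothetical sketch of this pass-by-reference pitfall (the helper and names below are made up for this example, not taken from LIPP or your repro): a size passed by reference can shrink inside the call, so any offset computed with the old value no longer matches the deduplicated range.

#include <algorithm>
#include <cstddef>
#include <iostream>

// Hypothetical helper: sorts and deduplicates in place, shrinking n through the reference.
void dedup(int *a, std::size_t &n) {
    std::sort(a, a + n);
    n = std::unique(a, a + n) - a;   // n may become smaller here
}

int main() {
    int a[] = {3, 1, 3, 2};
    std::size_t n = 4;               // original size
    dedup(a, n);                     // after the call, n == 3
    // Any index computed with the original size (4) now points past the
    // deduplicated range; keep a separate variable if the original offset is still needed.
    std::cout << "deduplicated size = " << n << std::endl;
    return 0;
}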

As for the range query, please be patient. We are currently on holiday (National Day), and we will try our best to publish the code for the range query before the end of October!

Sincerely, Jiacheng

zhczhong commented 3 years ago

Hi, Jiacheng,

Thanks for your reply! I have replaced bulk_load_size with a separate variable, but the problem still exists. In fact, there are no duplicate keys in the first dataset, so bulk_load_size would not have changed anyway. The following is the modified version of the code and the corresponding output. Would you mind taking a look?

Besides, we are rushing for a deadline at the end of October; is there any chance the range query interface could be available earlier?

#include <lipp.h>
#include <iostream>
#include <vector>
#include <cassert>
#include <cstdint>
#include <algorithm>
#include <fstream>
#include <string>

using namespace std;

template<class T>
bool load_binary_data(T data[], int length, const std::string &file_path) {
    std::ifstream is(file_path.c_str(), std::ios::binary | std::ios::in);
    if (!is.is_open()) {
        return false;
    }
    is.read(reinterpret_cast<char *>(data), std::streamsize(length * sizeof(T)));
    is.close();

    return true;
}

// Sort and deduplicate key1 and key2 in place, then drop from key2 any key that
// also appears in key1; the surviving keys are shuffled and size1/size2 updated.
template<typename T>
T *unique_data(T *key1, size_t &size1, T *key2, size_t &size2) {
    size_t ptr1 = 0;
    size_t ptr2 = 0;

    std::sort(key1, key1 + size1);
    size1 = std::unique(key1, key1 + size1) - key1;
    std::sort(key2, key2 + size2);
    size2 = std::unique(key2, key2 + size2) - key2;

    size_t result = 0;

    while (ptr1 < size1 && ptr2 < size2) {
        while (key1[ptr1] < key2[ptr2]) {
            ptr1++;
        }
        if (key1[ptr1] == key2[ptr2]) {
            ptr2++;
            continue;
        }
        key2[result++] = key2[ptr2++];
    }

    while (ptr2 < size2) {
        key2[result++] = key2[ptr2++];
    }

    size2 = result;
    std::random_shuffle(key2, key2 + size2);

    return &key2[result];
}

int main() {
    LIPP <uint64_t, uint64_t> lipp;
    size_t data_size = 200000000;
    size_t bulk_load_size = 100000000;
    size_t to_insert = data_size - bulk_load_size;
    size_t unique_bulk_load_size = bulk_load_size;

    // prepare data
    uint64_t *keys = new uint64_t[data_size];
    cout << "Loading data" << endl;
    load_binary_data(keys, bulk_load_size, "../fb_200M_uint64");
    load_binary_data(keys + bulk_load_size, to_insert, "../books_200M_uint64");
    cout << "Make data Unique" << endl;
    unique_data(keys, unique_bulk_load_size, keys + bulk_load_size, to_insert);

    vector <pair<uint64_t, uint64_t>> data;
    cout << "bulk load size is " << unique_bulk_load_size << endl;
    cout << "The number of keys to insert is " << to_insert << endl;
    for (size_t i = 0; i < unique_bulk_load_size; i++) {
        data.push_back({keys[i], keys[i]});
    }

    cout << "Begin to bulkload" << endl;

    // bulk load
    lipp.bulk_load(data.data(), unique_bulk_load_size);

    cout << "Finish bulkload" << endl;
    cout << "Begin insert" << endl;
    for (int i = 0; i < to_insert; i++) {
        lipp.insert(keys[bulk_load_size + i], i);
    }

    cout << "Finish insert" << endl;

    return 0;
}
Loading data
Make data Unique
bulk load size is 100000000
The number of keys to insert is 99999997
Begin to bulkload
Finish bulkload
Begin insert
Segmentation fault (core dumped)
Jiacheng-WU commented 3 years ago

It seems really weird. Unfortunately, we have never tested our code on those datasets before, and that access pattern (i.e. distribution shift) is also a bit outside our scope, so we have never encountered these bugs and have no experience resolving them. After some preliminary debugging, it seems we reuse/allocate some nodes incorrectly in your extreme case. But you know, those kinds of bugs are really hard to deal with. We will therefore try our best to solve this issue, but we cannot guarantee whether or when we will fix it. You could also try to resolve it yourself, since it seems you are rushing toward a deadline.

As for the range query interface, since we are still on vacation and also have deadlines of our own to meet, we apologize that we cannot make any promises about an earlier date. But we will try our best to provide the related code once we are less busy. I hope you understand!

Sincerely, Jiacheng

zhczhong commented 3 years ago

Hi Jiacheng, sorry to bother you during your vacation. Thanks for your reply and the information. I will keep an eye on your repository and try to resolve the issue on our side.

For the range query interface, I noticed that your paper includes an experiment with it, so I guess you have already implemented it but may not have cleaned up the code yet. Could you share the original code with us so that we can build on it ourselves?

Regards, Zhicong

Jiacheng-WU commented 3 years ago

Hi Zhicong,

Thanks for your understanding, and for a possible future contribution if you resolve those issues. HAHA! Besides, I have also sent you an email about the range query interface.

Sincerely, Jiacheng