Jiacheng-WU / lipp

Updatable Learned Index with Precise Positions
MIT License

Segmentation Fault in bulk loading #2

Closed zhczhong closed 3 years ago

zhczhong commented 3 years ago

Hi,

A segmentation fault occurs when bulk loading the osm_cellids_200M_uint64 dataset from SOSD. The code below can be used to reproduce the error, and the dataset can be fetched by running the script at https://github.com/learnedsystems/SOSD/blob/master/scripts/download.sh.

#include <lipp.h>
#include <iostream>
#include <vector>
#include <algorithm>
#include <fstream>

using namespace std;

template<class T>
bool load_binary_data(T data[], size_t length, const std::string &file_path) {
    std::ifstream is(file_path.c_str(), std::ios::binary | std::ios::in);
    if (!is.is_open()) {
        return false;
    }
    is.read(reinterpret_cast<char *>(data), std::streamsize(length * sizeof(T)));
    is.close();

    return true;
}

int main()
{
    LIPP<uint64_t, uint64_t> lipp;
    size_t data_size = 100000000;

    // prepare data: load and sort the first 100M keys of the dataset
    uint64_t *keys = new uint64_t[data_size];
    if (!load_binary_data(keys, data_size, "./osm_cellids_200M_uint64")) {
        cerr << "failed to open ./osm_cellids_200M_uint64" << endl;
        return 1;
    }
    vector<pair<uint64_t, uint64_t>> data;
    std::sort(keys, keys + data_size);
    for (size_t i = 0; i < data_size; i++) {
        data.push_back({keys[i], keys[i]});
    }

    // bulk load
    lipp.bulk_load(data.data(), data.size());

    // normal inserts (note: -100 wraps around to a large uint64_t key)
    lipp.insert(-100, 5);
    lipp.insert(187688, 4);

    // verify LIPP data structure
    lipp.verify();

    cout << "bulk load success" << endl;

    delete[] keys;
    return 0;
}

Output

Program received signal SIGSEGV, Segmentation fault.
0x00000000004048d1 in LIPP<unsigned long, unsigned long, true>::build_tree_bulk_fmcd(unsigned long*, unsigned long*, int) ()
Jiacheng-WU commented 3 years ago

Oh, it might be caused by a memory issue. Have you tried the newest version of our code?

Jiacheng-WU commented 3 years ago

Also, would you mind running the code in debug mode and posting the core dump information, if possible?

zhczhong commented 3 years ago

Hi, Jiacheng

Thanks for your quick reply. I confirm that I am using the latest code (compiled with gcc 8.3.0). The debug information from gdb is below.

Program received signal SIGSEGV, Segmentation fault.
0x00000000004048d1 in LIPP<unsigned long, unsigned long, true>::build_tree_two (value2=5170311421649302915, key2=5170311421649302915, value1=5170311421649302705, 
    key1=5170311421649302705, this=0x7fffffff55a0) at code/lipp/src/core/lipp.h:409
409                 RT_ASSERT(BITMAP_GET(node->none_bitmap, pos) == 1);
(gdb) backtrace
#0  0x00000000004048d1 in LIPP<unsigned long, unsigned long, true>::build_tree_two (value2=5170311421649302915, key2=5170311421649302915, 
    value1=5170311421649302705, key1=5170311421649302705, this=0x7fffffff55a0) at code/lipp/src/core/lipp.h:409
#1  LIPP<unsigned long, unsigned long, true>::build_tree_bulk_fmcd (this=this@entry=0x7fffffff55a0, _keys=_keys@entry=0x7fff97ac1010, 
    _values=_values@entry=0x7fff67fd0010, _size=_size@entry=100000000) at code/lipp/src/core/lipp.h:565
#2  0x0000000000405ecd in LIPP<unsigned long, unsigned long, true>::build_tree_bulk (_size=100000000, _values=0x7fff67fd0010, _keys=0x7fff97ac1010, 
    this=0x7fffffff55a0) at code/lipp/src/core/lipp.h:425
#3  LIPP<unsigned long, unsigned long, true>::bulk_load (this=0x7fffffff55a0, vs=0x7ffec95ab010, num_keys=100000000)
    at code/lipp/src/core/lipp.h:175
#4  0x0000000000402ca5 in main () at code/lipp/src/examples/example_bulk_load.cpp:38
Jiacheng-WU commented 3 years ago

Thanks, I have also reproduced this case on our machines.

Actually, on the provided datasets with larger keys, the model parameters become NaN, which is likely caused by a division by zero. The issue seems to result from the limited precision of the double type: on this dataset, keys that are close together become indistinguishable after conversion to double.
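
To illustrate the precision problem, here is a minimal standalone sketch (not LIPP's code; the keys are made up, chosen to be of the same magnitude as the OSM cell ids). A double has a 53-bit mantissa, so consecutive 64-bit keys near 2^63 round to the same value, and a linear model fitted through them ends up dividing by a zero key range.

#include <cstdint>
#include <iostream>

int main() {
    // Two adjacent 64-bit keys of the same magnitude as OSM cell ids.
    uint64_t k1 = 9000000000000000001ULL;
    uint64_t k2 = 9000000000000000002ULL;

    // Near 2^63 the spacing between representable doubles is 1024,
    // so both keys round to the same double.
    double d1 = static_cast<double>(k1);
    double d2 = static_cast<double>(k2);
    std::cout << std::boolalpha << (d1 == d2) << std::endl; // prints: true

    // A linear model key -> position fitted on d1, d2 computes
    // slope = (pos2 - pos1) / (d2 - d1), i.e. a division by zero.
    return 0;
}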

Besides, we encountered similar situations with ALEX before, but this is the first time with LIPP. I think we have the experience to fix this issue.

We plan to fix this issue within one week, so please be patient. You can star or watch this repo to keep up with our code.

Sincerely

Jiacheng-WU commented 3 years ago

Sorry for closing this issue accidentally.

zhczhong commented 3 years ago

No problem. Looking forward to the update!

zhczhong commented 3 years ago

May I ask for the estimated time of the fix? Thanks in advance!

Jiacheng-WU commented 3 years ago

Yes, we have already located the code causing this issue, but we still need a week to fix and test it, as we have a lot of other things to do.

Jiacheng-WU commented 3 years ago

Hi @zhczhong, we have temporarily fixed this issue in our "develop" branch. To address the precision problem, we introduce a fixed bias on the keys in each node, which may hurt performance slightly.
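
A minimal sketch of the idea (an illustration under the assumption that the bias is something like the node's smallest key, not the actual develop-branch code): subtracting a per-node bias before converting to double keeps the values small, so adjacent keys remain distinguishable.

#include <cstdint>
#include <iostream>

int main() {
    uint64_t k1 = 9000000000000000001ULL;
    uint64_t k2 = 9000000000000000002ULL;
    uint64_t bias = k1; // hypothetical per-node bias, e.g. the node's smallest key

    // Without the bias, both keys collapse to the same double.
    std::cout << (static_cast<double>(k1) == static_cast<double>(k2)) << std::endl; // 1

    // With the bias, the one-unit difference survives the conversion.
    double d1 = static_cast<double>(k1 - bias); // 0.0
    double d2 = static_cast<double>(k2 - bias); // 1.0
    std::cout << (d1 == d2) << std::endl; // 0
    return 0;
}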

zhczhong commented 3 years ago

Hi Jiacheng,

Thanks for your quick response and effort. However, when I use the code in the develop branch to load 200M keys, it triggers an assertion. The code below can be used to reproduce it.

#include <lipp.h>
#include <iostream>
#include <vector>
#include <algorithm>
#include <fstream>

using namespace std;

template<class T>
bool load_binary_data(T data[], size_t length, const std::string &file_path) {
    std::ifstream is(file_path.c_str(), std::ios::binary | std::ios::in);
    if (!is.is_open()) {
        return false;
    }
    is.read(reinterpret_cast<char *>(data), std::streamsize(length * sizeof(T)));
    is.close();

    return true;
}

int main()
{
    LIPP<uint64_t, uint64_t> lipp;
    size_t data_size = 200000000;

    // prepare data: load and sort all 200M keys of the dataset
    uint64_t *keys = new uint64_t[data_size];
    if (!load_binary_data(keys, data_size, "./osm_cellids_200M_uint64")) {
        cerr << "failed to open ./osm_cellids_200M_uint64" << endl;
        return 1;
    }

    vector<pair<uint64_t, uint64_t>> data;
    std::sort(keys, keys + data_size);
    for (size_t i = 0; i < data_size; i++) {
        data.push_back({keys[i], keys[i]});
    }

    // bulk load
    lipp.bulk_load(data.data(), data.size());

    // verify LIPP data structure
    lipp.verify();

    cout << "bulk load success" << endl;

    delete[] keys;
    return 0;
}

The output is

RT_ASSERT Error at ./code/lipp/src/core/lipp.h:675, `next - offset <= (size+2) / 3` 
Jiacheng-WU commented 3 years ago

Oh, we forgot to remove this assertion.

Jiacheng-WU commented 3 years ago

We have updated the develop branch; you can try it again.

zhczhong commented 3 years ago

Thanks, the problem is solved! But the current performance seems to be a little worse than ALEX.

Jiacheng-WU commented 3 years ago

Probably. Different hardware configurations (CPU frequency, memory bandwidth, instruction set) influence overall performance. Different workloads and datasets also affect the shape of the index structure and can thus degrade performance.