OpenCilk / opencilk-project

Monorepo for the OpenCilk compiler. Forked from llvm/llvm-project and based on Tapir/LLVM.
Other
93 stars 29 forks source link

Incorrect Parallel execution #148

Closed wheatman closed 9 months ago

wheatman commented 1 year ago

Describe the bug

Races seem to be created in a parallel for loop which is read only to shared data and writes to only local data. The race then causes a out of memory crash

Expected behavior

For the program to not have races

OpenCilk version

clang version 14.0.6 (https://github.com/OpenCilk/opencilk-project fc90ded2b090672f84c58d12d8d85cd999eb6c1a) Target: x86_64-unknown-linux-gnu Thread model: posix

Steps to reproduce (include relevant output)

I found it with the tlx repo which can be cloned from https://github.com/tlx/tlx.git

I found the bug in code which looked like this


#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <random>
#include <sys/time.h>
#include <vector>

#include <cilk/cilk.h>

#include "tlx/container/btree_set.hpp"

std::vector<uint64_t> create_random_data(size_t n, size_t max_val,
                                         std::seed_seq &seed) {

  std::mt19937_64 eng(seed); // a source of random data

  std::uniform_int_distribution<uint64_t> dist(0, max_val);
  std::vector<uint64_t> v(n);

  generate(begin(v), end(v), std::bind(dist, eng));
  return v;
}

void test_parallel_merge_map1(uint64_t max_size, std::seed_seq &seed) {

  uint64_t num_chunks = 480;
  uint64_t chunk_size = 10000 / num_chunks;

  std::vector<uint64_t> data = create_random_data(max_size, 10000, seed);
  std::seed_seq seed2{1};
  std::vector<uint64_t> data2 = create_random_data(max_size, 10000, seed2);

  tlx::btree_set<uint64_t> map1(data.begin(), data.end());
  tlx::btree_set<uint64_t> map2(data2.begin(), data2.end());

  cilk_for(uint64_t chunk_idx = 0; chunk_idx < num_chunks; chunk_idx++) {
    uint64_t start_key = 1 + chunk_idx * chunk_size;
    uint64_t end_key = (chunk_idx == num_chunks - 1)
                           ? std::numeric_limits<uint64_t>::max()
                           : 1 + (chunk_idx + 1) * chunk_size;
    std::vector<uint64_t> merged_chunk;
    tlx::btree_set<uint64_t>::const_iterator start1 =
        map1.lower_bound(start_key);
    tlx::btree_set<uint64_t>::const_iterator start2 =
        map2.lower_bound(start_key);
    tlx::btree_set<uint64_t>::const_iterator end1 = map1.upper_bound(end_key);
    tlx::btree_set<uint64_t>::const_iterator end2 = map2.upper_bound(end_key);
    std::merge(start1, end1, start2, end2, std::back_inserter(merged_chunk));
  }

  return;
}

int main() {
  std::seed_seq seed{0};
  test_parallel_merge_map1(1000, seed);

  return 0;
}

When I run this it is killed due to using up all memory, cilk_san shows the following races, but note that std::merge, should be read only to its input iterators (which are also const_iterators) and only write to the local output.

Running Cilksan race detector.
Race detected on location 7fff6e4b2178
*    Write 487b65 merge<tlx::BTree<unsigned long, unsigned long, tlx::btree_set<unsigned long, std::less<unsigned long>, tlx::btree_default_traits<unsigned long, unsigned long>, std::allocator<unsigned long> >::key_of_value, std::less<unsigned long>, tlx::btree_default_traits<unsigned long, unsigned long>, false, std::allocator<unsigned long> >::const_iterator, tlx::BTree<unsigned long, unsigned long, tlx::btree_set<unsigned long, std::less<unsigned long>, tlx::btree_default_traits<unsigned long, unsigned long>, std::allocator<unsigned long> >::key_of_value, std::less<unsigned long>, tlx::btree_default_traits<unsigned long, unsigned long>, false, std::allocator<unsigned long> >::const_iterator, std::back_insert_iterator<std::vector<unsigned long, std::allocator<unsigned long> > > > /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/stl_algo.h:4927:27
|        `-to variable __last2 (declared at /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/stl_algo.h:4928)
|*   Write 487b65 merge<tlx::BTree<unsigned long, unsigned long, tlx::btree_set<unsigned long, std::less<unsigned long>, tlx::btree_default_traits<unsigned long, unsigned long>, std::allocator<unsigned long> >::key_of_value, std::less<unsigned long>, tlx::btree_default_traits<unsigned long, unsigned long>, false, std::allocator<unsigned long> >::const_iterator, tlx::BTree<unsigned long, unsigned long, tlx::btree_set<unsigned long, std::less<unsigned long>, tlx::btree_default_traits<unsigned long, unsigned long>, std::allocator<unsigned long> >::key_of_value, std::less<unsigned long>, tlx::btree_default_traits<unsigned long, unsigned long>, false, std::allocator<unsigned long> >::const_iterator, std::back_insert_iterator<std::vector<unsigned long, std::allocator<unsigned long> > > > /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/stl_algo.h:4927:27
||       `-to variable __last2 (declared at /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/stl_algo.h:4928)
\| Common calling context
 +  Parfor 48543b test_parallel_merge_map1 /home/wheatman/crash_testing/test.cpp:37:3
 +    Call 486d5e main /home/wheatman/crash_testing/test.cpp:89:3
   Allocation context
    Stack object __last2 (declared at /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/stl_algo.h:4928)
     Alloc 48503f in test_parallel_merge_map1 /home/wheatman/crash_testing/test.cpp
      Call 486d5e main /home/wheatman/crash_testing/test.cpp:89:3

Race detected on location 7fff6e4b2180
*    Write 487b83 merge<tlx::BTree<unsigned long, unsigned long, tlx::btree_set<unsigned long, std::less<unsigned long>, tlx::btree_default_traits<unsigned long, unsigned long>, std::allocator<unsigned long> >::key_of_value, std::less<unsigned long>, tlx::btree_default_traits<unsigned long, unsigned long>, false, std::allocator<unsigned long> >::const_iterator, tlx::BTree<unsigned long, unsigned long, tlx::btree_set<unsigned long, std::less<unsigned long>, tlx::btree_default_traits<unsigned long, unsigned long>, std::allocator<unsigned long> >::key_of_value, std::less<unsigned long>, tlx::btree_default_traits<unsigned long, unsigned long>, false, std::allocator<unsigned long> >::const_iterator, std::back_insert_iterator<std::vector<unsigned long, std::allocator<unsigned long> > > > /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/stl_algo.h:4927:27
|        `-to variable __last2 (declared at /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/stl_algo.h:4928)
|*   Write 487b83 merge<tlx::BTree<unsigned long, unsigned long, tlx::btree_set<unsigned long, std::less<unsigned long>, tlx::btree_default_traits<unsigned long, unsigned long>, std::allocator<unsigned long> >::key_of_value, std::less<unsigned long>, tlx::btree_default_traits<unsigned long, unsigned long>, false, std::allocator<unsigned long> >::const_iterator, tlx::BTree<unsigned long, unsigned long, tlx::btree_set<unsigned long, std::less<unsigned long>, tlx::btree_default_traits<unsigned long, unsigned long>, std::allocator<unsigned long> >::key_of_value, std::less<unsigned long>, tlx::btree_default_traits<unsigned long, unsigned long>, false, std::allocator<unsigned long> >::const_iterator, std::back_insert_iterator<std::vector<unsigned long, std::allocator<unsigned long> > > > /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/stl_algo.h:4927:27
||       `-to variable __last2 (declared at /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/stl_algo.h:4928)
\| Common calling context
 +  Parfor 48543b test_parallel_merge_map1 /home/wheatman/crash_testing/test.cpp:37:3
 +    Call 486d5e main /home/wheatman/crash_testing/test.cpp:89:3
   Allocation context
    Stack object __last2 (declared at /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/stl_algo.h:4928)
     Alloc 48503f in test_parallel_merge_map1 /home/wheatman/crash_testing/test.cpp
      Call 486d5e main /home/wheatman/crash_testing/test.cpp:89:3

I thought this could have something similar to #147 with the allocation being hoisted so I rewrote the function in the following way so that all allocations happened up front and the issue remained

void test_parallel_merge_map2(uint64_t max_size, std::seed_seq &seed) {

  uint64_t num_chunks = 480;
  uint64_t chunk_size = 10000 / num_chunks;

  std::vector<uint64_t> data = create_random_data(max_size, 10000, seed);
  std::seed_seq seed2{1};
  std::vector<uint64_t> data2 = create_random_data(max_size, 10000, seed2);

  tlx::btree_set<uint64_t> map1(data.begin(), data.end());
  tlx::btree_set<uint64_t> map2(data2.begin(), data2.end());

  std::vector<std::vector<uint64_t>> chunks(num_chunks);

  cilk_for(uint64_t chunk_idx = 0; chunk_idx < num_chunks; chunk_idx++) {
    uint64_t start_key = 1 + chunk_idx * chunk_size;
    uint64_t end_key = (chunk_idx == num_chunks - 1)
                           ? std::numeric_limits<uint64_t>::max()
                           : 1 + (chunk_idx + 1) * chunk_size;
    tlx::btree_set<uint64_t>::const_iterator start1 =
        map1.lower_bound(start_key);
    tlx::btree_set<uint64_t>::const_iterator start2 =
        map2.lower_bound(start_key);
    tlx::btree_set<uint64_t>::const_iterator end1 = map1.upper_bound(end_key);
    tlx::btree_set<uint64_t>::const_iterator end2 = map2.upper_bound(end_key);
    std::merge(start1, end1, start2, end2,
               std::back_inserter(chunks[chunk_idx]));
  }

  return;
}

I am compiling with ~/opencilk/build/bin/clang++ -std=c++20 test.cpp -DCILK=1 -fopencilk -O1 -Itlx/ -g

The issue does not happen if I compile with -O0

neboat commented 1 year ago

Thanks for the report. We'll look into it.

I haven't yet been able to replicate this failure on my end, but I'm setting up an environment where I might have better luck. We'll keep you posted.

wheatman commented 1 year ago

I originally found the issue on a ubuntu server from aws, then created the minimal example on wsl on my laptop, if that helps. I also checked to see if it was becuase I was on 2.0 instead of 2.0.1, but after updating to f54e98722992462e042f0686c74f710277c92cdb the issue was still there.

Also the out of memory crash does not happen every time, about 6 in 1o times, though the race is found every time. Also I tried to make a more minimal version but wasn't able to remove much, its attached if its at all helpful. test.txt

neboat commented 1 year ago

Following-up, I managed to replicate this issue and find a fix for it. I've created a PR with that fix, which I plan to merge once the tests finish running. The fix also seems to fix issue #147.

neboat commented 9 months ago

This issue is fixed in the latest release.