apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0
14.47k stars 3.52k forks source link

[C++][CI] a crossbow job with MinRelSize enabled #31132

Closed asfimport closed 2 years ago

asfimport commented 2 years ago

Reporter: Jonathan Keane / @jonkeane Assignee: Kouhei Sutou / @kou

Related issues:

Note: This issue was originally created as ARROW-15678. Please see the migration documentation for further details.

asfimport commented 2 years ago

Jonathan Keane / @jonkeane: The pull request linked has the starts of this — but there's still an unidentified segfault in one of the tests

asfimport commented 2 years ago

Weston Pace / @westonpace: Ok, so I was finally able to track this down. Fortunately (unfortunately?) it is not really a compiler bug (or maybe it is, I'm not sure). At the very least, I think we can avoid it.

level_comparison.cc is compiled with -msse4.2.

level_comparison_avx2.cc is compiled with -mavx2

This is expected and the functions they generate are housed in separate namespaces so they don't get confused. However, both functions rely on the function arrow::internal::FirstTimeBitmapWriter::AppendWord. The function is not templated but it is defined in the header file (and is not marked inline). I'm not really sure how we aren't getting a duplicate symbol error but some reading suggests it is implicitly inlined at link time.

In the object file (libparquet.a), there are two identical symbols named __ZN5arrow8internal21FirstTimeBitmapWriter10AppendWordEyx. One of them has SHLX and one of them has SHL. This disassembly of the SHLX version matches exactly the disassembly in the stack trace that @jonkeane posted in the PR. The two calling functions are (parquet::internal::standard::DefLevelsBatchToBitmap and parquet::internal::bmi2::DefLevelsBatchToBitmap.

So I think, the -O3 version is inlining the functions. The -Os version is not (-Os seems to discourage inlining in general). The linker is then faced with two identical symbols and just picks one (again, trying to optimize for size). It just so happens the version it picked was the one with SHLX.

So, as a test, we can try splitting the implementation part of bitmap_writer.h into bitmap_writer.cc (at least for FirstTimeBitmapWriter). The .cc file should then only be compiled once (with sse4.2). However, it's very possible we are just hitting the tip of the iceberg here, as any header file linked in by these avx2 compiled versions could be a ticking time bomb.

asfimport commented 2 years ago

David Li / @lidavidm: Good catch. This is exactly the same problem we ran into before with kernels: https://github.com/apache/arrow/blob/6c10a389bbc35b67187930dc0db2a88671e76c2d/cpp/src/arrow/compute/kernels/aggregate_internal.h#L135-L138 (ARROW-13382). I wonder if we should reconsider the plan of vectorizing kernels by rebuilding the same source multiple times given this potential pitfall.

asfimport commented 2 years ago

Weston Pace / @westonpace: We can perhaps add a static check that reports symbols outside the appropriate namespace. We might need some configurable suppression. For example, level_conversion_bmi2.cc.o would report:


0000000000000000 W arrow::util::ArrowLogBase& arrow::util::ArrowLogBase::operator<< <char [51]>(char const (&) [51])
0000000000000000 W std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > arrow::util::StringBuilder<char const (&) [33]>(char const (&) [33])
0000000000000000 W void arrow::util::StringBuilderRecursive<char const (&) [33]>(std::ostream&, char const (&) [33])
0000000000000000 W arrow::util::detail::StringStreamWrapper::stream()
0000000000000000 W arrow::util::Voidify::operator&(arrow::util::ArrowLogBase&)
0000000000000000 W arrow::util::Voidify::Voidify()
0000000000000000 W arrow::bit_util::BytesForBits(long)
0000000000000000 W arrow::internal::FirstTimeBitmapWriter::AppendWord(unsigned long, long)
0000000000000000 W arrow::internal::FirstTimeBitmapWriter::Finish()
0000000000000000 W arrow::internal::FirstTimeBitmapWriter::FirstTimeBitmapWriter(unsigned char*, long, long)
0000000000000000 W parquet::ParquetException::ParquetException<char const (&) [33]>(char const (&) [33])
0000000000000000 W parquet::ParquetException::~ParquetException()
0000000000000000 W parquet::ParquetException::~ParquetException()
0000000000000000 W arrow::internal::FirstTimeBitmapWriter::position() const
0000000000000000 W parquet::ParquetException::what() const
0000000000000000 W std::exception::exception()
0000000000000000 W char const (&std::forward<char const (&) [33]>(std::remove_reference<char const (&) [33]>::type&)) [33]

level_comparison_avx.cc.o looks to be in better shape:


0000000000000000 W short const& std::max<short>(short const&, short const&)
0000000000000000 W short const& std::min<short>(short const&, short const&)

But yes, if we have a better solution for this problem it might be safer.

asfimport commented 2 years ago

Antoine Pitrou / @pitrou: Wow, thanks for the diagnosis @westonpace. So, it turns out that our method for compiling multiple versions of code is violating the one-definition-rule for any inline function or method used in the caller code.

asfimport commented 2 years ago

Antoine Pitrou / @pitrou: @bkietz You may have some idea about how to fix this cleanly and reliably.

asfimport commented 2 years ago

Antoine Pitrou / @pitrou: So, currently we are doing something such as:


clang -c something_avx2.cc -mavx2 

An alternative would be not to pass the optimization flag on the command line but enable it selectively inside the source code, e.g.:


clang -c something_avx2.cc -DARROW_SPECIALIZED_SIMD_TARGET=avx2

namespace parquet {
namespace internal {
namespace PARQUET_IMPL_NAMESPACE {

#ifdef ARROW_SPECIALIZED_SIMD_TARGET

#define STRINGIFY_EXPANDED(a) ARROW_STRINGIFY(a)
#pragma clang attribute push (__attribute__((target( STRINGIFY_EXPANDED(ARROW_SPECIALIZED_SIMD_TARGET)) )), apply_to=function)

#endif

...

#ifdef ARROW_SPECIALIZED_SIMD_TARGET
#pragma clang attribute pop
#endif

}  // namespace PARQUET_IMPL_NAMESPACE
}  // namespace internal
}  // namespace parquet

This way we would avoid enabling the particular instruction set on code inlined from other headers. Of course, perhaps that's not actually desirable...

asfimport commented 2 years ago

Antoine Pitrou / @pitrou: In any case, this is probably too involved a change for 8.0.0, so the 8.0.0 fix would simply to disable SIMD optimizations for Homebrew?

asfimport commented 2 years ago

Weston Pace / @westonpace: That seems like an good solution to me. I had no idea it was possible.

If we want the other headers to be included then we already have a bit of a solution demonstrated in level_conversion_inc.h. In the common header file you require some kind of target namespace to be defined.


namespace parquet {
namespace internal {
#ifndef PARQUET_IMPL_NAMESPACE
#error "PARQUET_IMPL_NAMESPACE must be defined"
#endif
namespace PARQUET_IMPL_NAMESPACE {
...
}  // namespace PARQUET_IMPL_NAMESPACE
}  // namespace internal
}  // namespace parquet

However, anything that includes one of these "common headers" must define that namespace...


#define PARQUET_IMPL_NAMESPACE standard
#include "parquet/level_conversion_inc.h"
#undef PARQUET_IMPL_NAMESPACE
asfimport commented 2 years ago

Jonathan Keane / @jonkeane: @kou Do you think you might be able to take a look at this?

The comment at https://github.com/apache/arrow/pull/12928#issuecomment-1105955726 has a good explanation of what's going on and following that there are a few possible fixes (though none of them were fully implemented or decided

asfimport commented 2 years ago

Kouhei Sutou / @kou: How about using template to distinct implementation for each architecture?


diff --git a/cpp/src/arrow/compute/kernels/codegen_internal.h b/cpp/src/arrow/compute/kernels/codegen_internal.h
index fa50427bc3..a4bd0eb586 100644
--- a/cpp/src/arrow/compute/kernels/codegen_internal.h
+++ b/cpp/src/arrow/compute/kernels/codegen_internal.h
@@ -710,8 +710,8 @@ struct ScalarUnaryNotNullStateful {
                        Datum* out) {
       Status st = Status::OK();
       ArrayData* out_arr = out->mutable_array();
-      FirstTimeBitmapWriter out_writer(out_arr->buffers[1]->mutable_data(),
-                                       out_arr->offset, out_arr->length);
+      FirstTimeBitmapWriter<> out_writer(out_arr->buffers[1]->mutable_data(),
+                                         out_arr->offset, out_arr->length);
       VisitArrayValuesInline<Arg0Type>(
           arg0,
           [&](Arg0Value v) {
diff --git a/cpp/src/arrow/compute/kernels/row_encoder.cc b/cpp/src/arrow/compute/kernels/row_encoder.cc
index 10a1f4cda5..26316ec315 100644
--- a/cpp/src/arrow/compute/kernels/row_encoder.cc
+++ b/cpp/src/arrow/compute/kernels/row_encoder.cc
@@ -42,7 +42,7 @@ Status KeyEncoder::DecodeNulls(MemoryPool* pool, int32_t length, uint8_t** encod
     ARROW_ASSIGN_OR_RAISE(*null_bitmap, AllocateBitmap(length, pool));
     uint8_t* validity = (*null_bitmap)->mutable_data();

-    FirstTimeBitmapWriter writer(validity, 0, length);
+    FirstTimeBitmapWriter<> writer(validity, 0, length);
     for (int32_t i = 0; i < length; ++i) {
       if (encoded_bytes[i][0] == kValidByte) {
         writer.Set();
diff --git a/cpp/src/arrow/compute/kernels/scalar_set_lookup.cc b/cpp/src/arrow/compute/kernels/scalar_set_lookup.cc
index 7d8d2edc4b..433df0f1b7 100644
--- a/cpp/src/arrow/compute/kernels/scalar_set_lookup.cc
+++ b/cpp/src/arrow/compute/kernels/scalar_set_lookup.cc
@@ -353,8 +353,8 @@ struct IsInVisitor {
     const auto& state = checked_cast<const SetLookupState<Type>&>(*ctx->state());
     ArrayData* output = out->mutable_array();

-    FirstTimeBitmapWriter writer(output->buffers[1]->mutable_data(), output->offset,
-                                 output->length);
+    FirstTimeBitmapWriter<> writer(output->buffers[1]->mutable_data(), output->offset,
+                                   output->length);

     VisitArrayDataInline<Type>(
         this->data,
diff --git a/cpp/src/arrow/compute/kernels/scalar_string_ascii.cc b/cpp/src/arrow/compute/kernels/scalar_string_ascii.cc
index 611601cab8..da7de1c277 100644
--- a/cpp/src/arrow/compute/kernels/scalar_string_ascii.cc
+++ b/cpp/src/arrow/compute/kernels/scalar_string_ascii.cc
@@ -1456,7 +1456,7 @@ struct MatchSubstringImpl {
         [&matcher](const void* raw_offsets, const uint8_t* data, int64_t length,
                    int64_t output_offset, uint8_t* output) {
           const offset_type* offsets = reinterpret_cast<const offset_type*>(raw_offsets);
-          FirstTimeBitmapWriter bitmap_writer(output, output_offset, length);
+          FirstTimeBitmapWriter<> bitmap_writer(output, output_offset, length);
           for (int64_t i = 0; i < length; ++i) {
             const char* current_data = reinterpret_cast<const char*>(data + offsets[i]);
             int64_t current_length = offsets[i + 1] - offsets[i];
diff --git a/cpp/src/arrow/util/bit_util_benchmark.cc b/cpp/src/arrow/util/bit_util_benchmark.cc
index 258fd27785..66a81b4e04 100644
--- a/cpp/src/arrow/util/bit_util_benchmark.cc
+++ b/cpp/src/arrow/util/bit_util_benchmark.cc
@@ -386,7 +386,7 @@ static void BitmapWriter(benchmark::State& state) {
 }

 static void FirstTimeBitmapWriter(benchmark::State& state) {
-  BenchmarkBitmapWriter<internal::FirstTimeBitmapWriter>(state, state.range(0));
+  BenchmarkBitmapWriter<internal::FirstTimeBitmapWriter<>>(state, state.range(0));
 }

 struct GenerateBitsFunctor {
diff --git a/cpp/src/arrow/util/bit_util_test.cc b/cpp/src/arrow/util/bit_util_test.cc
index 6c2aff4fbe..9b9f19feb1 100644
--- a/cpp/src/arrow/util/bit_util_test.cc
+++ b/cpp/src/arrow/util/bit_util_test.cc
@@ -832,14 +832,14 @@ TEST(FirstTimeBitmapWriter, NormalOperation) {
     const uint8_t fill_byte = static_cast<uint8_t>(fill_byte_int);
     {
       uint8_t bitmap[] = {fill_byte, fill_byte, fill_byte, fill_byte};
-      auto writer = internal::FirstTimeBitmapWriter(bitmap, 0, 12);
+      auto writer = internal::FirstTimeBitmapWriter<>(bitmap, 0, 12);
       WriteVectorToWriter(writer, {0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1});
       //                      {0b00110110, 0b1010, 0, 0}
       ASSERT_BYTES_EQ(bitmap, {0x36, 0x0a});
     }
     {
       uint8_t bitmap[] = {fill_byte, fill_byte, fill_byte, fill_byte};
-      auto writer = internal::FirstTimeBitmapWriter(bitmap, 4, 12);
+      auto writer = internal::FirstTimeBitmapWriter<>(bitmap, 4, 12);
       WriteVectorToWriter(writer, {0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1});
       //                      {0b00110110, 0b1010, 0, 0}
       ASSERT_BYTES_EQ(bitmap, {static_cast<uint8_t>(0x60 | (fill_byte & 0x0f)), 0xa3});
@@ -848,15 +848,15 @@ TEST(FirstTimeBitmapWriter, NormalOperation) {
     {
       uint8_t bitmap[] = {fill_byte, fill_byte, fill_byte, fill_byte};
       {
-        auto writer = internal::FirstTimeBitmapWriter(bitmap, 0, 6);
+        auto writer = internal::FirstTimeBitmapWriter<>(bitmap, 0, 6);
         WriteVectorToWriter(writer, {0, 1, 1, 0, 1, 1});
       }
       {
-        auto writer = internal::FirstTimeBitmapWriter(bitmap, 6, 3);
+        auto writer = internal::FirstTimeBitmapWriter<>(bitmap, 6, 3);
         WriteVectorToWriter(writer, {0, 0, 0});
       }
       {
-        auto writer = internal::FirstTimeBitmapWriter(bitmap, 9, 3);
+        auto writer = internal::FirstTimeBitmapWriter<>(bitmap, 9, 3);
         WriteVectorToWriter(writer, {1, 0, 1});
       }
       ASSERT_BYTES_EQ(bitmap, {0x36, 0x0a});
@@ -864,23 +864,23 @@ TEST(FirstTimeBitmapWriter, NormalOperation) {
     {
       uint8_t bitmap[] = {fill_byte, fill_byte, fill_byte, fill_byte};
       {
-        auto writer = internal::FirstTimeBitmapWriter(bitmap, 4, 0);
+        auto writer = internal::FirstTimeBitmapWriter<>(bitmap, 4, 0);
         WriteVectorToWriter(writer, {});
       }
       {
-        auto writer = internal::FirstTimeBitmapWriter(bitmap, 4, 6);
+        auto writer = internal::FirstTimeBitmapWriter<>(bitmap, 4, 6);
         WriteVectorToWriter(writer, {0, 1, 1, 0, 1, 1});
       }
       {
-        auto writer = internal::FirstTimeBitmapWriter(bitmap, 10, 3);
+        auto writer = internal::FirstTimeBitmapWriter<>(bitmap, 10, 3);
         WriteVectorToWriter(writer, {0, 0, 0});
       }
       {
-        auto writer = internal::FirstTimeBitmapWriter(bitmap, 13, 0);
+        auto writer = internal::FirstTimeBitmapWriter<>(bitmap, 13, 0);
         WriteVectorToWriter(writer, {});
       }
       {
-        auto writer = internal::FirstTimeBitmapWriter(bitmap, 13, 3);
+        auto writer = internal::FirstTimeBitmapWriter<>(bitmap, 13, 3);
         WriteVectorToWriter(writer, {1, 0, 1});
       }
       ASSERT_BYTES_EQ(bitmap, {static_cast<uint8_t>(0x60 | (fill_byte & 0x0f)), 0xa3});
@@ -900,8 +900,8 @@ TEST(FirstTimeBitmapWriter, AppendWordOffsetOverwritesCorrectBitsOnExistingByte)
   auto check_append = [](const std::string& expected_bits, int64_t offset) {
     std::vector<uint8_t> valid_bits = {0x00};
     constexpr int64_t kBitsAfterAppend = 8;
-    internal::FirstTimeBitmapWriter writer(valid_bits.data(), offset,
-                                           /*length=*/(8 * valid_bits.size()) - offset);
+    internal::FirstTimeBitmapWriter<> writer(valid_bits.data(), offset,
+                                             /*length=*/(8 * valid_bits.size()) - offset);
     writer.AppendWord(/*word=*/0xFF, /*number_of_bits=*/kBitsAfterAppend - offset);
     writer.Finish();
     EXPECT_EQ(BitmapToString(valid_bits, kBitsAfterAppend), expected_bits);
@@ -918,8 +918,8 @@ TEST(FirstTimeBitmapWriter, AppendWordOffsetOverwritesCorrectBitsOnExistingByte)
   auto check_with_set = [](const std::string& expected_bits, int64_t offset) {
     std::vector<uint8_t> valid_bits = {0x1};
     constexpr int64_t kBitsAfterAppend = 8;
-    internal::FirstTimeBitmapWriter writer(valid_bits.data(), offset,
-                                           /*length=*/(8 * valid_bits.size()) - offset);
+    internal::FirstTimeBitmapWriter<> writer(valid_bits.data(), offset,
+                                             /*length=*/(8 * valid_bits.size()) - offset);
     writer.AppendWord(/*word=*/0xFF, /*number_of_bits=*/kBitsAfterAppend - offset);
     writer.Finish();
     EXPECT_EQ(BitmapToString(valid_bits, kBitsAfterAppend), expected_bits);
@@ -936,8 +936,8 @@ TEST(FirstTimeBitmapWriter, AppendWordOffsetOverwritesCorrectBitsOnExistingByte)
   auto check_with_preceding = [](const std::string& expected_bits, int64_t offset) {
     std::vector<uint8_t> valid_bits = {0xFF};
     constexpr int64_t kBitsAfterAppend = 8;
-    internal::FirstTimeBitmapWriter writer(valid_bits.data(), offset,
-                                           /*length=*/(8 * valid_bits.size()) - offset);
+    internal::FirstTimeBitmapWriter<> writer(valid_bits.data(), offset,
+                                             /*length=*/(8 * valid_bits.size()) - offset);
     writer.AppendWord(/*word=*/0xFF, /*number_of_bits=*/kBitsAfterAppend - offset);
     writer.Finish();
     EXPECT_EQ(BitmapToString(valid_bits, kBitsAfterAppend), expected_bits);
@@ -954,8 +954,8 @@ TEST(FirstTimeBitmapWriter, AppendWordOffsetOverwritesCorrectBitsOnExistingByte)

 TEST(FirstTimeBitmapWriter, AppendZeroBitsHasNoImpact) {
   std::vector<uint8_t> valid_bits(/*count=*/1, 0);
-  internal::FirstTimeBitmapWriter writer(valid_bits.data(), /*start_offset=*/1,
-                                         /*length=*/valid_bits.size() * 8);
+  internal::FirstTimeBitmapWriter<> writer(valid_bits.data(), /*start_offset=*/1,
+                                           /*length=*/valid_bits.size() * 8);
   writer.AppendWord(/*word=*/0xFF, /*number_of_bits=*/0);
   writer.AppendWord(/*word=*/0xFF, /*number_of_bits=*/0);
   writer.AppendWord(/*word=*/0x01, /*number_of_bits=*/1);
@@ -966,8 +966,8 @@ TEST(FirstTimeBitmapWriter, AppendZeroBitsHasNoImpact) {
 TEST(FirstTimeBitmapWriter, AppendLessThanByte) {
   {
     std::vector<uint8_t> valid_bits(/*count*/ 8, 0);
-    internal::FirstTimeBitmapWriter writer(valid_bits.data(), /*start_offset=*/1,
-                                           /*length=*/8);
+    internal::FirstTimeBitmapWriter<> writer(valid_bits.data(), /*start_offset=*/1,
+                                             /*length=*/8);
     writer.AppendWord(0xB, 4);
     writer.Finish();
     EXPECT_EQ(BitmapToString(valid_bits, /*bit_count=*/8), "01101000");
@@ -975,8 +975,8 @@ TEST(FirstTimeBitmapWriter, AppendLessThanByte) {
   {
     // Test with all bits initially set.
     std::vector<uint8_t> valid_bits(/*count*/ 8, 0xFF);
-    internal::FirstTimeBitmapWriter writer(valid_bits.data(), /*start_offset=*/1,
-                                           /*length=*/8);
+    internal::FirstTimeBitmapWriter<> writer(valid_bits.data(), /*start_offset=*/1,
+                                             /*length=*/8);
     writer.AppendWord(0xB, 4);
     writer.Finish();
     EXPECT_EQ(BitmapToString(valid_bits, /*bit_count=*/8), "11101000");
@@ -986,8 +986,8 @@ TEST(FirstTimeBitmapWriter, AppendLessThanByte) {
 TEST(FirstTimeBitmapWriter, AppendByteThenMore) {
   {
     std::vector<uint8_t> valid_bits(/*count*/ 8, 0);
-    internal::FirstTimeBitmapWriter writer(valid_bits.data(), /*start_offset=*/0,
-                                           /*length=*/9);
+    internal::FirstTimeBitmapWriter<> writer(valid_bits.data(), /*start_offset=*/0,
+                                             /*length=*/9);
     writer.AppendWord(0xC3, 8);
     writer.AppendWord(0x01, 1);
     writer.Finish();
@@ -995,8 +995,8 @@ TEST(FirstTimeBitmapWriter, AppendByteThenMore) {
   }
   {
     std::vector<uint8_t> valid_bits(/*count*/ 8, 0xFF);
-    internal::FirstTimeBitmapWriter writer(valid_bits.data(), /*start_offset=*/0,
-                                           /*length=*/9);
+    internal::FirstTimeBitmapWriter<> writer(valid_bits.data(), /*start_offset=*/0,
+                                             /*length=*/9);
     writer.AppendWord(0xC3, 8);
     writer.AppendWord(0x01, 1);
     writer.Finish();
@@ -1012,8 +1012,8 @@ TEST(FirstTimeBitmapWriter, AppendWordShiftsBitsCorrectly) {
     ASSERT_GE(offset, 8);
     std::vector<uint8_t> valid_bits(/*count=*/10, preset_buffer_bits ? 0xFF : 0);
     valid_bits[0] = 0x99;
-    internal::FirstTimeBitmapWriter writer(valid_bits.data(), offset,
-                                           /*length=*/(9 * sizeof(kPattern)) - offset);
+    internal::FirstTimeBitmapWriter<> writer(valid_bits.data(), offset,
+                                             /*length=*/(9 * sizeof(kPattern)) - offset);
     writer.AppendWord(/*word=*/kPattern, /*number_of_bits=*/64);
     writer.Finish();
     EXPECT_EQ(valid_bits[0], 0x99);  // shouldn't get changed.
@@ -1051,15 +1051,15 @@ TEST(TestAppendBitmap, AppendWordOnlyAppropriateBytesWritten) {

   uint64_t bitmap = 0x1FF;
   {
-    internal::FirstTimeBitmapWriter writer(valid_bits.data(), /*start_offset=*/1,
-                                           /*length=*/(8 * valid_bits.size()) - 1);
+    internal::FirstTimeBitmapWriter<> writer(valid_bits.data(), /*start_offset=*/1,
+                                             /*length=*/(8 * valid_bits.size()) - 1);
     writer.AppendWord(bitmap, /*number_of_bits*/ 7);
     writer.Finish();
     EXPECT_THAT(valid_bits, ElementsAreArray(std::vector<uint8_t>{0xFE, 0x00}));
   }
   {
-    internal::FirstTimeBitmapWriter writer(valid_bits.data(), /*start_offset=*/1,
-                                           /*length=*/(8 * valid_bits.size()) - 1);
+    internal::FirstTimeBitmapWriter<> writer(valid_bits.data(), /*start_offset=*/1,
+                                             /*length=*/(8 * valid_bits.size()) - 1);
     writer.AppendWord(bitmap, /*number_of_bits*/ 8);
     writer.Finish();
     EXPECT_THAT(valid_bits, ElementsAreArray(std::vector<uint8_t>{0xFE, 0x03}));
diff --git a/cpp/src/arrow/util/bitmap_writer.h b/cpp/src/arrow/util/bitmap_writer.h
index 65d0d188d7..7a70b16f15 100644
--- a/cpp/src/arrow/util/bitmap_writer.h
+++ b/cpp/src/arrow/util/bitmap_writer.h
@@ -21,6 +21,8 @@
 #include <cstring>

 #include "arrow/util/bit_util.h"
+#include "arrow/util/config.h"
+#include "arrow/util/dispatch.h"
 #include "arrow/util/endian.h"
 #include "arrow/util/macros.h"

@@ -78,6 +80,7 @@ class BitmapWriter {
   int64_t byte_offset_;
 };

+template <DispatchLevel level = ARROW_COMPILE_TIME_DISPATCH_LEVEL>
 class FirstTimeBitmapWriter {
   // Like BitmapWriter, but any bit values *following* the bits written
   // might be clobbered.  It is hence faster than BitmapWriter, and can
diff --git a/cpp/src/arrow/util/config.h.cmake b/cpp/src/arrow/util/config.h.cmake
index 55bc2d0100..1ecb4f39f3 100644
--- a/cpp/src/arrow/util/config.h.cmake
+++ b/cpp/src/arrow/util/config.h.cmake
@@ -36,6 +36,8 @@

 #define ARROW_PACKAGE_KIND "@ARROW_PACKAGE_KIND@"

+#define ARROW_COMPILE_TIME_DISPATCH_LEVEL ::arrow::internal::DispatchLevel::@ARROW_SIMD_LEVEL@
+
 #cmakedefine ARROW_COMPUTE
 #cmakedefine ARROW_CSV
 #cmakedefine ARROW_CUDA
diff --git a/cpp/src/arrow/util/dispatch.h b/cpp/src/arrow/util/dispatch.h
index fae9293f9e..d6f4dbb028 100644
--- a/cpp/src/arrow/util/dispatch.h
+++ b/cpp/src/arrow/util/dispatch.h
@@ -33,6 +33,7 @@ enum class DispatchLevel : int {
   SSE4_2,
   AVX2,
   AVX512,
+  BMI2,
   NEON,
   MAX
 };
@@ -105,6 +106,8 @@ class DynamicDispatch {
         return cpu_info->IsSupported(CpuInfo::AVX2);
       case DispatchLevel::AVX512:
         return cpu_info->IsSupported(CpuInfo::AVX512);
+      case DispatchLevel::BMI2:
+        return cpu_info->IsSupported(CpuInfo::BMI2);
       default:
         return false;
     }
diff --git a/cpp/src/parquet/encoding.cc b/cpp/src/parquet/encoding.cc
index af0e543c3e..34f0eef3b5 100644
--- a/cpp/src/parquet/encoding.cc
+++ b/cpp/src/parquet/encoding.cc
@@ -348,8 +348,8 @@ class PlainEncoder<BooleanType> : public EncoderImpl, virtual public BooleanEnco
     } else {
       auto n_valid = bit_util::BytesForBits(data.length() - data.null_count());
       PARQUET_THROW_NOT_OK(sink_.Reserve(n_valid));
-      ::arrow::internal::FirstTimeBitmapWriter writer(sink_.mutable_data(),
-                                                      sink_.length(), n_valid);
+      ::arrow::internal::FirstTimeBitmapWriter<> writer(sink_.mutable_data(),
+                                                        sink_.length(), n_valid);

       for (int64_t i = 0; i < data.length(); i++) {
         if (data.IsValid(i)) {
diff --git a/cpp/src/parquet/level_comparison_avx2.cc b/cpp/src/parquet/level_comparison_avx2.cc
index b33eb2e295..521cf96520 100644
--- a/cpp/src/parquet/level_comparison_avx2.cc
+++ b/cpp/src/parquet/level_comparison_avx2.cc
@@ -16,7 +16,9 @@
 // under the License.

 #define PARQUET_IMPL_NAMESPACE avx2
+#define PARQUET_DISPATCH_LEVEL ::arrow::internal::DispatchLevel::AVX2
 #include "parquet/level_comparison_inc.h"
+#undef PARQUET_DISPATCH_LEVEL
 #undef PARQUET_IMPL_NAMESPACE

 namespace parquet {
diff --git a/cpp/src/parquet/level_conversion.cc b/cpp/src/parquet/level_conversion.cc
index ffdca476dd..ab440af95a 100644
--- a/cpp/src/parquet/level_conversion.cc
+++ b/cpp/src/parquet/level_conversion.cc
@@ -28,7 +28,9 @@

 #include "parquet/level_comparison.h"
 #define PARQUET_IMPL_NAMESPACE standard
+#define PARQUET_DISPATCH_LEVEL ARROW_COMPILE_TIME_DISPATCH_LEVEL
 #include "parquet/level_conversion_inc.h"
+#undef PARQUET_DISPATCH_LEVEL
 #undef PARQUET_IMPL_NAMESPACE

 namespace parquet {
@@ -43,7 +45,7 @@ void DefRepLevelsToListInfo(const int16_t* def_levels, const int16_t* rep_levels
                             int64_t num_def_levels, LevelInfo level_info,
                             ValidityBitmapInputOutput* output, OffsetType* offsets) {
   OffsetType* orig_pos = offsets;
-  optional<::arrow::internal::FirstTimeBitmapWriter> valid_bits_writer;
+  optional<::arrow::internal::FirstTimeBitmapWriter<>> valid_bits_writer;
   if (output->valid_bits) {
     valid_bits_writer.emplace(output->valid_bits, output->valid_bits_offset,
                               output->values_read_upper_bound);
diff --git a/cpp/src/parquet/level_conversion_bmi2.cc b/cpp/src/parquet/level_conversion_bmi2.cc
index 274d54e503..679d01d0c9 100644
--- a/cpp/src/parquet/level_conversion_bmi2.cc
+++ b/cpp/src/parquet/level_conversion_bmi2.cc
@@ -17,7 +17,9 @@
 #include "parquet/level_conversion.h"

 #define PARQUET_IMPL_NAMESPACE bmi2
+#define PARQUET_DISPATCH_LEVEL ::arrow::internal::DispatchLevel::BMI2
 #include "parquet/level_conversion_inc.h"
+#undef PARQUET_DISPATCH_LEVEL
 #undef PARQUET_IMPL_NAMESPACE

 namespace parquet {
diff --git a/cpp/src/parquet/level_conversion_inc.h b/cpp/src/parquet/level_conversion_inc.h
index 710d2f6237..4b5a9def80 100644
--- a/cpp/src/parquet/level_conversion_inc.h
+++ b/cpp/src/parquet/level_conversion_inc.h
@@ -296,7 +296,10 @@ static constexpr int64_t kExtractBitsSize = 8 * sizeof(extract_bitmap_t);
 template <bool has_repeated_parent>
 int64_t DefLevelsBatchToBitmap(const int16_t* def_levels, const int64_t batch_size,
                                int64_t upper_bound_remaining, LevelInfo level_info,
-                               ::arrow::internal::FirstTimeBitmapWriter* writer) {
+#ifndef PARQUET_DISPATCH_LEVEL
+#error "PARQUET_DISPATCH_LEVEL must be defined"
+#endif
+                               ::arrow::internal::FirstTimeBitmapWriter<PARQUET_DISPATCH_LEVEL>* writer) {
   DCHECK_LE(batch_size, kExtractBitsSize);

   // Greater than level_info.def_level - 1 implies >= the def_level
@@ -330,7 +333,7 @@ int64_t DefLevelsBatchToBitmap(const int16_t* def_levels, const int64_t batch_si
 template <bool has_repeated_parent>
 void DefLevelsToBitmapSimd(const int16_t* def_levels, int64_t num_def_levels,
                            LevelInfo level_info, ValidityBitmapInputOutput* output) {
-  ::arrow::internal::FirstTimeBitmapWriter writer(
+  ::arrow::internal::FirstTimeBitmapWriter<PARQUET_DISPATCH_LEVEL> writer(
       output->valid_bits,
       /*start_offset=*/output->valid_bits_offset,
       /*length=*/output->values_read_upper_bound);
asfimport commented 2 years ago

Antoine Pitrou / @pitrou: @kou We can do that for the specific symptoms here. However, a more general solution will have to be found since other files have the same problem: compiling SIMD-specific code which calls into other routines.

asfimport commented 2 years ago

Ben Kietzman / @bkietz: @pitrou The most robust solution I can think of is to avoid linking between objects with differing instruction sets altogether. We'd have something like


$ nm libarrow_compute_avx2.so | grep DefLevelsBitmapSimd
0000000000404ff0 t DefLevelsBitmapSimd

That library would be acquired with dlopen(path, RTLD_LOCAL)/LoadLibrary(path) which would guarantee that any functions like FirstTimeBitmapWriter::* which might have been recompiled with illegal instructions are not available outside libarrow_compute_avx2.so.

asfimport commented 2 years ago

Ben Kietzman / @bkietz: IIUC, we'll still need to pass -mavx2 so that we can include immintrin.h so the attribute described in the ARROW_SPECIALIZED_SIMD_TARGET approach would need to be attached to the {}non{}-SIMD functions to ensure that they're compiled with no special instructions

... or I suppose we could try to declare all the intrinsics manually at function scope


ARROW_SIMD_FUNCTION(avx2) void SimdThing() {
  // inlined from immintrin.h:
  typedef unsigned short __mmask16;
  extern
    __inline
    __mmask16
    __attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
        _mm512_int2mask (int __M);
}
asfimport commented 2 years ago

Ben Kietzman / @bkietz: On further investigation, we can include immintrin.h with or without -mavx2 and clang at least will not complain unless the intrinsics are referenced, so


#include <immintrin.h>

[[gnu::target("avx2")]]
void use_simd() {
  __m256i arg;
  _mm256_abs_epi16 (arg);
}

int main() { use_simd(); }

compiles and runs happily without any special compilation flags. Using an attribute like this seems viable provided we can be certain that the modified target isn't transitively applied to functions which might be invoked for the first time inside a SIMD enabled function

asfimport commented 2 years ago

Ian Cook / @ianmcook: @jonkeane  this issue is marked as a blocker for 9.0.0. Should this block the release?

asfimport commented 2 years ago

Jonathan Keane / @jonkeane: Last I checked, the homebrew maintainers have said that they will disable all optimization for arrow if we don't get this sorted on our own. So not required if we're ok with that (though we should engage with them on this)

asfimport commented 2 years ago

Antoine Pitrou / @pitrou: Disabling all optimizations for Arrow is brutal. It would be much better to simply disable runtime SIMD optimizations (by passing -DARROW_RUNTIME_SIMD_LEVEL=NONE to CMake, AFAIR).

asfimport commented 2 years ago

Raúl Cumplido / @raulcd: Is this still a blocker?

asfimport commented 2 years ago

Antoine Pitrou / @pitrou: Ideally it would... But there's little chance for it to be fixed in time for 9.0.0.

As I said above, the workaround should be to disable runtime SIMD optimizations on the affected builds. Somehow has to validate that suggestion, though (i.e. someone who's able to reproduce this issue).

asfimport commented 2 years ago

Jacob Wujciak / @assignUser: Looking at [ARROW-15664] and this PR it seems like a workaround has been implemented for homebrew IIUC, so this is still an issue but as the real fix wont happen for 9.0.0 it shouldn't be a blocker anymore?

asfimport commented 2 years ago

Antoine Pitrou / @pitrou: If that was actually accepted by Homebrew then fine.

asfimport commented 2 years ago

Jacob Wujciak / @assignUser: That was my impression: issue and PR in homebrew-core. Maybe @jonkeane can confirm?

asfimport commented 2 years ago

Jonathan Keane / @jonkeane: Homebrew only accepted that as a temporary workaround and has threatened to turn off optimizations if we don't resolve this. They haven't yet followed through yet, though. https://github.com/Homebrew/homebrew-core/issues/94724#issuecomment-1063031123

asfimport commented 2 years ago

Antoine Pitrou / @pitrou: Note that I suggested a perhaps more acceptable workaround above.

asfimport commented 2 years ago

Krisztian Szucs / @kszucs: @jonkeane can you give an update on this issue?

asfimport commented 2 years ago

Krisztian Szucs / @kszucs: Postponing to 10.0 for now.

asfimport commented 2 years ago

Jonathan Keane / @jonkeane: I have no updates beyond what's discussed above: there are a few approaches, none of them ideal, we need someone to champion this (or risk the homebrew maintainers turning off optimizations on us)

asfimport commented 2 years ago

Raúl Cumplido / @raulcd: This was a blocker for the last release and is still a blocker for the 10.0.0 release. @jonkeane do you know if there has been any move?

asfimport commented 2 years ago

Jonathan Keane / @jonkeane: I thought that @kou was going to take a look at this (or at least the underlying multiple SIMD instruction ordering issue that causes the failures...)

The only update I have is that I continue to run into the segfault in CI for downstream projects I'm working on, so it continues to be an issue for pre-built libarrow on machines like github's macos runners.

asfimport commented 2 years ago

Kouhei Sutou / @kou: Yes. I will fix this before the 10.0.0 release. Sorry for not working on this yet.

asfimport commented 2 years ago

Kouhei Sutou / @kou: Summary of this problem:

Problem:

  1. Force to inline functions that are called from the code that are compiled with SIMD related optimization flags
  2. Restrict SIMD related optimization area to only the target function

    For 1., we have two approaches for it:

    This approach seems ad-hoc. We need to apply this approach to called functions when we find this problem in other codes.

    For 2., we have one approach for it:

    (I'm not sure that this approach is portable. For example, can we use this approach with MSVC?)

asfimport commented 2 years ago

Kouhei Sutou / @kou: I propose one more approach for the proposed solution 1.:

How about always enabling inline optimization for SIMD optimized compile units (level_conversion_bmi2.cc) even when an user specifies -DCMAKE_BUILD_TYPE=MinSizeRel?

It may increases binary size but it may be better that SIMD related code prioritizes performance than binary size.

We don't need to write manual template/\_\_attribute\_\_((always\_inline)) s with this approach.

asfimport commented 2 years ago

Antoine Pitrou / @pitrou: Well, I don't think we can force the compiler to inline everything.

asfimport commented 2 years ago

Kouhei Sutou / @kou: We don't have this problem with -DCMAKE_BUILD_TYPE=Release. So it may work with most cases.

asfimport commented 2 years ago

Kouhei Sutou / @kou: Issue resolved by pull request 14342 https://github.com/apache/arrow/pull/14342