majetideepak opened 2 years ago
@mbasmanova @Yuhta @xiaoxmeng Since you are seeing most of these issues in your internal tests, can you share your thoughts?
So far I don't see these 2 limitations as showstoppers. For 1 we just need to use the new mul defined in DecimalUtil.h. For 2 we need to be careful and annotate the alignment (`__attribute__ ((aligned (16)))` or `alignas(16)`) wherever needed.
My preference would be to switch to 2 64-bit integers, but I'm fine continuing with int128 for some more time to see if things get better.
FYI I probably found the root cause of 1. It's the Meta internal system that does not link clang compiler_rt by default. Once I add this, no more annotation will be needed for this purpose.
Bad news: due to some conflicts with Rust code, we probably still need to keep the annotations. I am going to give it a final try to fix our clang next week; if that fails, we will still need the annotations.
Just an update on this: I got an internal point of contact who will be fixing the Rust build system so that we can link compiler builtins properly. I will put in a placeholder/slow implementation this week in CheckedArithmetic.h to unblock the PRs for now (the annotation is still needed for now). Once everything is sorted out, no annotations will be needed and we will use `__builtin_mul_overflow` with `__int128` in CheckedArithmetic.h.
Issue 1 is solved. There is no longer a need to exclude the sanitizer. For 2 we should experiment with annotating UnscaledLongDecimal and see if the alignment can be transitively specified.
We are still seeing segfaults in the CentOS Stream 8 release build, which seem to be due to unaligned access. See PRs https://github.com/facebookincubator/velox/pull/3755 and https://github.com/facebookincubator/velox/pull/4129/file However, a simple reproducer with unaligned access seems to work fine on this machine:
```cpp
// calloc (rather than malloc) zero-initializes the buffer, so the two
// bytes set below are the only nonzero bytes in each 16-byte read.
char* buf1 = reinterpret_cast<char*>(calloc(160, 1));
buf1[1] = 1;
buf1[19] = 2;
__int128_t* val1 = reinterpret_cast<__int128_t*>(buf1 + 1);
__int128_t* val2 = reinterpret_cast<__int128_t*>(buf1 + 19);
__int128_t result = val1[0] + val2[0];
if (result == 3) {
  std::cout << "Success" << std::endl;
}
```
I will investigate this further.
@majetideepak Try to reproduce it within a loop. Usually the alignment problem only shows up when the code is vectorized.
Will you consider using two int64 to express UnscaledLongDecimal in the future?
> Will you consider using two int64 to express UnscaledLongDecimal in the future?
That's a good question. @majetideepak Deepak, looks like we continue seeing issues with int128_t. Should we switch to a struct with 2 64-bit integers?
@mbasmanova My recommendation would be to continue to use the int128_t type to get the best possible platform-specific implementations. int128_t types have been supported for a while now by GCC and Clang, so they should be available in most user settings given Velox also requires C++17 support. The only issue we see so far with int128_t is the alignment requirement for the corresponding SIMD instructions generated by GCC. If we can figure out a way for the compiler to generate unaligned instructions, we should be good. I am looking into this further now.
@liujiayi771 is there any reason to consider using two int64 values for UnscaledLongDecimal?
> The only issue we see so far with int128_t is the alignment requirement for the corresponding SIMD instructions.
PR #3755 suggests that we are not able to reinterpret_cast char* to int128_t* even when memory is properly aligned. I guess we need to dig in more to understand the root cause.
`reinterpret_cast` is only a compiler directive, so there is no instruction associated with it. We also see these crashes only with GCC.
At O3, the generated code is highly optimized, and the only way forward is to look at the generated assembly. I plan to do this this week and will post updates here. A simple reproducer would help a lot too, but I don't have one yet.
I tested the average aggregation of decimals and got a core dump in DecimalAggregate::initializeNewGroups. I think the core dump happens when constructing LongDecimalWithOverflowState. I referred to this PR and modified the code accordingly, and constructing LongDecimalWithOverflowState no longer errors.
The assembly instruction here is `vmovaps`, and this SO answer describes why its operand needs to be aligned to 16 bytes:
https://stackoverflow.com/questions/67243284/why-movaps-causes-segmentation-fault

```
0x21f9d02 <_ZN8facebook5velox9aggregate16DecimalAggregateINS0_20UnscaledShortDecimalES3_E19initializeNewGroupsEPPcN5folly5RangeIPKiEE+98> vmovaps %xmm0,(%r11)
```
The fix would be one of the following:
1) Ensure we write code such that all loads are aligned to Aggregate::accumulatorAlignmentSize().
2) Prevent GCC from generating vectorized loads that require alignment for int128_t types.
(2) can be achieved by any of the following approaches wherever int128_t values are being loaded:

```cpp
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
__attribute__((target("no-sse")))
#endif
```

```cpp
#pragma GCC push_options
#pragma GCC optimize("O0")
// code
#pragma GCC pop_options
```

or use `volatile`.
I feel (1) is hard to achieve. (2) seems simpler and specific to int128_t types. I will work towards adding support for (2) if there are no objections, and will look for better compiler options to avoid `vmovaps`.
@majetideepak We are likely to take a performance hit for (2). This disables not only `vmovaps` but essentially all SIMD instructions (the `O0` one is even worse than that). I would suggest trying (1) first for performance reasons. If (1) turns out not to be possible, I think @isadikov's approach in #4129 is still better than (2).
@Yuhta The trade-off here would be space wasted due to alignment (1) vs. less vectorization (2), so (1) would not necessarily be more performant. `O0` is definitely not desirable; I quickly checked whether it helps to avoid `vmovaps`, and it does.
The approach in #4129 solves this issue by using two `int64_t` instead of `int128_t` inside LongDecimalWithOverflowState. This approach carries the buildInt128(), UPPER(), and LOWER() overheads. It will also not help other uses of `int128_t` (UnscaledLongDecimal). Since we cannot predict where the compiler will introduce 16-byte-aligned instructions for `int128_t` types, from an API standpoint I want the implementation to pick either `int128_t` or two `int64_t` for all long decimal types. As mentioned above, I will check for other options to avoid 16-byte-aligned load instructions.
Maybe you can check with Orri or others internally at Meta if there is a less intrusive way.
When an `int128_t` is allocated by the compiler, it is guaranteed to be aligned properly. Alignment is a problem only when we do deserialization, and it's not specific to `int128_t`; any unaligned data access can potentially have this problem. That's why I prefer to fix accumulatorAlignmentSize instead of finding a workaround just for `int128_t`.
Unaligned data access is only a problem if the compiler assumes unaligned memory to be aligned and then generates instructions that require aligned memory. So far, we have not seen these segfaults for other data types, so I believe this is specific to the `int128_t` type.
I have seen it when we use something like sfmt19937. Actually, I have some idea of how to fix the alignment problem; let me try something quickly and see if it works.
@majetideepak Deepak, thank you for looking into this. Just for my own understanding, are you thinking of switching physical representation of long decimal from int128_t to a struct of two int64_t?
@majetideepak The fix for Aggregate::accumulatorAlignmentSize() is at #4268; can you check if this fixes the crash with the decimal accumulator?
> are you thinking of switching physical representation of long decimal from int128_t to a struct of two int64_t?
@mbasmanova, in my understanding, GCC assumes that int128_t addresses are always 16-byte aligned and generates instructions that require this alignment, e.g. `vmovaps` as described above.
Because of this, any pointer arithmetic in the Velox code on int128_t types must align the addresses to 16 bytes. @Yuhta has a fix for the accumulator code in https://github.com/facebookincubator/velox/pull/4268, but we have to ensure future features also honor this.
The other option is to make GCC not generate instructions that require this alignment; we would have to look at some GCC attributes to do this with minimum disruption, essentially telling GCC that int128_t addresses are not 16-byte aligned. I looked into some options but could not find a good one yet.
The last, and least preferred, option in my opinion is to use two `int64_t` and essentially create an int128_t library.
> The fix for Aggregate::accumulatorAlignmentSize() is at https://github.com/facebookincubator/velox/pull/4268, can you check if this fixes the crash with decimal accumulator?
@Yuhta, I commented in that PR.
@majetideepak If you can make the compiler generate `vmovups` instead of `vmovaps`, that would also be nice.
Also, even if we have to give up `int128_t` in serialization, I would suggest using `char[16]` instead of `int64_t[2]` to replace it. This will work even if it is not aligned at 8 bytes, makes it clearer that it is a serialized form, and we can load it into a register using something like `vmovups`.
@majetideepak Deepak, thank you for clarifying.
In a hypothetical scenario of int128_t serialisation not working, I would recommend going with int64_t[2] - we already have all of the code and infra ready to work with it. I think if int64_t alignment does not work on a particular system, then one would have much bigger problems than just decimal support, probably half of the code would not work.
@isadikov 8-byte alignment is probably working in the current places where we do in-memory serialization, so that's probably not a concern. There is no essential difference at runtime though; it's a memory region of 16 bytes, and we need to load it into one single register for any further processing. What we don't want is the temptation to load it into 2 registers and treat them as 2 `int64_t`; that's the main rationale behind my suggestion.
> we already have all of the code and infra ready to work with it.
@isadikov If you are referring to the approach in https://github.com/facebookincubator/velox/pull/4129, then we will incur the buildInt128(), UPPER(), and LOWER() overheads. `memcpy` into an int128_t variable and `memcpy` back is more efficient, I guess. This is probably what @Yuhta is referring to.
We still hit this issue when we convert Arrow Decimal to Velox: the buffer may not be 16-byte aligned. Arrow uses 2 int64_t. Is there any reason we can't use that? If we define

```cpp
struct Decimal { uint64_t a; uint64_t b; };
Decimal A, B;
A = B;
```

GCC will still use a SIMD instruction but switch to `movdqu`. On current processors, `movdqu` and `movdqa` have the same performance if the address is 16-byte aligned.
We need to fix the issue; it's really hard to debug when a core dump is caused by this.
@FelixYBW The load is not a problem (we may need explicit intrinsic calls though). It's when you do arithmetic on it that you lose the SIMD (or some shortcut in hardware that's not really SIMD). We can fix the alignment in the Arrow conversion. Do you have some sample data?
Do you mean that if we use 2x int64_t, GCC can't generate SIMD instructions for arithmetic, so we may have to use explicit intrinsics? If so, let's keep the current 128-bit type.
Yes, we hit the issue in Gluten. There are two ways to fix the alignment during the Arrow conversion: one is to copy the value to an aligned buffer; the other is to make sure the buffer from Arrow is always 16-byte aligned. We are trying the second solution now.
@FelixYBW The second way would be optimal. All Arrow buffers are aligned at 64 bytes, so this happens only if you are storing other types of data in the same buffer; in that case you need some padding. We can also add a check in the conversion to make sure the alignment is enforced.
The current Velox Long Decimal type uses `int128_t`. However, we are seeing a couple of portability issues around int128 types. Some of them are: 1) Address sanitizer does not support int128_t well. 2) int128_t requires alignment on x86_64. I want to discuss these issues further and would like to conclude whether it is meaningful to continue using this type for development or to switch to 2 int64_t values.