majetideepak opened 2 years ago
@mbasmanova @Yuhta @xiaoxmeng Since you are seeing most of these issues in your internal tests, can you share your thoughts?
So far I don't see these 2 limitations as showstoppers. For 1 we just need to use the new mul defined in DecimalUtil.h. For 2 we need to be careful and annotate the alignment (`__attribute__ ((aligned (16)))` or `alignas(16)`) wherever needed.
My preference would be to switch to 2 64-bit integers, but I'm fine continuing with int128 for some more time to see if things get better.
FYI I probably found the root cause of 1. It's the Meta internal system that does not link clang compiler_rt by default. Once I add this, no more annotation will be needed for this purpose.
Bad news: due to some conflicts with Rust code, we probably still need to keep the annotations. I am going to give it a final try to fix our clang next week; if that fails, we will still need the annotations.
Just an update on this: I got an internal point of contact who will be fixing the Rust build system so that we can link compiler builtins properly. I will put in a placeholder/slow implementation this week in CheckedArithmetic.h to unblock the PRs for now (the annotation is still needed for now). Once everything is sorted out, no annotations will be needed and we will use `__builtin_mul_overflow` with `__int128` in CheckedArithmetic.h.
Issue 1 is solved. There is no longer a need to exclude the sanitizer. For 2 we should experiment with annotating UnscaledLongDecimal and see if the alignment can be transitively specified.
We are still seeing segfaults in the CentOS Stream 8 release build, which seem to be due to unaligned access. See PRs https://github.com/facebookincubator/velox/pull/3755 and https://github.com/facebookincubator/velox/pull/4129/file However, a simple reproducer with unaligned access seems to work fine on this machine:
```cpp
// calloc (rather than malloc) zero-initializes the buffer, so the two
// bytes set below are the only nonzero bytes in each 16-byte read.
char* buf1 = reinterpret_cast<char*>(calloc(160, 1));
buf1[1] = 1;
buf1[19] = 2;
__int128_t* val1 = reinterpret_cast<__int128_t*>(buf1 + 1);
__int128_t* val2 = reinterpret_cast<__int128_t*>(buf1 + 19);
__int128_t result = val1[0] + val2[0];
if (result == 3) {
  std::cout << "Success" << std::endl;
}
```
I will investigate this further.
@majetideepak Try to reproduce it within a loop. Usually the alignment problem only shows up when the code is vectorized.
Will you consider using two int64 to express UnscaledLongDecimal in the future?
> Will you consider using two int64 to express UnscaledLongDecimal in the future?
That's a good question. @majetideepak Deepak, looks like we continue seeing issues with int128_t. Should we switch to a struct with 2 64-bit integers?
@mbasmanova My recommendation would be to continue to use the int128_t type to get the best possible platform-specific implementations. int128_t types have been supported for a while now by GCC and Clang, so they should be available in most user settings given Velox also requires C++17 support. The only issue we see so far with int128_t is the alignment requirement for the corresponding SIMD instructions generated by GCC. If we can figure out a way for the compiler to generate unaligned instructions, we should be good. I am looking into this further now.
@liujiayi771 is there any reason to consider using two int64 values for UnscaledLongDecimal?
> The only issue we see so far with int128_t is the alignment requirement for the corresponding SIMD instructions.
PR #3755 suggests that we are not able to reinterpret_cast char* to int128_t* even when memory is properly aligned. I guess we need to dig in more to understand the root cause.
`reinterpret_cast` is only a compiler directive, so there is no instruction associated with it. We also see these crashes only with GCC.
At O3, the generated code is highly optimized, and the only way forward is to look at the generated assembly. I plan to do this this week and will post updates here. A simple reproducer would help a lot too, but I don't have one yet.
I tested the average aggregation of decimals and got a core dump in DecimalAggregate::initializeNewGroups. I think the core dump happens when constructing LongDecimalWithOverflowState. I referred to this PR and modified the code accordingly, and constructing LongDecimalWithOverflowState no longer errors.
The assembly instruction here is `vmovaps`, and this SO answer describes why its operand needs to be aligned to 16 bytes:
https://stackoverflow.com/questions/67243284/why-movaps-causes-segmentation-fault

```
0x21f9d02 <_ZN8facebook5velox9aggregate16DecimalAggregateINS0_20UnscaledShortDecimalES3_E19initializeNewGroupsEPPcN5folly5RangeIPKiEE+98> vmovaps %xmm0,(%r11)
```
The fix would be one of the following:
1) Ensure we write code such that all loads are aligned to Aggregate::accumulatorAlignmentSize().
2) Prevent GCC from generating vectorized loads that require alignment for int128_t types.
(2) can be achieved by any of the following approaches wherever int128_t values are being loaded:

```cpp
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
__attribute__((target("no-sse")))
#endif
```

```cpp
#pragma GCC push_options
#pragma GCC optimize("O0")
// code
#pragma GCC pop_options
```

or use `volatile`.
I feel (1) is hard to achieve. (2) seems simpler and specific to int128_t types. I will work towards adding support for (2) if there are no objections, and will look for better compiler options to avoid `vmovaps`.
@majetideepak We are likely to take a performance hit for (2). This disables not only `vmovaps` but essentially all SIMD instructions (the `O0` one is even worse than that). I would suggest trying (1) first for performance reasons. If (1) turns out not to be possible, I think @isadikov's approach in #4129 is still better than (2).
@Yuhta The trade-off here would be space wasted due to alignment (1) vs. less vectorization (2), so (1) would not necessarily be more performant. `O0` is definitely not desirable; I quickly checked whether it helps to avoid `vmovaps`, and it does.
The approach in #4129 solves this issue by using two `int64_t` instead of `int128_t` inside LongDecimalWithOverflowState. This approach carries the buildInt128(), UPPER(), and LOWER() overheads. It will also not help other uses of `int128_t` (UnscaledLongDecimal). Since we cannot predict where the compiler will introduce 16-byte-aligned instructions for `int128_t` types, from an API standpoint I want the implementation to pick either `int128_t` or two `int64_t` for all long decimal types. As mentioned above, I will check for other options to avoid 16-byte-aligned load instructions.
Maybe you can check with Orri or others internally at Meta if there is a less intrusive way.
When an `int128_t` is allocated by the compiler, it is guaranteed to be aligned properly. Alignment is a problem only when we do deserialization, and it's not specific to `int128_t`; any unaligned data access can potentially have this problem. That's why I prefer to fix accumulatorAlignmentSize instead of finding a workaround just for `int128_t`.
Unaligned data access is only a problem if the compiler assumes unaligned memory to be aligned and then generates instructions that require aligned memory. So far, we have not seen these segfaults for other data types, so I believe this is specific to the `int128_t` type.
I have seen it when we use something like sfmt19937. Actually, I have some idea of how to fix the alignment problem; let me try something quickly and see if it works.
@majetideepak Deepak, thank you for looking into this. Just for my own understanding, are you thinking of switching physical representation of long decimal from int128_t to a struct of two int64_t?
@majetideepak The fix for Aggregate::accumulatorAlignmentSize() is at #4268; can you check if this fixes the crash with the decimal accumulator?
> are you thinking of switching physical representation of long decimal from int128_t to a struct of two int64_t?
@mbasmanova, in my understanding, GCC assumes that int128_t addresses are always 16-byte aligned and generates instructions that require this alignment, e.g. `vmovaps` as described above.
Because of this, any pointer arithmetic in the Velox code on int128_t types must align the addresses to 16 bytes. @Yuhta has a fix for the accumulator code in https://github.com/facebookincubator/velox/pull/4268, but we have to ensure future features also honor this.
The other option is to make GCC not generate instructions that require this alignment; we would have to look at some GCC attributes to do this with minimum disruption, essentially telling GCC that int128_t addresses are not 16-byte aligned. I looked into some options but could not find a good one yet.
The last, and least preferred, option in my opinion is to use two `int64_t` and essentially create an int128_t library.
> The fix for Aggregate::accumulatorAlignmentSize() is at https://github.com/facebookincubator/velox/pull/4268, can you check if this fixes the crash with decimal accumulator?
@Yuhta, I commented in that PR.
@majetideepak If you can make the compiler generate `vmovups` instead of `vmovaps`, that would also be nice.
Also, even if we have to give up `int128_t` in serialization, I would suggest using `char[16]` instead of `int64_t[2]` to replace it. This will work even if it is not aligned at 8 bytes, makes it clearer that it is a serialized form, and we can load it into a register using something like `vmovups`.
@majetideepak Deepak, thank you for clarifying.
In a hypothetical scenario of int128_t serialisation not working, I would recommend going with int64_t[2] - we already have all of the code and infra ready to work with it. I think if int64_t alignment does not work on a particular system, then one would have much bigger problems than just decimal support, probably half of the code would not work.
@isadikov 8-byte alignment is probably working in the current places where we do in-memory serialization, so that's probably not a concern. There is no essential difference at runtime though; it's a memory region of 16 bytes, and we need to load it into one single register for any further processing. What we don't want is the temptation to load it into 2 registers and treat them as 2 `int64_t`; that's the main rationale behind my suggestion.
> we already have all of the code and infra ready to work with it.
@isadikov If you are referring to the approach in https://github.com/facebookincubator/velox/pull/4129, then we will incur the buildInt128(), UPPER(), and LOWER() overheads. `memcpy` into an int128_t variable and `memcpy` back is more efficient, I guess. This is probably what @Yuhta is referring to.
We still hit this issue when we convert Arrow Decimal to Velox: the buffer may not be 16-byte aligned. Arrow uses 2 int64_t. Is there any reason we can't use that? If we define

```cpp
struct Decimal { uint64_t a; uint64_t b; };
Decimal A, B;
A = B;
```

GCC will still use a SIMD instruction but switch to `movdqu`. On current processors, `movdqu` and `movdqa` have the same performance if the address is 16-byte aligned.
We need to fix the issue; it's really hard to debug when a core dump is caused by this.
@FelixYBW The load is not a problem (we may need explicit intrinsic calls though). It's when you do arithmetic on it that you lose the SIMD (or some shortcut in hardware that's not really SIMD). We can fix the alignment in the Arrow conversion. Do you have some sample data?
Do you mean that if we use 2x int64_t, GCC can't generate SIMD instructions for arithmetic, so we may have to use explicit intrinsics? If so, let's keep the current 128-bit type.
Yes, we hit the issue in Gluten. There are two ways to fix the alignment during the Arrow conversion: one is to copy the value to an aligned buffer; the other is to make sure the buffer from Arrow is always 16-byte aligned. We are trying the second solution now.
@FelixYBW The second way would be optimal. All Arrow buffers are aligned at 64 bytes, so this happens only if you are storing other types of data in the same buffer; in that case you need some padding. We can also add a check in the conversion to make sure the alignment is enforced.
The current Velox Long Decimal type uses `int128_t`. However, we are seeing a couple of portability issues around int128 types. Some of them are: 1) Address sanitizer does not support int128_t well. 2) int128_t requires alignment on x86_64. I want to discuss these issues further and would like to conclude whether it is meaningful to continue using this type for development or to switch to 2 int64_t values.