rocallahan opened this issue 4 months ago
> Do you have any evidence this causes an issue?
No, not for normal builds. If you build KLayout with UBSAN it will fail here. And there is always a possibility that some compiler, at some optimization level, will perform an optimization that miscompiles this code.
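For concreteness, here is a minimal standalone reproduction of the kind of misaligned access UBSAN reports; this is my own example, not KLayout code:

```cpp
#include <cstdint>
#include <cstdio>

int main()
{
  alignas(8) char buf[16] = { 0 };
  // buf + 1 is not suitably aligned for uint64_t, so this load is UB;
  // UBSAN's alignment check reports "load of misaligned address" here.
  const std::uint64_t *p = reinterpret_cast<const std::uint64_t *>(buf + 1);
  std::printf("%llu\n", static_cast<unsigned long long>(*p));
  return 0;
}
```

Build with `clang++ -fsanitize=undefined` (or just `-fsanitize=alignment`) and run it to see the diagnostic.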
But why not just use `memcpy`? That's certainly simpler and probably faster.
It used to be `memcpy` initially, but KLayout builds on a lot of platforms, and on some of them `memcpy` was significantly slower. I just don't recall which. This decision is ages old.

But I don't agree with the alignment issue. Except for some outdated platforms like the first ARMs or SPARC, no modern architecture has trouble with misaligned memory access. If that were the case, a lot more code wouldn't work.
> It used to be `memcpy` initially, but KLayout builds on a lot of platforms, and on some of them `memcpy` was significantly slower. I just don't recall which. This decision is ages old.
If I gather some performance data that shows `memcpy` beating `fast_copy` across a range of CPUs, would you consider switching?
> But I don't agree with the alignment issue. Except for some outdated platforms like the first ARMs or SPARC, no modern architecture has trouble with misaligned memory access. If that were the case, a lot more code wouldn't work.
Yes, modern CPUs handle unaligned accesses (though there is often a performance penalty). But in C and C++, undefined behavior is inherently dangerous because over time compilers rely more and more on the assumption that undefined behavior is never triggered. See *"Working" code that uses undefined behavior can "break" as the compiler evolves or changes* for some background info.
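One concrete way this can bite (an illustrative sketch, not something observed in KLayout): when a compiler auto-vectorizes a word-at-a-time loop, it is entitled to use aligned SIMD loads, because every dereference of an `unsigned long *` already promises suitable alignment:

```cpp
#include <cstddef>

// Illustrative only: an optimizer may vectorize this loop with aligned
// vector loads (e.g. movaps on x86), since each dereference of p promises
// that p is aligned for unsigned long.
unsigned long sum_words(const unsigned long *p, std::size_t n)
{
  unsigned long s = 0;
  for (std::size_t i = 0; i < n; ++i) {
    s += p[i];
  }
  return s;
}
```

If a caller manufactures `p` by `reinterpret_cast` from a misaligned `char *`, an aligned SIMD load can fault at runtime: exactly the "working code breaks later" hazard described above.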
I think you can save the effort.
I tested `memcpy` vs. `fast_copy` on gcc 11.4 under Ubuntu 22, and if there is any difference in performance, it is less than a few percent, even in the case of misalignment. Maybe the slow case was MSVC (as so often), but I don't want to try every combination just to settle this discussion.

If misalignment became an issue, the first victim would be this guy: https://github.com/KLayout/klayout/blob/master/src/plugins/streamers/gds2/db_plugin/dbGDS2Reader.cc. Here, misaligned reads of 32-bit values happen all the time. It is simply convenient and fast. I'd be more worried about the endianness issue, but the relevant architectures I target are little endian.
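(The read pattern being described looks roughly like the first function below; this is an illustrative sketch, not the actual `dbGDS2Reader.cc` code. The second function is the well-defined `memcpy` form, shown for comparison; modern compilers turn it into the same single unaligned load on x86 and ARM64.)

```cpp
#include <cstdint>
#include <cstring>

// Raw-cast form: formally UB when p is misaligned, though it "works" in
// practice on the little-endian targets mentioned above.
inline std::uint32_t read_u32_cast(const unsigned char *p)
{
  return *reinterpret_cast<const std::uint32_t *>(p);
}

// Well-defined equivalent: the memcpy compiles down to one unaligned load.
inline std::uint32_t read_u32_memcpy(const unsigned char *p)
{
  std::uint32_t v;
  std::memcpy(&v, p, sizeof v);
  return v;
}
```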
Matthias
> I tested `memcpy` vs. `fast_copy` on gcc 11.4 under Ubuntu 22, and if there is any difference in performance, it is less than a few percent, even in the case of misalignment.
`InputStream::copy_to` copies in 64K chunks, so I looked at the 64K case with this benchmark:
https://gist.github.com/rocallahan/bf2d3191e4b5814b12dee9b0569aba3e

I tried to bias the benchmark in favor of `fast_copy` by coding it to give the compiler as much information as possible about alignment, buffer size, and the function called, so that it can inline and optimize `fast_copy` as much as possible, whereas the `memcpy` case is always a non-inlined call to the glibc function. (I verified that by checking the assembly.) There's an "everything aligned" case and a "bad alignment" case.
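For readers who don't want to open the gist, the measurement has roughly this shape; this is my own sketch under my own assumptions, not the gist's actual code, and it shows only the `memcpy` side of the comparison:

```cpp
#include <chrono>
#include <cstdio>
#include <cstring>
#include <vector>

int main()
{
  const std::size_t buffer_size = 65536; // matches the 64K chunk size above
  const int iterations = 100000;

  std::vector<char> src(buffer_size + 8, 1), dst(buffer_size + 8, 0);
  // Offsets 1 and 3 reproduce the "bad alignment" case.
  const char *s = src.data() + 1;
  char *d = dst.data() + 3;

  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < iterations; ++i) {
    std::memcpy(d, s, buffer_size);
  }
  auto stop = std::chrono::steady_clock::now();

  long long ns =
      std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
  // Read back a byte of dst so the copies cannot be optimized away entirely.
  std::printf("memcpy src_offset=1 dst_offset=3: %lldns per copy (check=%d)\n",
              ns / iterations, dst[3]);
  return 0;
}
```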
I ran this on my laptop. The CPU is "11th Gen Intel(R) Core(TM) i7-1185G7", i.e. Intel Tiger Lake. I used Debian-packaged clang 16 (`Debian clang version 16.0.6 (26)`), gcc 13.2 (`g++ (Debian 13.2.0-13) 13.2.0`), and glibc 2.38-7. Results:
```
rocallahan@rocallahan:~$ clang++ -g -O2 -o /tmp/memcpy-bench /tmp/memcpy-bench.cc && /tmp/memcpy-bench
memcpy src_offset=0 dst_offset=0 buffer_size=65536: 3037ns per copy
fast_copy src_offset=0 dst_offset=0 buffer_size=65536: 3303ns per copy
memcpy src_offset=1 dst_offset=3 buffer_size=65536: 3098ns per copy
fast_copy src_offset=1 dst_offset=3 buffer_size=65536: 4723ns per copy
rocallahan@rocallahan:~$ g++ -g -O2 -o /tmp/memcpy-bench /tmp/memcpy-bench.cc && /tmp/memcpy-bench
memcpy src_offset=0 dst_offset=0 buffer_size=65536: 3111ns per copy
fast_copy src_offset=0 dst_offset=0 buffer_size=65536: 5780ns per copy
memcpy src_offset=1 dst_offset=3 buffer_size=65536: 3101ns per copy
fast_copy src_offset=1 dst_offset=3 buffer_size=65536: 7533ns per copy
```
There is some noise from run to run but not enough to affect the conclusions. clang does a significantly better job than gcc of optimizing `fast_copy`, but glibc's `memcpy` always wins and, except for clang + everything aligned, it wins by a lot. (In the clang + everything aligned case, clang manages to vectorize `fast_copy`.)
In `tlStream.cc`, `fast_copy` copies word by word through pointers `tl` and `sl` that are obtained by casting raw `char` pointers to a word-sized integer pointer type, and there's no guarantee that `tl` and `sl` are correctly aligned. This is undefined behavior in C and C++. Why didn't you just use `memcpy` here?
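For reference, the pattern at issue looks roughly like this; this is a minimal sketch reconstructed from the discussion, not the verbatim `tlStream.cc` code:

```cpp
#include <cstddef>

// Sketch of the word-at-a-time copy pattern under discussion (reconstructed,
// not verbatim KLayout code). The reinterpret_casts are the problem:
// dereferencing tl or sl is undefined behavior when t or s is not suitably
// aligned for unsigned long.
void fast_copy(char *t, const char *s, std::size_t n)
{
  unsigned long *tl = reinterpret_cast<unsigned long *>(t);
  const unsigned long *sl = reinterpret_cast<const unsigned long *>(s);
  while (n >= sizeof(unsigned long)) {
    *tl++ = *sl++; // word-sized load/store; assumes alignment
    n -= sizeof(unsigned long);
  }
  t = reinterpret_cast<char *>(tl);
  s = reinterpret_cast<const char *>(sl);
  while (n-- > 0) {
    *t++ = *s++; // byte-wise tail
  }
}
```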