rocallahan opened this issue 4 months ago
> Do you have any evidence this causes an issue?
No, not for normal builds. If you build KLayout with UBSAN it will fail here. And there is always a possibility that some compiler, at some optimization level, will perform an optimization that miscompiles this code.
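For concreteness, here is a minimal standalone reproduction of the kind of misaligned access UBSAN reports; this is my own example, not KLayout code:

```cpp
#include <cstdint>
#include <cstdio>

int main()
{
  alignas(8) char buf[16] = { 0 };
  // buf + 1 is not suitably aligned for uint64_t, so this load is UB;
  // UBSAN's alignment check reports "load of misaligned address" here.
  const std::uint64_t *p = reinterpret_cast<const std::uint64_t *>(buf + 1);
  std::printf("%llu\n", static_cast<unsigned long long>(*p));
  return 0;
}
```

Build with `clang++ -fsanitize=undefined` (or just `-fsanitize=alignment`) and run it to see the diagnostic.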
But why not just use `memcpy`? That's certainly simpler and probably faster.
It used to be `memcpy` initially, but KLayout builds on a lot of platforms, and on some of them `memcpy` was significantly slower. I just don't recall which. This decision is ages old.

But I don't agree with the alignment issue. Except for some outdated platforms like the first ARMs or SPARC, no modern architecture has trouble with misaligned memory access. If that were the case, a lot more code wouldn't work.
> It used to be `memcpy` initially, but KLayout builds on a lot of platforms, and on some of them `memcpy` was significantly slower. I just don't recall which. This decision is ages old.
If I gather some performance data that shows `memcpy` beating `fast_copy` across a range of CPUs, would you consider switching?
> But I don't agree with the alignment issue. Except for some outdated platforms like the first ARMs or SPARC, no modern architecture has trouble with misaligned memory access. If that were the case, a lot more code wouldn't work.
Yes, modern CPUs handle unaligned accesses (though there is often a performance penalty). But in C and C++, undefined behavior is inherently dangerous because over time compilers rely more and more on the assumption that undefined behavior is never triggered. See *"Working" code that uses undefined behavior can "break" as the compiler evolves or changes* for some background info.
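One concrete way this can bite (an illustrative sketch, not something observed in KLayout): when a compiler auto-vectorizes a word-at-a-time loop, it is entitled to use aligned SIMD loads, because every dereference of an `unsigned long *` already promises suitable alignment:

```cpp
#include <cstddef>

// Illustrative only: an optimizer may vectorize this loop with aligned
// vector loads (e.g. movaps on x86), since each dereference of p promises
// that p is aligned for unsigned long.
unsigned long sum_words(const unsigned long *p, std::size_t n)
{
  unsigned long s = 0;
  for (std::size_t i = 0; i < n; ++i) {
    s += p[i];
  }
  return s;
}
```

If a caller manufactures `p` by `reinterpret_cast` from a misaligned `char *`, an aligned SIMD load can fault at runtime: exactly the "working code breaks later" hazard described above.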
I think you can save the effort.
I tested `memcpy` vs. `fast_copy` on gcc 11.4 under Ubuntu 22, and if there is any difference in performance, it is less than a few percent, even in the case of misalignment. Maybe the slow case was MSVC (as so often), but I don't want to try every combination just to settle this discussion.

If misalignment became an issue, the first victim would be this guy: https://github.com/KLayout/klayout/blob/master/src/plugins/streamers/gds2/db_plugin/dbGDS2Reader.cc. Here, misaligned reads of 32-bit values happen all the time. It is simply convenient and fast. I'd be more worried about the endianness issue, but the relevant architectures I target are little endian.
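(The read pattern being described looks roughly like the first function below; this is an illustrative sketch, not the actual `dbGDS2Reader.cc` code. The second function is the well-defined `memcpy` form, shown for comparison; modern compilers turn it into the same single unaligned load on x86 and ARM64.)

```cpp
#include <cstdint>
#include <cstring>

// Raw-cast form: formally UB when p is misaligned, though it "works" in
// practice on the little-endian targets mentioned above.
inline std::uint32_t read_u32_cast(const unsigned char *p)
{
  return *reinterpret_cast<const std::uint32_t *>(p);
}

// Well-defined equivalent: the memcpy compiles down to one unaligned load.
inline std::uint32_t read_u32_memcpy(const unsigned char *p)
{
  std::uint32_t v;
  std::memcpy(&v, p, sizeof v);
  return v;
}
```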
Matthias
> I tested `memcpy` vs. `fast_copy` on gcc 11.4 under Ubuntu 22, and if there is any difference in performance, it is less than a few percent, even in the case of misalignment.
`InputStream::copy_to` copies in 64K chunks, so I looked at the 64K case with this benchmark:
https://gist.github.com/rocallahan/bf2d3191e4b5814b12dee9b0569aba3e

I tried to bias the benchmark in favor of `fast_copy` by coding it to give the compiler as much information as possible about alignment, buffer size, and the function called, so that it can inline and optimize `fast_copy` as much as possible, whereas the `memcpy` case is always a non-inlined call to the glibc function. (I verified that by checking the assembly.) There's an "everything aligned" case and a "bad alignment" case.
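For readers who don't want to open the gist, the measurement has roughly this shape; this is my own sketch under my own assumptions, not the gist's actual code, and it shows only the `memcpy` side of the comparison:

```cpp
#include <chrono>
#include <cstdio>
#include <cstring>
#include <vector>

int main()
{
  const std::size_t buffer_size = 65536; // matches the 64K chunk size above
  const int iterations = 100000;

  std::vector<char> src(buffer_size + 8, 1), dst(buffer_size + 8, 0);
  // Offsets 1 and 3 reproduce the "bad alignment" case.
  const char *s = src.data() + 1;
  char *d = dst.data() + 3;

  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < iterations; ++i) {
    std::memcpy(d, s, buffer_size);
  }
  auto stop = std::chrono::steady_clock::now();

  long long ns =
      std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
  // Read back a byte of dst so the copies cannot be optimized away entirely.
  std::printf("memcpy src_offset=1 dst_offset=3: %lldns per copy (check=%d)\n",
              ns / iterations, dst[3]);
  return 0;
}
```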
I ran this on my laptop. The CPU is "11th Gen Intel(R) Core(TM) i7-1185G7", i.e. Intel Tiger Lake. I used Debian-packaged clang 16 (`Debian clang version 16.0.6 (26)`), gcc 13.2 (`g++ (Debian 13.2.0-13) 13.2.0`), and glibc 2.38-7. Results:
```
rocallahan@rocallahan:~$ clang++ -g -O2 -o /tmp/memcpy-bench /tmp/memcpy-bench.cc && /tmp/memcpy-bench
memcpy src_offset=0 dst_offset=0 buffer_size=65536: 3037ns per copy
fast_copy src_offset=0 dst_offset=0 buffer_size=65536: 3303ns per copy
memcpy src_offset=1 dst_offset=3 buffer_size=65536: 3098ns per copy
fast_copy src_offset=1 dst_offset=3 buffer_size=65536: 4723ns per copy
rocallahan@rocallahan:~$ g++ -g -O2 -o /tmp/memcpy-bench /tmp/memcpy-bench.cc && /tmp/memcpy-bench
memcpy src_offset=0 dst_offset=0 buffer_size=65536: 3111ns per copy
fast_copy src_offset=0 dst_offset=0 buffer_size=65536: 5780ns per copy
memcpy src_offset=1 dst_offset=3 buffer_size=65536: 3101ns per copy
fast_copy src_offset=1 dst_offset=3 buffer_size=65536: 7533ns per copy
```
There is some noise from run to run but not enough to affect the conclusions. clang does a significantly better job than gcc of optimizing `fast_copy`, but glibc's `memcpy` always wins and, except for clang + everything aligned, it wins by a lot. (In the clang + everything aligned case, clang manages to vectorize `fast_copy`.)
In `tlStream.cc`, `fast_copy` copies word by word through pointers `tl` and `sl` that are obtained by casting raw `char` pointers to a word-sized integer pointer type, and there's no guarantee that `tl` and `sl` are correctly aligned. This is undefined behavior in C and C++. Why didn't you just use `memcpy` here?
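For reference, the pattern at issue looks roughly like this; this is a minimal sketch reconstructed from the discussion, not the verbatim `tlStream.cc` code:

```cpp
#include <cstddef>

// Sketch of the word-at-a-time copy pattern under discussion (reconstructed,
// not verbatim KLayout code). The reinterpret_casts are the problem:
// dereferencing tl or sl is undefined behavior when t or s is not suitably
// aligned for unsigned long.
void fast_copy(char *t, const char *s, std::size_t n)
{
  unsigned long *tl = reinterpret_cast<unsigned long *>(t);
  const unsigned long *sl = reinterpret_cast<const unsigned long *>(s);
  while (n >= sizeof(unsigned long)) {
    *tl++ = *sl++; // word-sized load/store; assumes alignment
    n -= sizeof(unsigned long);
  }
  t = reinterpret_cast<char *>(tl);
  s = reinterpret_cast<const char *>(sl);
  while (n-- > 0) {
    *t++ = *s++; // byte-wise tail
  }
}
```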