nemequ closed this issue 9 years ago.
I saw that this pointer alignment requirement is actually well hidden in the C reference... It's quite concerning, because fixing it will require a good amount of memcpy I'm afraid; checking whether the address is aligned would make things even slower, so I guess memcpy is the best option. That's going to cause a performance hit for sure. Do you have any idea on your side for a fast workaround?
It seems like there really should be a way to get the C compiler to basically do what it is doing now if the hardware supports unaligned loads/stores, but fall back on loading/storing individual bytes on hardware which doesn't (MIPS?). I haven't found anything, though. I was thinking about asking on SO… I'll try to do that later today and let you know how it goes.
If there is somewhere you basically want a fast memcpy, which is why you use int64_t or something instead of uint8_t, I was thinking it might be possible to use the new OpenMP SIMD support. Something like:
#include <stddef.h>
#include <stdint.h>

void not_memcpy (uint8_t* dest, const uint8_t* src, size_t size) {
  #pragma omp simd /* safelen(???), see discussion below */
  for (size_t i = 0; i < size; i++)
    dest[i] = src[i];
}
I haven't tried it to see if there is any speedup, though, and obviously it would require OpenMP 4.0 for there to be one. It would probably be okay to omit safelen for a memcpy replacement, but for a memmove replacement it would take a bit of thought… if you're replacing a loop on uint64_t it could obviously be at least sizeof(uint64_t), but I expect it would be much better if it were at least 16, and 32 or 64 would be better still.
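To make the safelen question concrete, here is a minimal sketch (my own illustration, not code from the thread; the function name and the value 16 are assumptions) of how the clause would be stated for an overlapping, memmove-style copy:

#include <stddef.h>
#include <stdint.h>

/* Hypothetical memmove-style forward copy, assuming dest <= src. The
 * safelen(16) clause tells the compiler that iterations at most 16 apart may
 * safely run concurrently, so this is only correct when the buffers do not
 * overlap or when dest precedes src by more than 16 bytes. */
void not_memmove_forward(uint8_t* dest, const uint8_t* src, size_t size) {
  #pragma omp simd safelen(16)
  for (size_t i = 0; i < size; i++)
    dest[i] = src[i];
}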
FWIW, my current understanding is that unaligned store/load is basically free on modern x86/x86_64 CPUs; the main danger is that the C compiler will auto-vectorize the code and the unaligned access will trap on vectors. On ARM the situation is similar, except I believe there is a significant penalty for unaligned access. MIPS doesn't support unaligned access… the CPU will trap, and by default Linux will currently catch that and emulate the request using safe instructions. Cost of the safer instructions aside (basically loading individual uint8_t values and shifting/oring them together), it's very expensive because of the whole trap/catch/retry thing.
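As an aside, that byte-by-byte assembly looks roughly like the following sketch (my own illustration; it assumes a little-endian value, which is not something stated in the thread):

#include <stdint.h>

/* Read a 32-bit little-endian value from a possibly unaligned address by
 * loading individual bytes and shifting/oring them together; no unaligned
 * load is ever issued, at the cost of four byte loads. */
static uint32_t read_le32_bytewise(const uint8_t* p) {
  return (uint32_t)p[0]
       | ((uint32_t)p[1] << 8)
       | ((uint32_t)p[2] << 16)
       | ((uint32_t)p[3] << 24);
}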
http://fastcompression.blogspot.com/2014/11/portability-woes-endianess-and.html has some good ideas, though it doesn't do much about the auto-vectorization concern…
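One of the ideas from that post is worth sketching here: a fixed-size memcpy through a local temporary, which a compiler can lower to a single unaligned load/store on hardware that allows it and to byte accesses elsewhere. A minimal, hedged sketch (the helper names are mine, not the project's):

#include <stdint.h>
#include <string.h>

/* Unaligned 64-bit load/store expressed as a fixed-size memcpy; compilers
 * typically inline this into a plain load/store where the target supports
 * unaligned access, and into safe byte accesses where it does not. */
static uint64_t load_u64(const void* p) {
  uint64_t v;
  memcpy(&v, p, sizeof v);
  return v;
}

static void store_u64(void* p, uint64_t v) {
  memcpy(p, &v, sizeof v);
}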
Thanks for your replies, I checked the link, which is very informative. I like the SIMD idea for memcpy, and apparently it is already implemented on some platforms (OS X). It does however require OpenMP 4 as you say... probably too restrictive for now. After further thinking, I think I might be able to find a very fast workaround for chameleon and cheetah encoding/decoding, but for lion things are different and I don't see a fast enough solution just yet.
> I like the SIMD idea for memcpy, and apparently it is already implemented on some platforms (OS X). It does however require OpenMP 4 as you say... probably too restrictive for now.
I'm not sure what you mean here. If you're talking about memcpy using SIMD, all platforms should be doing that. However, the memcpy library function has some overhead, since it takes some time to determine what method to use (depending on things like alignment). It's great for larger operations, but for smaller ones it is pretty expensive. That said, most compilers will actually inline many memcpy calls, especially for smaller buffers with sizes that are known at compile time, so if you can use fixed sizes memcpy would probably be fairly snappy. GCC has a __builtin_memcpy, but AFAIK it's unnecessary unless you compile with -fno-builtin.
If you're talking about OS X (i.e. clang) supporting OpenMP 4.0, it doesn't; hopefully the next version of clang will. GCC has supported it since 4.9. That said, you don't even have to put an #ifdef around it… if the compiler doesn't support OpenMP 4 the code will still work, it just will not use SIMD (unless the C compiler does it on its own). If you want to take a vastly different approach when OpenMP 4.0 isn't available, you can always check #if defined(_OPENMP) && (_OPENMP >= 201307).
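A minimal sketch of that check (my own illustration; the function name copy_bytes and the plain-loop fallback are assumptions, not anything from density):

#include <stddef.h>
#include <stdint.h>

void copy_bytes(uint8_t* dest, const uint8_t* src, size_t size) {
#if defined(_OPENMP) && (_OPENMP >= 201307)
  /* OpenMP 4.0 or newer: request SIMD explicitly. */
  #pragma omp simd
  for (size_t i = 0; i < size; i++)
    dest[i] = src[i];
#else
  /* No OpenMP 4.0: plain loop, which the compiler may still auto-vectorize. */
  for (size_t i = 0; i < size; i++)
    dest[i] = src[i];
#endif
}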
> I'm not sure what you mean here. If you're talking about memcpy using SIMD, all platforms should be doing that. However, the memcpy library function has some overhead, since it takes some time to determine what method to use (depending on things like alignment). It's great for larger operations, but for smaller ones it is pretty expensive. That said, most compilers will actually inline many memcpy calls, especially for smaller buffers with sizes that are known at compile time, so if you can use fixed sizes memcpy would probably be fairly snappy. GCC has a __builtin_memcpy, but AFAIK it's unnecessary unless you compile with -fno-builtin.
I'll perform a few tests later on, on OS X: memcpy vs. direct copy using uint types (unsafe due to alignment issues) vs. OpenMP copies.
> If you're talking about OS X (i.e. clang) supporting OpenMP 4.0, it doesn't; hopefully the next version of clang will. GCC has supported it since 4.9. That said, you don't even have to put an #ifdef around it… if the compiler doesn't support OpenMP 4 the code will still work, it just will not use SIMD (unless the C compiler does it on its own). If you want to take a vastly different approach when OpenMP 4.0 isn't available, you can always check #if defined(_OPENMP) && (_OPENMP >= 201307).
I was talking about this project: https://github.com/clang-omp/clang. It is very simple to deploy on OS X and it offers the omp simd pragma for clang. I'll check it out.
> I was talking about this project: https://github.com/clang-omp/clang. It is very simple to deploy on OS X and it offers the omp simd pragma for clang. I'll check it out.
AFAIK that is the project they're trying to merge into clang. Unfortunately this has been going on for several years.
Okay, I created a small code snippet to test all of this:
#include <omp.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/resource.h>
#include <sys/time.h>
#define MAX_SIZE (1 << 24)
#define MICROSECONDS 1000000.0
/* Copy using the library memcpy; the extra store varies per iteration so the
 * repeated copies cannot be optimized away. */
void method_memcpy(const unsigned char* input, unsigned char* output, const unsigned int size, const unsigned int iterations) {
    for(unsigned int j = 0; j < iterations; j ++) {
        memcpy(output, input, size);
        *(output + j % size) = 123;
    }
}

/* Plain byte-by-byte copy loop. */
void method_byte_to_byte_copy(const unsigned char* input, unsigned char* output, const unsigned int size, const unsigned int iterations) {
    for(unsigned int j = 0; j < iterations; j ++) {
        for(unsigned int i = 0; i < size; i ++)
            *(output + i) = *(input + i);
        *(output + j % size) = 123;
    }
}
/* Byte copy with an explicit OpenMP SIMD hint. */
void method_simd_copy(const unsigned char* input, unsigned char* output, const unsigned int size, const unsigned int iterations) {
    for(unsigned int j = 0; j < iterations; j ++) {
        #pragma omp simd
        for(unsigned int i = 0; i < size; i ++)
            *(output + i) = *(input + i);
        *(output + j % size) = 123;
    }
}
/* Copy through uint64_t pointers; unsafe on platforms that require alignment. */
void method_unsafe_8byte_to_8byte_copy(const unsigned char* input, unsigned char* output, const unsigned int size, const unsigned int iterations) {
    for(unsigned int j = 0; j < iterations; j ++) {
        for(unsigned int i = 0; i < (size / sizeof(uint64_t)); i ++)
            *((uint64_t*)output + i) = *((uint64_t*)input + i);
        output[j % size] = 123;
    }
}

/* Copy through uint32_t pointers; same alignment caveats. */
void method_unsafe_4byte_to_4byte_copy(const unsigned char* input, unsigned char* output, const unsigned int size, const unsigned int iterations) {
    for(unsigned int j = 0; j < iterations; j ++) {
        for(unsigned int i = 0; i < (size / sizeof(uint32_t)); i ++)
            *((uint32_t*)output + i) = *((uint32_t*)input + i);
        output[j % size] = 123;
    }
}

/* Copy through uint16_t pointers. */
void method_unsafe_2byte_to_2byte_copy(const unsigned char* input, unsigned char* output, const unsigned int size, const unsigned int iterations) {
    for(unsigned int j = 0; j < iterations; j ++) {
        for(unsigned int i = 0; i < (size / sizeof(uint16_t)); i ++)
            *((uint16_t*)output + i) = *((uint16_t*)input + i);
        output[j % size] = 123;
    }
}
/* Print the elapsed user time for one method, plus a "bogus" value read from
 * the output buffer so the copies are observable. */
void output_result(const char* title, const struct timeval* start, const struct timeval* stop, const unsigned int size, const unsigned int iterations, const unsigned int bogus) {
    double elapsed = ((stop->tv_sec * MICROSECONDS + stop->tv_usec) - (start->tv_sec * MICROSECONDS + start->tv_usec)) / MICROSECONDS;
    printf("%s\tsize = %d, iterations = %d, time = %.3lfs, bogus = %i\n", title, size, iterations, elapsed, bogus);
}
int main() {
    unsigned char* input = malloc(MAX_SIZE * sizeof(unsigned char));
    unsigned char* output = malloc(MAX_SIZE * sizeof(unsigned char));
    struct rusage usage;

    for(unsigned int i = 0; i < MAX_SIZE; i ++)
        *(input + i) = (unsigned char)i;

    unsigned int iterations = (1 << 30);
    for(unsigned int size = 2; size <= MAX_SIZE; size = size << 1, iterations = iterations >> 1) {
        // Memcpy
        getrusage(RUSAGE_SELF, &usage);
        struct timeval start = usage.ru_utime;
        method_memcpy((const unsigned char*)input, output, size, iterations);
        getrusage(RUSAGE_SELF, &usage);
        struct timeval stop = usage.ru_utime;
        output_result("MEMCPY\t\t", &start, &stop, size, iterations, (unsigned int)output[123 % size]);

        // Byte to byte
        getrusage(RUSAGE_SELF, &usage);
        start = usage.ru_utime;
        method_byte_to_byte_copy((const unsigned char*)input, output, size, iterations);
        getrusage(RUSAGE_SELF, &usage);
        stop = usage.ru_utime;
        output_result("BYTE TO BYTE\t", &start, &stop, size, iterations, (unsigned int)output[123 % size]);

        // SIMD copy
        getrusage(RUSAGE_SELF, &usage);
        start = usage.ru_utime;
        method_simd_copy((const unsigned char*)input, output, size, iterations);
        getrusage(RUSAGE_SELF, &usage);
        stop = usage.ru_utime;
        output_result("BYTE TO BYTE SIMD", &start, &stop, size, iterations, (unsigned int)output[123 % size]);

        // 8 byte to 8 byte
        getrusage(RUSAGE_SELF, &usage);
        start = usage.ru_utime;
        method_unsafe_8byte_to_8byte_copy((const unsigned char*)input, output, size, iterations);
        getrusage(RUSAGE_SELF, &usage);
        stop = usage.ru_utime;
        output_result("8 BYTES TO 8 BYTES", &start, &stop, size, iterations, (unsigned int)output[123 % size]);

        // 4 byte to 4 byte
        getrusage(RUSAGE_SELF, &usage);
        start = usage.ru_utime;
        method_unsafe_4byte_to_4byte_copy((const unsigned char*)input, output, size, iterations);
        getrusage(RUSAGE_SELF, &usage);
        stop = usage.ru_utime;
        output_result("4 BYTES TO 4 BYTES", &start, &stop, size, iterations, (unsigned int)output[123 % size]);

        // 2 byte to 2 byte
        getrusage(RUSAGE_SELF, &usage);
        start = usage.ru_utime;
        method_unsafe_2byte_to_2byte_copy((const unsigned char*)input, output, size, iterations);
        getrusage(RUSAGE_SELF, &usage);
        stop = usage.ru_utime;
        output_result("2 BYTES TO 2 BYTES", &start, &stop, size, iterations, (unsigned int)output[123 % size]);
    }

    free(input);
    free(output);
}
To launch it, I installed the latest clang-omp:
$ /usr/local/bin/clang-omp --version
clang version 3.5.0
Target: x86_64-apple-darwin14.3.0
Thread model: posix
Compilation was done with the following command:
$ /usr/local/bin/clang-omp -I/usr/local/include/libiomp/ -fopenmp -Ofast -fomit-frame-pointer -flto copy_study.c -o copy
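For reference, GCC 4.9 (which, as mentioned above, supports OpenMP 4.0) should accept an essentially equivalent invocation; this exact command line is my assumption and was not run as part of the thread:
$ gcc -fopenmp -Ofast -fomit-frame-pointer -flto copy_study.c -o copy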
Here is the resulting data as a table for reference (the top row indicates sizes in bytes):
And here is the resulting graph:
A few things clearly stand out:
And a few "strange" things can be seen and discarded:
Fixed in 240088c110180afc921aeac9e795b2297cf62fcb
ubsan detects a lot of undefined stores/loads:
I only tested chameleon there, but it's probably a good bet that cheetah and lion have similar issues.
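One way to reproduce that kind of report, offered here only as a suggestion (the file name is a placeholder, and this is not the invocation used above), is to build the code in question with the undefined-behaviour sanitizer, which flags misaligned loads and stores at runtime:
$ clang -fsanitize=undefined -g -O1 example.c -o example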