DLTcollab / sse2neon

A translator from Intel SSE intrinsics to Arm/Aarch64 NEON implementation
MIT License

Implement `__rdtsc` #530

Closed Cuda-Chen closed 2 years ago

Cuda-Chen commented 2 years ago

I am currently implementing the _rdtsc Intel intrinsic. I have followed the instructions for adding a test case. However, whenever I run make check on an Intel platform, I get the following compile error:

$ make check
g++ -o tests/binding.o -Wall -Wcast-qual -I. -maes -mpclmul -mssse3 -msse4.2 -std=gnu++14 -c -MMD -MF tests/binding.o.d tests/binding.cpp
g++ -o tests/common.o -Wall -Wcast-qual -I. -maes -mpclmul -mssse3 -msse4.2 -std=gnu++14 -c -MMD -MF tests/common.o.d tests/common.cpp
g++ -o tests/impl.o -Wall -Wcast-qual -I. -maes -mpclmul -mssse3 -msse4.2 -std=gnu++14 -c -MMD -MF tests/impl.o.d tests/impl.cpp
tests/impl.cpp: In function ‘SSE2NEON::result_t SSE2NEON::test_rdtsc(const SSE2NEON::SSE2NEONTestImpl&, uint32_t)’:
tests/impl.cpp:9402:22: error: ‘_rdtsc’ was not declared in this scope; did you mean ‘it_rdtsc’?
 9402 |     uint64_t start = _rdtsc();
      |                      ^~~~~~
      |                      it_rdtsc

For your convenience, here are the changes I have made:

```cpp
#elif defined(__aarch64__)
FORCE_INLINE uint64_t _rdtsc(void)
{
    uint64_t val;

    /* According to ARM DDI 0487F.c, from Armv8.0 to Armv8.5 inclusive, the
     * system counter is at least 56 bits wide; from Armv8.6, the counter
     * must be 64 bits wide.  So the system counter could be less than 64
     * bits wide, which is indicated by the flag 'cap_user_time_short'
     * being true.
     */
    asm volatile("mrs %0, cntvct_el0" : "=r"(val));

    return val;
}
#endif
```

- tests/impl.h
```cpp
#define INTRIN_FOREACH(TYPE)         \
...
    TYPE(rdtsc)                      \
```

Finally, thanks for your help!

jserv commented 2 years ago

You don't have to implement _rdtsc for x86/x86-64, since the intrinsic should be available there by including <x86intrin.h>. Instead, the sse2neon.h header should provide the Arm/AArch64 counterpart.
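
To illustrate the split described above, here is a minimal sketch, assuming GCC or Clang. On x86 the intrinsic comes from <x86intrin.h>; on AArch64 a local definition stands in for what sse2neon.h would provide (it repeats the register read from the patch earlier in this thread). `read_timestamp` is a hypothetical helper name used only for this example, and other targets are deliberately left out.

```cpp
#include <cstdint>

#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>  // GCC and Clang make _rdtsc() available here
#elif defined(__aarch64__)
// Stand-in for the definition sse2neon.h would supply on AArch64.
static inline uint64_t _rdtsc(void)
{
    uint64_t val;
    // Read the virtual counter register, as in the patch above.
    asm volatile("mrs %0, cntvct_el0" : "=r"(val));
    return val;
}
#else
#error "This sketch only covers x86 and AArch64"
#endif

// Hypothetical helper: callers use one name on every supported target.
static inline uint64_t read_timestamp(void)
{
    return static_cast<uint64_t>(_rdtsc());
}
```

With this arrangement the test code can call `_rdtsc()` unchanged on both platforms, provided the x86 build actually includes <x86intrin.h>.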

jserv commented 2 years ago

For an ARMv7-A implementation of _rdtsc, you can check gperftools/src/base/cycleclock.h. Quoted:

> V7 is the earliest arch that has a standard cyclecount

Related discussions: https://stackoverflow.com/questions/40454157/is-there-an-equivalent-instruction-to-rdtsc-in-arm

Cuda-Chen commented 2 years ago

Thanks for the help, @jserv. I will work on the ARMv7-A part!

Cuda-Chen commented 2 years ago

I am currently porting the _rdtsc x86 intrinsic to ARMv7. On ARMv7, the cycle count is usually read from PMCCNTR. Accessing this register requires the program to run at PL1 or higher, or in user mode with PMUSERENR.EN == 1. However, PMUSERENR is zero in the test suite's QEMU environment, and I can't change it because that environment runs in user mode.

As such, I have come up with the following two solutions, and I would like to know which one is acceptable for this project:

  1. Run the test suite's QEMU environment in privileged mode.
  2. Fall back to a syscall such as gettimeofday() if we can't set PMUSERENR (in the Linux kernel, this kind of syscall is able to access PMCCNTR).

jserv commented 2 years ago

> As such, I have come up with the following two solutions, and I would like to know which one is acceptable for this project:
>
>   1. Run the test suite's QEMU environment in privileged mode.
>   2. Fall back to a syscall such as gettimeofday() if we can't set PMUSERENR (in the Linux kernel, this kind of syscall is able to access PMCCNTR).

For Armv7-A targets, we can start by providing the OS-assisted fallback. Further exploration can follow later.
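
A rough sketch of what such an OS-assisted fallback could look like, under stated assumptions: the names `rdtsc_fallback` and `usec_from_parts` are hypothetical and not part of sse2neon, the coprocessor encodings are the ARMv7-A ones for PMUSERENR, PMCNTENSET and PMCCNTR, and any PMCR divider setting is ignored. On ARMv7 it reads PMCCNTR only when user-mode access is enabled and otherwise falls back to gettimeofday(). Note that the two paths return different units (cycles versus microseconds), so values are only comparable within one path.

```cpp
#include <cassert>
#include <cstdint>
#include <sys/time.h>  // gettimeofday (POSIX)

// Hypothetical helper: fold a (seconds, microseconds) pair into one
// microsecond count.
static inline uint64_t usec_from_parts(uint64_t sec, uint64_t usec)
{
    assert(usec < 1000000);  // tv_usec is always below one second
    return sec * 1000000ULL + usec;
}

// Hypothetical name; a sketch, not the sse2neon implementation.
static inline uint64_t rdtsc_fallback(void)
{
#if defined(__arm__) && defined(__ARM_ARCH) && __ARM_ARCH >= 7
    uint32_t pmuserenr, pmcntenset, pmccntr;
    // PMUSERENR: bit 0 (EN) set means user mode may access the PMU.
    asm volatile("mrc p15, 0, %0, c9, c14, 0" : "=r"(pmuserenr));
    if (pmuserenr & 1) {
        // PMCNTENSET: bit 31 set means the cycle counter is enabled.
        asm volatile("mrc p15, 0, %0, c9, c12, 1" : "=r"(pmcntenset));
        if (pmcntenset & 0x80000000U) {
            // PMCCNTR: the 32-bit cycle counter itself.
            asm volatile("mrc p15, 0, %0, c9, c13, 0" : "=r"(pmccntr));
            return pmccntr;
        }
    }
#endif
    // OS-assisted path: wall-clock time in microseconds.
    struct timeval tv;
    gettimeofday(&tv, nullptr);
    return usec_from_parts(static_cast<uint64_t>(tv.tv_sec),
                           static_cast<uint64_t>(tv.tv_usec));
}
```

On targets other than 32-bit Arm the preprocessor guard removes the register reads, so only the gettimeofday() path remains. That is also the path the user-mode QEMU setup described above would take, since PMUSERENR is zero there.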