VectorCamp / vectorscan

A portable fork of the high-performance regular expression matching library
https://www.vectorcamp.gr/project/vectorscan/
Other
509 stars 55 forks source link

HS_FLAG_UTF8 flag doesn't seem to work as expected on aarch64 platforms #135

Closed iavjkae closed 1 year ago

iavjkae commented 2 years ago

Here are my test result on aarch64

pattern: 星{2} scan text: 星星点灯 flags: HS_FLAG_UTF8 | HS_FLAG_SINGLEMATCH expect: matched actual: unmatched

on linux x86_64 platform this works fine


env

test code


  std::string pattern = u8"星{2}";
  hs_compile_error_t *error = nullptr;
  hs_database_t *db         = nullptr;
  int flags = HS_FLAG_UTF8 | HS_FLAG_SINGLEMATCH;

  hs_error_t ret = hs_compile(pattern.c_str(), flags, HS_MODE_BLOCK,
                              nullptr,&db,&error);
  if (ret != HS_SUCCESS) {
      std::cout << "hs_compile error msg: " << error->message << ". expression: " << error->expression << std::endl;
      hs_free_compile_error(error);
      error = nullptr;
  }
  ASSERT_EQ(ret, HS_SUCCESS);

  hs_scratch_t *scratch = nullptr;
  ret = hs_alloc_scratch(db, &scratch);
  ASSERT_EQ(ret, HS_SUCCESS);

  std::string scanText = u8"星星点灯";
  static int matchCount = 0;
  ret = hs_scan(db, scanText.c_str(), scanText.size(), 0, scratch,
                [](unsigned int id, unsigned long long from,
                   unsigned long long to, unsigned int flags,
                   void *context) -> int {
        printMatchedInfo(id, from, to, flags, context);
        matchCount++;
        return 0;
      },
      nullptr);

  EXPECT_EQ(ret, HS_SUCCESS);
  EXPECT_EQ(matchCount,1);

  // cleanup
  hs_free_database(db);
  hs_free_scratch(scratch);
  matchCount = 0;

cmake file

cmake_minimum_required(VERSION 3.16)
cmake_policy(SET CMP0074 NEW)
cmake_policy(SET CMP0115 NEW)
if(CMAKE_VERSION VERSION_GREATER_EQUAL "3.24.0")
    cmake_policy(SET CMP0135 NEW)
endif()
project(utf8-test)

set(CMAKE_EXPORT_COMPILE_COMMANDS ON)
include(FetchContent)

FetchContent_Declare(vectorscan URL https://github.com/VectorCamp/vectorscan/archive/refs/tags/vectorscan/5.4.8.tar.gz)
FetchContent_MakeAvailable(vectorscan)
FetchContent_GetProperties(vectorscan)

FetchContent_Declare(googletest URL https://github.com/google/googletest/archive/refs/tags/release-1.12.1.tar.gz)
FetchContent_MakeAvailable(googletest)
find_package(Boost REQUIRED)

file(GLOB TEST_SRCS tests/*.cc)
add_executable(utf8-test ${TEST_SRCS})
target_include_directories(utf8-test PRIVATE ${vectorscan_SOURCE_DIR}/src)
target_link_libraries(utf8-test PUBLIC hs gtest gmock)

steps to reproduce

# on project dir
mkdir build
cd build
cmake ..
cmake --build .
./utf8-test

cmake output

-- The C compiler identification is GNU 9.5.0 -- The CXX compiler identification is GNU 9.5.0 -- Detecting C compiler ABI info -- Detecting C compiler ABI info - done -- Check for working C compiler: /usr/local/bin/gcc - skipped -- Detecting C compile features -- Detecting C compile features - done -- Detecting CXX compiler ABI info -- Detecting CXX compiler ABI info - done -- Check for working CXX compiler: /usr/local/bin/g++ - skipped -- Detecting CXX compile features -- Detecting CXX compile features - done CMake Deprecation Warning at build/_deps/vectorscan-src/CMakeLists.txt:1 (cmake_minimum_required): Compatibility with CMake < 2.8.12 will be removed from a future version of CMake.

Update the VERSION argument value or use a ... suffix to tell CMake that the project does not need compatibility with older versions.

-- Performing Test ARCH_X86_64 -- Performing Test ARCH_X86_64 - Failed -- Performing Test ARCH_IA32 -- Performing Test ARCH_IA32 - Failed -- Performing Test ARCH_AARCH64 -- Performing Test ARCH_AARCH64 - Success -- Performing Test ARCH_ARM32 -- Performing Test ARCH_ARM32 - Failed -- Performing Test ARCH_PPC64EL -- Performing Test ARCH_PPC64EL - Failed -- Default build type 'Release with debug info' -- using release build -- Boost version: 1.67.0 -- Found Python: /usr/bin/python3.7 (found version "3.7.3") found components: Interpreter -- Build date: 2022-10-31 -- Building static libraries -- gcc version 9.5.0 CMake Warning at build/_deps/vectorscan-src/CMakeLists.txt:184 (message): Something went wrong determining gcc tune: -mtune=armv8-a not valid, falling back to -mtune=native

-- ARCH_C_FLAGS : -- ARCH_CXX_FLAGS : -- g++ version 9.5.0 -- Looking for include file unistd.h -- Looking for include file unistd.h - found -- Looking for C++ include arm_neon.h -- Looking for C++ include arm_neon.h - found -- Looking for posix_memalign -- Looking for posix_memalign - found -- Looking for _aligned_malloc -- Looking for _aligned_malloc - not found -- Performing Test HAS_C_HIDDEN -- Performing Test HAS_C_HIDDEN - Success -- Performing Test HAS_CXX_HIDDEN -- Performing Test HAS_CXX_HIDDEN - Success -- Looking for _LIBCPP_VERSION -- Looking for _LIBCPP_VERSION - not found -- generator is Unix Makefiles -- Performing Test HAS_C_ATTR_IFUNC -- Performing Test HAS_C_ATTR_IFUNC - Success -- Performing Test HAVE_NEON -- Performing Test HAVE_NEON - Success -- Performing Test HAVE_CC_BUILTIN_ASSUME_ALIGNED -- Performing Test HAVE_CC_BUILTIN_ASSUME_ALIGNED - Success -- Performing Test HAVE_CXX_BUILTIN_ASSUME_ALIGNED -- Performing Test HAVE_CXX_BUILTIN_ASSUME_ALIGNED - Success -- Performing Test HAVEBUILTIN_CONSTANT_P -- Performing Test HAVEBUILTIN_CONSTANT_P - Success -- Performing Test C_FLAG_Wvla -- Performing Test C_FLAG_Wvla - Success -- Performing Test C_FLAG_Wpointer_arith -- Performing Test C_FLAG_Wpointer_arith - Success -- Performing Test C_FLAG_Wstrict_prototypes -- Performing Test C_FLAG_Wstrict_prototypes - Success -- Performing Test C_FLAG_Wmissing_prototypes -- Performing Test C_FLAG_Wmissing_prototypes - Success -- Performing Test CXX_FLAG_Wvla -- Performing Test CXX_FLAG_Wvla - Success -- Performing Test CXX_FLAG_Wpointer_arith -- Performing Test CXX_FLAG_Wpointer_arith - Success -- Performing Test CC_SELF_ASSIGN -- Performing Test CC_SELF_ASSIGN - Failed -- Performing Test CXX_SELF_ASSIGN -- Performing Test CXX_SELF_ASSIGN - Failed -- Performing Test CC_PAREN_EQUALITY -- Performing Test CC_PAREN_EQUALITY - Failed -- Performing Test CXX_UNUSED_CONST_VAR -- Performing Test CXX_UNUSED_CONST_VAR - Success -- Performing Test CXX_IGNORED_ATTR -- Performing Test CXX_IGNORED_ATTR - Success -- Performing Test CXX_REDUNDANT_MOVE -- Performing Test CXX_REDUNDANT_MOVE - Success -- Performing Test CXX_WEAK_VTABLES -- Performing Test CXX_WEAK_VTABLES - Failed -- Performing Test CXX_MISSING_DECLARATIONS -- Performing Test CXX_MISSING_DECLARATIONS - Success -- Performing Test CXX_UNUSED_LOCAL_TYPEDEFS -- Performing Test CXX_UNUSED_LOCAL_TYPEDEFS - Success -- Performing Test CXX_WUNUSED_VARIABLE -- Performing Test CXX_WUNUSED_VARIABLE - Success -- Performing Test CC_STRINGOP_OVERFLOW -- Performing Test CC_STRINGOP_OVERFLOW - Success -- Building for current host CPU: -march=armv8-a -mtune=native -- Looking for mmap -- Looking for mmap - found -- Doxygen not found, unable to generate API reference -- Sphinx not found, unable to generate developer reference -- Found PkgConfig: /usr/bin/pkg-config (found version "0.29") -- Checking for module 'libpcre>=8.41' -- No package 'libpcre' found -- PCRE version 8.41 or above not found -- PCRE 8.41 or above not found -- Could not find libpcap - some examples will not be built -- Performing Test CMAKE_HAVE_LIBC_PTHREAD -- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed -- Looking for pthread_create in pthreads -- Looking for pthread_create in pthreads - not found -- Looking for pthread_create in pthread -- Looking for pthread_create in pthread - found -- Found Threads: TRUE -- Found Boost: /usr/include (found version "1.67.0") -- Configuring done -- Generating done -- Build files have been written to: /home/skywo/workzone/vectorscan-utf8-test/build

result snapshot

image

markos commented 2 years ago

I confirm this. I was able to replicate it on my arm system and even on latest develop branch.

usmannisar commented 2 years ago

Can you check if compiling it with the -fsigned-char flag works? The C standard doesn't specify the signedness of char. On x86 char is signed by default while on Arm it is unsigned by default. Compiling with -fsigned-char will force char to be signed on Arm

iavjkae commented 2 years ago

Can you check if compiling it with the -fsigned-char flag works? The C standard doesn't specify the signedness of char. On x86 char is signed by default while on Arm it is unsigned by default. Compiling with -fsigned-char will force char to be signed on Arm

thank you for your suggestion. I recompiled with -fsigned-char flag and now it works. thanks. @markos Should it be compiled with the -fsigned-char flag by default on aarch64 platforms?

markos commented 2 years ago

It's probably a revisit of https://github.com/VectorCamp/vectorscan/issues/98. Reopening that and I will try to do a proper fix and replace char with explicit int8_t/uint8_t where appropriate.

dmbaggett commented 1 year ago

I have a bit more to report on this, in case it helps anyone else. On Graviton3 (AWS arm64 CPU with SVE extensions) running Ubuntu 22, building vectorscan from the 5.4.7 tag with GCC 11 -O2 or higher produces a library that crashes on startup when used with rspamd; it seems like GCC just generates bad code for this arch for some reason. (I did not attempt to debug what exactly it was doing.)

Building with clang-14 works much better and supports arbitrary optimization properly (I used -O3 -march=armv8.4-a+crc+crypto+sve) but this surfaces the issue mentioned above about signedness of char. I was able to address this by patching the Ragel inputs (.rl files) to add

alphtype unsigned char;

after each machine declaration. I also modifed ragel.cmake to call ragel with -G0 to force Ragel to emit a goto-based tokenizer instead of a table-based one, to get around char signedness mismatch compiler warnings.

Despite all that, I was still only able to get a working Vectorscan on Graviton3 with 5.4.7 code. When I build from 5.4.8 or current head code, rspamd just aborts at startup in hs_compile_multi. I didn't try to debug this either.

Finally, I noticed that HS_PATCH is still defined as 0 even though it seems like it should be 7 (for version 5.4.7).

markos commented 1 year ago

@dmbaggett Apologies for the delay. I can confirm that adding this line in Parser.rl this fixes the test on aarch64. I will add them to all the .rl files as you suggested. Regarding the failure for rspamd with 5.4.8, this is probably: https://github.com/VectorCamp/vectorscan/issues/140. A new release will be done shortly with all the fixes.

markos commented 1 year ago

yeah, a small update, this breaks the x86 tests, sigh, I'll have to find a way to solve it for both platforms.

markos commented 1 year ago

We can confirm this is fixed in #141 . A new release is going to happen very soon.