apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

[C++] String algorithm library for StringArray/BinaryArray #16192

Open asfimport opened 7 years ago

asfimport commented 7 years ago

This is a parent JIRA for starting a module for processing strings arranged in-memory in Arrow format. This will include using the re2 C++ regular expression library as well as other standard string manipulations (such as those found on Python's string objects).
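
For illustration, a minimal sketch of the kind of element-wise string operation in scope. Function names such as utf8_upper are from the pyarrow.compute module that later grew out of this work, shown here only as an illustration:

import pyarrow as pa
import pyarrow.compute as pc

arr = pa.array(["hello", None, "world"])

# A Python-string-style manipulation applied element-wise over a StringArray;
# nulls pass through unchanged.
pc.utf8_upper(arr)  # -> ["HELLO", null, "WORLD"]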

Reporter: Wes McKinney / @wesm

Note: This issue was originally created as ARROW-555. Please see the migration documentation for further details.

asfimport commented 5 years ago

Wes McKinney / @wesm: Now that re2 is in our toolchain, we can implement kernels for each type of regular expression operation.
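
For example, a sketch assuming the re2-backed regex kernels that eventually landed in pyarrow.compute (match_substring_regex, replace_substring_regex):

import pyarrow as pa
import pyarrow.compute as pc

arr = pa.array(["foo123", "bar", "baz456"])

# Membership test and substitution, both backed by re2
pc.match_substring_regex(arr, pattern=r"\d+")
# -> [true, false, true]
pc.replace_substring_regex(arr, pattern=r"\d+", replacement="#")
# -> ["foo#", "bar", "baz#"]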

asfimport commented 4 years ago

Wes McKinney / @wesm: cc @maartenbreddels

asfimport commented 4 years ago

Joris Van den Bossche / @jorisvandenbossche: Do we already have a good idea of how we want to approach this? There has been some discussion about implementing custom C++ kernels (similar to the existing kernels in the compute module) versus finding a way to re-use the scalar kernels that are already implemented for Gandiva.

For reference: Gandiva already has several string functions implemented. Here is an illustration using the Python interface for the "upper" function:


import pyarrow as pa
from pyarrow import gandiva

table = pa.table({'a': ['a', 'b', 'c']})

# Build a Gandiva expression tree that applies "upper" to field "a",
# then compile it into a projector.
builder = gandiva.TreeExprBuilder()
node_a = builder.make_field(table.schema.field("a"))
node_upper = builder.make_function("upper", [node_a], pa.string())
field_result = pa.field('res', pa.string())
expr = builder.make_expression(node_upper, field_result)
projector = gandiva.make_projector(table.schema, [expr], pa.default_memory_pool())

>>> projector.evaluate(table.to_batches()[0])
[<pyarrow.lib.StringArray object at 0x7fc324f71580>
 [
   "A",
   "B",
   "C"
 ]]
asfimport commented 4 years ago

Maarten Breddels / @maartenbreddels: Related: https://issues.apache.org/jira/browse/ARROW-7083

I will probably start working on this a few weeks from now. My initial intention would be to separate the algorithms as much as possible so it would be possible to add them both to gandiva and a 'bare' kernel, or with a minimal amount of refactoring.

@wesm: What's your reason for choosing re2? Gandiva and vaex both use pcre, but I have no strong preference (other than being somewhat familiar with pcre).

asfimport commented 4 years ago

Wes McKinney / @wesm: We've been having some discussions about this topic in other places, e.g. ARROW-7083. One idea that has been proposed is to generate single-function kernels at compile time based on the LLVM IR that Gandiva spits out. So the process would work like this:

asfimport commented 4 years ago

Antoine Pitrou / @pitrou: I think going through LLVM IR is a bit convoluted. More simply, since those functions are already raw C in Gandiva, we could reintegrate those C functions somewhere in Arrow (taking care that the Gandiva toolchain can still compile them to LLVM bitcode).

It would also avoid depending on LLVM for builds with Gandiva disabled.

asfimport commented 4 years ago

Antoine Pitrou / @pitrou: I'm assuming C functions btw, but those may just as well be C++ functions (with the C wrappers on the Gandiva side). However, they can't use certain C++ stdlib facilities such as iostream (hence the split between Decimal and BasicDecimal).

asfimport commented 4 years ago

Maarten Breddels / @maartenbreddels: What are the limitations, and are they documented somewhere? It might be good to keep those in mind.

asfimport commented 4 years ago

Wes McKinney / @wesm: @pitrou that seems reasonable; cross-compilation (where a code unit is compiled both into a static/shared lib and to LLVM IR at the same time) would indeed be easier. This is a popular technique (e.g. Apache Impala does a lot of it – see all the files with "-ir" in them in https://github.com/apache/impala/tree/master/be/src/exprs), so we should try not to reinvent the wheel.

asfimport commented 4 years ago

Wes McKinney / @wesm: We could even use Impala's string exprs (which is what Impala calls its "kernels") as a guideline for what we need to have available as Arrow kernels

https://github.com/apache/impala/blob/master/be/src/exprs/string-functions.h

asfimport commented 4 years ago

Micah Kornfield / @emkornfield: +1 for simplicity. I think it is unlikely I will have time to contribute to this effort in the near future.

asfimport commented 4 years ago

Wes McKinney / @wesm: Update: I'm in the middle of an overhaul of the API for implementing new Array functions / kernels, with the goal of making it much easier to add new functions (e.g. generating a string function given an inlineable implementation that computes a single value). Since I'm working on it right now, it will be done this month. Once that's done, I will probably ask someone from my team to make an initial cut at a precompiled string function set based on the functions that are already in Gandiva / LLVM codegen, and to add new functions (from e.g. Impala or other SQL engines) that are not yet present.

The work need not be monolithic, so as soon as the framework is in place it should be straightforward to add new functions and test them. Additionally, adding Python bindings for the new functions should be easy: all you will need is the name of the function you're calling, so some of the Cython binding boilerplate that exists now should also go away.
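
A plain-Python illustration of that idea (this is not the actual C++ framework; make_string_kernel is a hypothetical helper): generate an array-level function from a scalar-valued prototype.

import pyarrow as pa

def make_string_kernel(scalar_fn):
    # Hypothetical helper: apply scalar_fn element-wise, passing nulls
    # through, mimicking how the framework would lift a scalar prototype
    # into an Array kernel.
    def kernel(arr):
        return pa.array([scalar_fn(v.as_py()) if v.is_valid else None
                         for v in arr])
    return kernel

upper = make_string_kernel(str.upper)
upper(pa.array(["a", None, "c"]))  # -> ["A", null, "C"]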

asfimport commented 4 years ago

Maarten Breddels / @maartenbreddels: I am likely to be able to start working on strings in Arrow this month, so I think the timing is good. Some pointers/examples to get me started would be great.

asfimport commented 4 years ago

Wes McKinney / @wesm: Cool. I will circle back here once I have a PR up for the work I described in my comment, and will add an example string function to provide a template for adding more functions.

asfimport commented 4 years ago

Maarten Breddels / @maartenbreddels: Something to consider (or should I move this discussion to the mailing list?) is the support of ASCII vs UTF-8. I noticed the Gandiva code assumed ASCII (at least not UTF-8), while Arrow assumes strings are UTF-8 only. Having written the vaex string code, I'm pretty sure ASCII will be much faster, though (you know the byte length of each character in advance). Is there interest in supporting more than UTF-8, e.g. ASCII, or UTF-16/32? Or should it be UTF-8 only?
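
For reference, this is the split that pyarrow.compute eventually exposed: parallel ascii_* and utf8_* variants of many kernels. A small sketch, with results shown as comments:

import pyarrow as pa
import pyarrow.compute as pc

arr = pa.array(["café", "abc"])

# utf8_* kernels decode multi-byte characters; ascii_* variants only
# transform bytes in the ASCII range and are correspondingly cheaper.
pc.utf8_upper(arr)   # -> ["CAFÉ", "ABC"]
pc.ascii_upper(arr)  # -> ["CAFé", "ABC"]  (non-ASCII bytes untouched)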

asfimport commented 4 years ago

Wes McKinney / @wesm: Having ASCII versions of functions sounds fine to me. There is a PR up now for fast ASCII validation also

asfimport commented 4 years ago

Wes McKinney / @wesm: I just made a PR for the new kernels framework that I was talking about

https://github.com/apache/arrow/pull/7240

There's a little bit of work still to provide the machinery to generate string kernels from scalar-valued prototypes, but I was thinking I would do that sometime in the next few days and provide an example string kernel for you to use as a template for adding more kernels. Does that sound good?

asfimport commented 4 years ago

Maarten Breddels / @maartenbreddels: Sounds good. I think it would help me a lot to see str->scalar and str->str (and possibly a str->[str, str]) examples. They can be trivial, like always returning ["a", "b"], but with that I can probably get up to speed very quickly, if it's not too much to ask.

asfimport commented 4 years ago

Wes McKinney / @wesm: Yes, that's the idea. I can try to implement str.split, which would be String -> List<String> in Arrow types.
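
A sketch of those kernel shapes using functions that eventually landed in pyarrow.compute (split_pattern for the String -> List<String> case, utf8_length as a str->scalar-per-element example):

import pyarrow as pa
import pyarrow.compute as pc

arr = pa.array(["a,b", "c"])

# str -> list<str>: splitting yields a ListArray of strings
pc.split_pattern(arr, pattern=",")  # -> [["a", "b"], ["c"]]
# str -> scalar per element: character counts as an integer array
pc.utf8_length(arr)                 # -> [3, 1]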

asfimport commented 3 years ago

Wes McKinney / @wesm: What items are still outstanding here? Could we create additional issues and attach them for visibility?

asfimport commented 3 years ago

Neal Richardson / @nealrichardson: We have a few that aren't linked here; I can attach the ones I know of.

asfimport commented 3 years ago

Ian Cook / @ianmcook: I added ~10 issues for new kernels corresponding to commonly used SQL string functions.