apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

[C++] String algorithm library for StringArray/BinaryArray #16192

Open asfimport opened 7 years ago

asfimport commented 7 years ago

This is a parent JIRA for starting a module for processing strings arranged in-memory in Arrow format. This will include using the re2 C++ regular expression library as well as other standard string manipulations (such as those found on Python's string objects).
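
For illustration, a minimal sketch of the kind of element-wise string operation in scope. Function names such as utf8_upper are from the pyarrow.compute module that later grew out of this work, shown here only as an illustration:

import pyarrow as pa
import pyarrow.compute as pc

arr = pa.array(["hello", None, "world"])

# A Python-string-style manipulation applied element-wise over a StringArray;
# nulls pass through unchanged.
pc.utf8_upper(arr)  # -> ["HELLO", null, "WORLD"]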

Reporter: Wes McKinney / @wesm

Note: This issue was originally created as ARROW-555. Please see the migration documentation for further details.

asfimport commented 5 years ago

Wes McKinney / @wesm: Now that re2 is in our toolchain, we can implement kernels for each type of regular expression operation.
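
For example, a sketch assuming the re2-backed regex kernels that eventually landed in pyarrow.compute (match_substring_regex, replace_substring_regex):

import pyarrow as pa
import pyarrow.compute as pc

arr = pa.array(["foo123", "bar", "baz456"])

# Membership test and substitution, both backed by re2
pc.match_substring_regex(arr, pattern=r"\d+")
# -> [true, false, true]
pc.replace_substring_regex(arr, pattern=r"\d+", replacement="#")
# -> ["foo#", "bar", "baz#"]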

asfimport commented 4 years ago

Wes McKinney / @wesm: cc @maartenbreddels

asfimport commented 4 years ago

Joris Van den Bossche / @jorisvandenbossche: Do we already have a good idea of how we want to approach this? There has been some discussion about implementing custom C++ kernels (similar to the existing kernels in the compute module) versus finding a way to re-use the scalar kernels that are already implemented for Gandiva.

For reference: Gandiva already has several string functions implemented. Here is an illustration using the Python interface for the "upper" function:


import pyarrow as pa
from pyarrow import gandiva

table = pa.table({'a': ['a', 'b', 'c']})

# Build a Gandiva expression tree that applies "upper" to field "a",
# then compile it into a projector.
builder = gandiva.TreeExprBuilder()
node_a = builder.make_field(table.schema.field("a"))
node_upper = builder.make_function("upper", [node_a], pa.string())
field_result = pa.field('res', pa.string())
expr = builder.make_expression(node_upper, field_result)
projector = gandiva.make_projector(table.schema, [expr], pa.default_memory_pool())

>>> projector.evaluate(table.to_batches()[0])
[<pyarrow.lib.StringArray object at 0x7fc324f71580>
 [
   "A",
   "B",
   "C"
 ]]
asfimport commented 4 years ago

Maarten Breddels / @maartenbreddels: Related: https://issues.apache.org/jira/browse/ARROW-7083

I will probably start working on this a few weeks from now. My initial intention would be to separate the algorithms as much as possible so it would be possible to add them both to gandiva and a 'bare' kernel, or with a minimal amount of refactoring.

@wesm: What's your reason for choosing re2? Gandiva and vaex both use pcre, but I have no strong preference (other than being somewhat familiar with pcre).

asfimport commented 4 years ago

Wes McKinney / @wesm: We've been having some discussions about this topic in other places, e.g. ARROW-7083. One idea that has been proposed is to generate single-function kernels at compile time based on the LLVM IR that Gandiva spits out. So the process would work like this:

asfimport commented 4 years ago

Antoine Pitrou / @pitrou: I think going through LLVM IR is a bit convoluted. More simply, since those functions are already raw C in Gandiva, we could reintegrate those C functions somewhere in Arrow (taking care that the Gandiva toolchain can still compile them to LLVM bitcode).

It would also avoid depending on LLVM for builds with Gandiva disabled.

asfimport commented 4 years ago

Antoine Pitrou / @pitrou: I'm assuming C functions btw, but those may just as well be C++ functions (with the C wrappers on the Gandiva side). However, they can't use certain C++ stdlib facilities such as iostream (hence the split between Decimal and BasicDecimal).

asfimport commented 4 years ago

Maarten Breddels / @maartenbreddels: What are the limitations, and are they documented somewhere? It might be good to keep those in mind.

asfimport commented 4 years ago

Wes McKinney / @wesm: @pitrou that seems reasonable; cross-compilation (where a code unit is compiled both into a static/shared lib and to LLVM IR at the same time) would indeed be easier. This is a popular technique (e.g. Apache Impala does a lot of it – see all the files with "-ir" in them in https://github.com/apache/impala/tree/master/be/src/exprs), so we should try not to reinvent the wheel.

asfimport commented 4 years ago

Wes McKinney / @wesm: We could even use Impala's string exprs (which is what Impala calls its "kernels") as a guideline for what we need to have available as Arrow kernels

https://github.com/apache/impala/blob/master/be/src/exprs/string-functions.h

asfimport commented 4 years ago

Micah Kornfield / @emkornfield: +1 for simplicity. I think it is unlikely I will have time to contribute to this effort in the near future.

asfimport commented 4 years ago

Wes McKinney / @wesm: Update: I'm in the middle of an overhaul of the API for implementing new Array functions / kernels, with the goal of making it much easier to add new functions (e.g. generating a string function given an inlineable implementation that computes a single value). Since I'm working on it right now, it will be done this month. Once that's done, I will probably ask someone from my team to make an initial cut at a precompiled string function set based on the functions that are already in Gandiva / LLVM codegen, and to add new functions (from e.g. Impala or other SQL engines) that are not yet present.

The work need not be monolithic, so as soon as the framework is in place it should be straightforward to add new functions and test them. Additionally, adding Python bindings for the new functions should be easy: all you will need is the name of the function you're calling, so some of the Cython binding boilerplate that exists now should also go away.
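
A plain-Python illustration of that idea (this is not the actual C++ framework; make_string_kernel is a hypothetical helper): generate an array-level function from a scalar-valued prototype.

import pyarrow as pa

def make_string_kernel(scalar_fn):
    # Hypothetical helper: apply scalar_fn element-wise, passing nulls
    # through, mimicking how the framework would lift a scalar prototype
    # into an Array kernel.
    def kernel(arr):
        return pa.array([scalar_fn(v.as_py()) if v.is_valid else None
                         for v in arr])
    return kernel

upper = make_string_kernel(str.upper)
upper(pa.array(["a", None, "c"]))  # -> ["A", null, "C"]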

asfimport commented 4 years ago

Maarten Breddels / @maartenbreddels: I am likely to be able to start working on strings in Arrow this month, so I think the timing is good. Some pointers/examples to get me started would be great.

asfimport commented 4 years ago

Wes McKinney / @wesm: Cool. I will circle back here once I have a PR up for the work I described in my comment, and will add an example string function to provide a template for adding more functions.

asfimport commented 4 years ago

Maarten Breddels / @maartenbreddels: Something to consider (or should I move this discussion to the mailing list?) is the support of ASCII vs UTF-8. I noticed the Gandiva code assumed ASCII (at least not UTF-8), while Arrow assumes strings are UTF-8 only. Having written the vaex string code, I'm pretty sure ASCII will be much faster, though (you know the byte length of each character in advance). Is there interest in supporting more than UTF-8, e.g. ASCII, or UTF-16/32? Or should it be UTF-8 only?
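
For reference, this is the split that pyarrow.compute eventually exposed: parallel ascii_* and utf8_* variants of many kernels. A small sketch, with results shown as comments:

import pyarrow as pa
import pyarrow.compute as pc

arr = pa.array(["café", "abc"])

# utf8_* kernels decode multi-byte characters; ascii_* variants only
# transform bytes in the ASCII range and are correspondingly cheaper.
pc.utf8_upper(arr)   # -> ["CAFÉ", "ABC"]
pc.ascii_upper(arr)  # -> ["CAFé", "ABC"]  (non-ASCII bytes untouched)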

asfimport commented 4 years ago

Wes McKinney / @wesm: Having ASCII versions of functions sounds fine to me. There is a PR up now for fast ASCII validation also

asfimport commented 4 years ago

Wes McKinney / @wesm: I just made a PR for the new kernels framework that I was talking about

https://github.com/apache/arrow/pull/7240

There's a little bit of work still to provide the machinery to generate string kernels from scalar-valued prototypes, but I was thinking I would do that sometime in the next few days and provide an example string kernel for you to use as a template for adding more kernels. Does that sound good?

asfimport commented 4 years ago

Maarten Breddels / @maartenbreddels: Sounds good. I think it would help me a lot to see str->scalar and str->str (and possibly a str->[str, str]) examples. They can be trivial, like always returning ["a", "b"], but with that I can probably get up to speed very quickly, if it's not too much to ask.

asfimport commented 4 years ago

Wes McKinney / @wesm: Yes, that's the idea. I can try to implement str.split, which would be String -> List<String> in Arrow types.
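
A sketch of those kernel shapes using functions that eventually landed in pyarrow.compute (split_pattern for the String -> List<String> case, utf8_length as a str->scalar-per-element example):

import pyarrow as pa
import pyarrow.compute as pc

arr = pa.array(["a,b", "c"])

# str -> list<str>: splitting yields a ListArray of strings
pc.split_pattern(arr, pattern=",")  # -> [["a", "b"], ["c"]]
# str -> scalar per element: character counts as an integer array
pc.utf8_length(arr)                 # -> [3, 1]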

asfimport commented 3 years ago

Wes McKinney / @wesm: What items are still outstanding here? Could we create additional issues and attach them for visibility?

asfimport commented 3 years ago

Neal Richardson / @nealrichardson: We have a few that aren't linked here; I can attach the ones I know of.

asfimport commented 3 years ago

Ian Cook / @ianmcook: I added ~10 issues for new kernels corresponding to commonly used SQL string functions.