Open asfimport opened 7 years ago
Wes McKinney / @wesm: Now that re2 is in our toolchain, we can implement kernels for each type of regular expression operation
Wes McKinney / @wesm: cc @maartenbreddels
Joris Van den Bossche / @jorisvandenbossche: Do we already have a good idea of how we want to approach this? Because I think there has been some discussion on implementing custom C++ kernels (similar to other existing kernels in the compute module) vs finding a way to re-use the scalar kernels that are already implemented for gandiva.
For reference: Gandiva already has several string functions implemented. Illustration with the python interface for the "upper" function:
import pyarrow as pa
from pyarrow import gandiva

table = pa.table({'a': ['a', 'b', 'c']})
builder = gandiva.TreeExprBuilder()
node_a = builder.make_field(table.schema.field("a"))
node_upper = builder.make_function("upper", [node_a], pa.string())
field_result = pa.field('res', pa.string())
expr = builder.make_expression(node_upper, field_result)
projector = gandiva.make_projector(table.schema, [expr], pa.default_memory_pool())
>>> projector.evaluate(table.to_batches()[0])
[<pyarrow.lib.StringArray object at 0x7fc324f71580>
[
"A",
"B",
"C"
]]
Maarten Breddels / @maartenbreddels: Related: https://issues.apache.org/jira/browse/ARROW-7083
I will probably start working on this a few weeks from now. My initial intention would be to separate the algorithms as much as possible, so that they could be added both to Gandiva and as 'bare' kernels with a minimal amount of refactoring.
@wesm: what's your reason to choose re2? Gandiva and vaex both use pcre, but I have no strong preference (except being a bit familiar with pcre).
Wes McKinney / @wesm: We've been having some discussions about this topic in other places, e.g. ARROW-7083. One idea that has been proposed is to generate single-function kernels at compile time based on the LLVM IR that Gandiva spits out. So the process would work like this:
Implement a generic "invoker" that takes a C function kernel (the result of compiling the LLVM IR produced by Gandiva) and evaluates it (with memory allocation, etc. as needed)
Then the LLVM runtime would not be required to use the output of this process.
This would require some investment of time (perhaps not that much) to set up the machinery to enable this, but it would seem to greatly simplify the process of implementing new kernels, especially simple elementwise functions (for numbers, strings, etc.)
We've been dancing around this idea for several months now so I would be interested to see if someone would be interested to explore this before tunneling too far in different directions.
cc @emkornfield @pitrou @fsaintjacques @jacques-n [~ravindra] for any comments / thoughts on whether what I've written above jibes with prior discussions.
Antoine Pitrou / @pitrou: I think going through LLVM IR is a bit convoluted. More simply, since those functions are already raw C in Gandiva, we could reintegrate those C functions somewhere in Arrow (taking care that the Gandiva toolchain can still compile them to LLVM bitcode).
It would also avoid depending on LLVM for builds with Gandiva disabled.
Antoine Pitrou / @pitrou: I'm assuming C functions, by the way, but those may just as well be C++ functions (with the C wrappers on the Gandiva side). However, they can't use certain C++ stdlib facilities such as iostream (hence the split between Decimal and BasicDecimal).
Maarten Breddels / @maartenbreddels: What are the limitations, and are they documented somewhere? It might be good to keep those in mind.
Wes McKinney / @wesm: @pitrou that seems reasonable, cross-compilation (where a code unit is compiled both into a static/shared lib and to LLVM IR at the same time) indeed would be easier. This is a popular technique (e.g. Apache Impala does a lot of it – see all files with "-ir" in them in https://github.com/apache/impala/tree/master/be/src/exprs) so we should try not to reinvent the wheel
Wes McKinney / @wesm: We could even use Impala's string exprs (which is what Impala calls its "kernels") as a guideline for what we need to have available as Arrow kernels
https://github.com/apache/impala/blob/master/be/src/exprs/string-functions.h
Micah Kornfield / @emkornfield: +1 for simplicity. I think it is unlikely I will have time to contribute to this effort in the near future.
Wes McKinney / @wesm: Update: I'm in the middle of an overhaul of the API for implementing new Array functions / kernels, with the goal of making it much easier to add new functions (e.g. generating a string function given an inlineable implementation of computing a single value). Once that's done (since I'm working on it right now, it will be this month) I will probably ask someone from my team to make an initial cut at a precompiled string function set based on the functions that are already in Gandiva / LLVM codegen and add new functions (from e.g. Impala or other SQL engines) that are not yet present. The work need not be monolithic so as soon as the framework is in place it should be straightforward to add new functions and test them. Additionally, adding Python bindings for the new functions should also be easy (all you will need is the name of the function you're calling, so some of the Cython binding boilerplate that exists now should also go away).
Maarten Breddels / @maartenbreddels: I am likely to be able to start working on strings in Arrow this month, so I think the timing is good. Some pointers/examples to get me started would be great.
Wes McKinney / @wesm: Cool. I will circle back here once I have a PR up for the work I described in my comment, and will add an example string function to provide a template for adding more functions.
Maarten Breddels / @maartenbreddels: Something to consider (or should I move this discussion to the mailing list?) is support for ASCII vs UTF-8. I noticed the Gandiva code assumes ASCII (at least, not UTF-8), while Arrow assumes strings are UTF-8 only. Having written the vaex string code, I'm pretty sure ASCII will be much faster (you know the byte length of a string in advance). Is there interest in supporting more than UTF-8, for instance ASCII, or UTF-16/32? Or should it be UTF-8 only?
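A small sketch of why the ASCII fast path matters: a byte-wise ASCII transform never changes string length, so the output can reuse the input offsets, while UTF-8 case mapping can change byte length and forces the offsets to be rebuilt.

```python
# ASCII upper-casing is a pure byte-wise transform: output length
# always equals input length, so offsets are unchanged.
def ascii_upper(data: bytes) -> bytes:
    return bytes(b - 32 if 0x61 <= b <= 0x7A else b for b in data)

print(ascii_upper(b"abc"))  # b'ABC'

# UTF-8 case mapping can change byte length: U+0131 (dotless i)
# occupies 2 bytes but uppercases to the 1-byte "I".
src = "\u0131"
up = src.upper()
print(len(src.encode("utf8")), len(up.encode("utf8")))  # 2 1
```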
Wes McKinney / @wesm: Having ASCII versions of functions sounds fine to me. There is a PR up now for fast ASCII validation also
Wes McKinney / @wesm: I just made a PR for the new kernels framework that I was talking about
https://github.com/apache/arrow/pull/7240
There's a little bit of work still to provide the machinery to generate string kernels from scalar-valued prototypes, but I was thinking I would do that sometime in the next few days and provide an example string kernel for you to use as a template for adding more kernels. Does that sound good?
Maarten Breddels / @maartenbreddels: Sounds good. I think it would help me a lot to see str->scalar and str->str (and possibly a str->[str, str]) example. They can be trivial, like always return ["a", "b"], but with that, I can probably get up to speed very quickly, if it's not too much to ask.
Wes McKinney / @wesm: Yes, that's the idea. I can try to implement str.split, which would be String -> List<String> in Arrow types.
Wes McKinney / @wesm: What items are still outstanding here, could we create additional issues and attach them for visibility?
Neal Richardson / @nealrichardson: We have a few that aren't linked here, I can attach what I know of.
This is a parent JIRA for starting a module for processing strings in-memory arranged in Arrow format. This will include using the re2 C++ regular expression library and other standard string manipulations (such as those found on Python's string objects).
Reporter: Wes McKinney / @wesm
Note: This issue was originally created as ARROW-555. Please see the migration documentation for further details.