apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0
14.49k stars 3.52k forks source link

[C++] String comparison in between ternary kernel #29865

Open asfimport opened 3 years ago

asfimport commented 3 years ago

String comparisons in C++ will use order by unicode. This may not be suitable in many language applications, for example when using characters from languages that use more than ASCII.   Sorting algorithms can often allow for the use of custom comparison functions.  It would be helpful to allow for this for the between kernel as well.  Initial work on the between kernel is being tracked in https://github.com/apache/arrow/issues/25881

Reporter: Benson Muite / @bkmgit

Related issues:

Note: This issue was originally created as ARROW-14290. Please see the migration documentation for further details.

asfimport commented 3 years ago

Antoine Pitrou / @pitrou: Is this different from ARROW-12046 ?

asfimport commented 3 years ago

Benson Muite / @bkmgit: Yes, initial implementation does not allow for comparison of strings with keys, which is important for many applications. As you pointed out, a related issue for sorting is https://issues.apache.org/jira/browse/ARROW-12046

asfimport commented 3 years ago

Antoine Pitrou / @pitrou: Ah, so this is a request to allow a custom comparison or key function (I suspect the latter)? If so, can you make the title clearer?

asfimport commented 3 years ago

Antoine Pitrou / @pitrou: By the way, does it need to be a specific option? You may also compute the desired key column and sort/compare based on it...

asfimport commented 3 years ago

Antoine Pitrou / @pitrou: cc @lidavidm for insight.

asfimport commented 3 years ago

David Li / @lidavidm: Agreed, I'm not sure it's worth trying to allow custom comparison/key functions (I suspect even if we did, internally, we'd just compute a key column anyways). Could you provide perhaps an example of the use case? Unicode seems like it would cover multiple languages, unless you mean supporting non-Unicode encodings (e.g. Shift-JIS or Big5 or something).

asfimport commented 3 years ago

Benson Muite / @bkmgit: The order given by Unicode may not match text application.  Unicode has a collation algorithm  motivating examples they give in https://www.unicode.org/reports/tr10/#Example_Differences_Table for developing this include

asfimport commented 3 years ago

Benson Muite / @bkmgit: Some of the issues on ordering are discussed in https://en.wikipedia.org/wiki/Alphabetical_order#Similar_orderings

An example sorting program for multilingual dictionaries is Msort available under GPL3, but the description should give some idea of flexibility that may be required.

asfimport commented 3 years ago

Antoine Pitrou / @pitrou:

An interface that allows returning of a comparison may allow flexibility to adapt for different circumstances. Arrow allows for flexible schema, so optimizations may be possible which are not possible with a regular column.

Which optimizations do you have in mind? I'm not sure that calling a function pointer O(n log n) times is generally better than computing an additional column in O(n) time.

Also, the function pointer has to be "dynamically typed" since its exact signature will depend on the datatype being sorted. Perhaps this is solvable using a base class and some virtual methods...

asfimport commented 3 years ago

Benson Muite / @bkmgit: Column stores are great for vectorization  as explained in this tutorial, though maybe there is something better.  My expectation is that the ordering comparison would use data that can be cached since it may be faster to stream all the data being compared O( column length ) and do a lookup O( log dictionary size ) for a total of O( column length * log alphabet size) operations, rather than stream the lookups past a small section of the data O(column length * alphabet size).. This depends on relative data sizes and constants determined by the hardware and software implementations though.

For numeric data, temporal data and text data where UTF8 encoding is sufficient for comparison, nothing special is needed and hardware support should be good.  For text data where the ordering used by UTF8 is not appropriate, being able to either call a separate column of data or call a separate table organized as a tree or call a separate function is helpful.  Calling a separate column of data would fit well within Arrow, but there may be situations where this is inappropriate and it may be good to allow for this.

asfimport commented 3 years ago

Weston Pace / @westonpace: If a user has custom comparison function that they need to run on a column of strings, then they could just create a new kernel function.

If a user has a column of data that always needs to be compared in some consistent and unique way (e.g. they have a column of alphanumeric data and the less/greater operations should apply a "natural sort" when comparing strings with numbers) then the correct answer I believe is an extension type.

If a user needs to apply some kind of non-scalar comparison function (e.g. consuming an entire column at once, managing some kind of independent cache, converting data from column-major to row-major, etc.) then they can create a custom execution node.

So we have several extension points already. I'm not sure what value there is in creating another one. It is not clear to me how this would be different or more user friendly. I agree that a single, concrete example is probably a good starting point.

asfimport commented 3 years ago

Antoine Pitrou / @pitrou: An extension type requires to rewrite all desired kernels, which is rather heavy-handed.

asfimport commented 3 years ago

Weston Pace / @westonpace: Hmm, fair point. A similar conversation happened in the discussion of intervals. In general, we decided we do not want to allow intervals to be comparable. However, some frontends may want to emulate postgres, which does allow for comparable intervals. The hope was those frontends could cast those intervals to "postgres" intervals for comparability. So it might be nice if the solution could extend beyond just strings (but maybe 2 instances does not make a pattern and we can solve the interval thing elsewhere).

One challenge is that the comparison functions are used implicitly by quite a few kernels (e.g. max, min, partition, select_k, etc.). In addition, if these custom rules affect equality (this might be the case given we are talking about unicode) then we have to worry about hashing and all of the kernels / nodes that rely on hashing internally (dictionary encode, group by, etc.)

asfimport commented 2 years ago

Todd Farmer / @toddfarmer: This issue was last updated over 90 days ago, which may be an indication it is no longer being actively worked. To better reflect the current state, the issue is being unassigned. Please feel free to re-take assignment of the issue if it is being actively worked, or if you plan to start that work soon.