Open asfimport opened 3 years ago
Antoine Pitrou / @pitrou: Is this different from ARROW-12046 ?
Benson Muite / @bkmgit: Yes, initial implementation does not allow for comparison of strings with keys, which is important for many applications. As you pointed out, a related issue for sorting is https://issues.apache.org/jira/browse/ARROW-12046
Antoine Pitrou / @pitrou: Ah, so this is a request to allow a custom comparison or key function (I suspect the latter)? If so, can you make the title clearer?
Antoine Pitrou / @pitrou: By the way, does it need to be a specific option? You may also compute the desired key column and sort/compare based on it...
Antoine Pitrou / @pitrou: cc @lidavidm for insight.
David Li / @lidavidm: Agreed, I'm not sure it's worth trying to allow custom comparison/key functions (I suspect even if we did, internally, we'd just compute a key column anyways). Could you provide perhaps an example of the use case? Unicode seems like it would cover multiple languages, unless you mean supporting non-Unicode encodings (e.g. Shift-JIS or Big5 or something).
Benson Muite / @bkmgit: The order given by Unicode may not match text application. Unicode has a collation algorithm motivating examples they give in https://www.unicode.org/reports/tr10/#Example_Differences_Table for developing this include
In a German Dictionary of < öf but in a German Phonebook öf < of
Being able to do fast sorting and comparison on text is important in a database and the unicode encodings do not give orderings that matche all applications. Thus, we want to enable application developers to choose an ordering appropriate for their applications, ideally without re-implementing comparison and sorting. It may be the case that language plugins or extensions are easier to support. An internal lookup table is ok, but the table size may vary from ~20 to ~5000 rows for a single language. If one needs to work with Emojis, things get more interesting. UTF8 can encode 1,112,064 code points and there are cases where two code points are used to encode one item, so something complete may have very poor worst case running time. One may also want to use another text encoding, for example UTF32, EUC-KR, SJIS, EUC-JP - especially since then sorting, searching and comparisons can then be done more efficiently. An interface that allows returning of a comparison may allow flexibility to adapt for different circumstances. Arrow allows for flexible schema, so optimizations may be possible which are not possible with a regular column.
Benson Muite / @bkmgit: Some of the issues on ordering are discussed in https://en.wikipedia.org/wiki/Alphabetical_order#Similar_orderings
An example sorting program for multilingual dictionaries is Msort available under GPL3, but the description should give some idea of flexibility that may be required.
Antoine Pitrou / @pitrou:
An interface that allows returning of a comparison may allow flexibility to adapt for different circumstances. Arrow allows for flexible schema, so optimizations may be possible which are not possible with a regular column.
Which optimizations do you have in mind? I'm not sure that calling a function pointer O(n log n) times is generally better than computing an additional column in O(n) time.
Also, the function pointer has to be "dynamically typed" since its exact signature will depend on the datatype being sorted. Perhaps this is solvable using a base class and some virtual methods...
Benson Muite / @bkmgit: Column stores are great for vectorization as explained in this tutorial, though maybe there is something better. My expectation is that the ordering comparison would use data that can be cached since it may be faster to stream all the data being compared O( column length ) and do a lookup O( log dictionary size ) for a total of O( column length * log alphabet size) operations, rather than stream the lookups past a small section of the data O(column length * alphabet size).. This depends on relative data sizes and constants determined by the hardware and software implementations though.
For numeric data, temporal data and text data where UTF8 encoding is sufficient for comparison, nothing special is needed and hardware support should be good. For text data where the ordering used by UTF8 is not appropriate, being able to either call a separate column of data or call a separate table organized as a tree or call a separate function is helpful. Calling a separate column of data would fit well within Arrow, but there may be situations where this is inappropriate and it may be good to allow for this.
Weston Pace / @westonpace: If a user has custom comparison function that they need to run on a column of strings, then they could just create a new kernel function.
If a user has a column of data that always needs to be compared in some consistent and unique way (e.g. they have a column of alphanumeric data and the less/greater operations should apply a "natural sort" when comparing strings with numbers) then the correct answer I believe is an extension type.
If a user needs to apply some kind of non-scalar comparison function (e.g. consuming an entire column at once, managing some kind of independent cache, converting data from column-major to row-major, etc.) then they can create a custom execution node.
So we have several extension points already. I'm not sure what value there is in creating another one. It is not clear to me how this would be different or more user friendly. I agree that a single, concrete example is probably a good starting point.
Antoine Pitrou / @pitrou: An extension type requires to rewrite all desired kernels, which is rather heavy-handed.
Weston Pace / @westonpace: Hmm, fair point. A similar conversation happened in the discussion of intervals. In general, we decided we do not want to allow intervals to be comparable. However, some frontends may want to emulate postgres, which does allow for comparable intervals. The hope was those frontends could cast those intervals to "postgres" intervals for comparability. So it might be nice if the solution could extend beyond just strings (but maybe 2 instances does not make a pattern and we can solve the interval thing elsewhere).
One challenge is that the comparison functions are used implicitly by quite a few kernels (e.g. max, min, partition, select_k, etc.). In addition, if these custom rules affect equality (this might be the case given we are talking about unicode) then we have to worry about hashing and all of the kernels / nodes that rely on hashing internally (dictionary encode, group by, etc.)
Todd Farmer / @toddfarmer: This issue was last updated over 90 days ago, which may be an indication it is no longer being actively worked. To better reflect the current state, the issue is being unassigned. Please feel free to re-take assignment of the issue if it is being actively worked, or if you plan to start that work soon.
String comparisons in C++ will use order by unicode. This may not be suitable in many language applications, for example when using characters from languages that use more than ASCII. Sorting algorithms can often allow for the use of custom comparison functions. It would be helpful to allow for this for the between kernel as well. Initial work on the between kernel is being tracked in https://github.com/apache/arrow/issues/25881
Reporter: Benson Muite / @bkmgit
Related issues:
Note: This issue was originally created as ARROW-14290. Please see the migration documentation for further details.