[C++][Python] UDF Optimizations

apache / arrow

[C++][Python] UDF Optimizations #31096

Open asfimport opened 2 years ago

asfimport commented 2 years ago

We need an interface that evaluates the memory footprint, execution time, and health of UDFs and returns a meaningful status, e.g. Status::HighMemoryUsageException() or Status::TimeLimitException().

Note: This is also aligned with resource monitoring in the parallel execution space.
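To make the request concrete, here is a minimal Python sketch of what such a monitoring wrapper could look like. It is not an existing Arrow API: `UdfStatus`, `UdfReport`, and `run_monitored` are hypothetical names invented for illustration, and the limits are arbitrary. It also only checks the limits after the UDF returns (it does not interrupt a running UDF), and `tracemalloc` only sees allocations made through Python's allocators, so memory allocated by Arrow's C++ memory pools would need a separate measurement.

```python
import time
import tracemalloc
from dataclasses import dataclass
from enum import Enum
from typing import Any, Callable, Optional


class UdfStatus(Enum):
    # Hypothetical status codes, loosely mirroring the names proposed in the issue.
    OK = "ok"
    HIGH_MEMORY_USAGE = "high_memory_usage"
    TIME_LIMIT = "time_limit"
    FAILED = "failed"


@dataclass
class UdfReport:
    status: UdfStatus
    result: Any
    elapsed_seconds: float
    peak_bytes: int
    error: Optional[BaseException] = None


def run_monitored(
    udf: Callable[..., Any],
    *args: Any,
    max_seconds: float,
    max_peak_bytes: int,
) -> UdfReport:
    """Run `udf(*args)` once and classify the run against the given limits."""
    result, error = None, None
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = udf(*args)
    except Exception as exc:  # the UDF itself is unhealthy
        error = exc
    finally:
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()

    if error is not None:
        status = UdfStatus.FAILED
    elif peak > max_peak_bytes:
        status = UdfStatus.HIGH_MEMORY_USAGE
    elif elapsed > max_seconds:
        status = UdfStatus.TIME_LIMIT
    else:
        status = UdfStatus.OK
    return UdfReport(status, result, elapsed, peak, error)


# Usage: a small, well-behaved UDF stays within both limits.
report = run_monitored(
    lambda xs: [x * 2 for x in xs],
    [1, 2, 3],
    max_seconds=10.0,
    max_peak_bytes=100_000_000,
)
print(report.status, report.result)
```

Returning a report object rather than raising lets the caller decide whether a limit violation is fatal, which is closer to how a status-based C++ interface would behave.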

Reporter: Vibhatha Lakmal Abeykoon / @vibhatha

Note: This issue was originally created as ARROW-15637. Please see the migration documentation for further details.

asfimport commented 2 years ago

Vibhatha Lakmal Abeykoon / @vibhatha: In this context we can also analyze the data conversions that may happen within UDFs for data structures not supported by Arrow. Most data science and data engineering applications in the Python space use Pandas- or NumPy-based data structures, so this won't be a serious problem, but it is worth keeping an eye on cases that fall outside that norm.

asfimport commented 2 years ago

Vibhatha Lakmal Abeykoon / @vibhatha: Also worth noting are the performance limitations of UDFs that are executed once per row.
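The per-row concern can be illustrated with a small, standard-library-only sketch. The "engine" here is a pair of hypothetical helper functions (`apply_per_row`, `apply_per_batch`), not Arrow code, and `CallCounter` is an illustrative wrapper. The point is only that a per-row UDF is invoked once per value, so the fixed cost of each call into user code scales with the row count, while a batch UDF pays that cost once per batch.

```python
from typing import Any, Callable, List, Sequence


class CallCounter:
    """Wraps a UDF and counts how many times the engine invokes it."""

    def __init__(self, fn: Callable[..., Any]) -> None:
        self.fn = fn
        self.calls = 0

    def __call__(self, *args: Any) -> Any:
        self.calls += 1
        return self.fn(*args)


def apply_per_row(udf: Callable[[Any], Any], column: Sequence[Any]) -> List[Any]:
    # One call into user code for every single value.
    return [udf(value) for value in column]


def apply_per_batch(
    udf: Callable[[Sequence[Any]], Sequence[Any]],
    column: Sequence[Any],
    batch_size: int,
) -> List[Any]:
    # One call into user code for every slice of `batch_size` values.
    out: List[Any] = []
    for start in range(0, len(column), batch_size):
        out.extend(udf(column[start:start + batch_size]))
    return out


column = list(range(10_000))
row_udf = CallCounter(lambda x: x + 1)
batch_udf = CallCounter(lambda xs: [x + 1 for x in xs])

per_row = apply_per_row(row_udf, column)
per_batch = apply_per_batch(batch_udf, column, batch_size=1024)

print(per_row == per_batch)  # True: same results either way
print(row_udf.calls)         # 10000 invocations
print(batch_udf.calls)       # 10 invocations
```

In a real engine the gap is usually larger than the call counts suggest, since each per-row call may also involve boxing a value into a language-level object, whereas a batch UDF can operate on a whole array at once.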

asfimport commented 2 years ago

Todd Farmer / @toddfarmer: This issue was last updated over 90 days ago, which may be an indication it is no longer being actively worked. To better reflect the current state, the issue is being unassigned. Please feel free to re-take assignment of the issue if it is being actively worked, or if you plan to start that work soon.

asfimport commented 1 year ago

Apache Arrow JIRA Bot: This issue was last updated over 90 days ago, which may be an indication it is no longer being actively worked. To better reflect the current state, the issue is being unassigned per project policy. Please feel free to re-take assignment of the issue if it is being actively worked, or if you plan to start that work soon.