apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0
14.45k stars 3.52k forks source link

[C++][Python] Unregister compute functions #31611

Open asfimport opened 2 years ago

asfimport commented 2 years ago

In general, when using UDFs, the user defines a function expecting a particular outcome. When building the program, there needs to be a way to update existing function kernels if it expands beyond what is planned before. In such situations, there should be a way to remove the existing definition and add a new definition. To enable this, the unregister functionality has to be included. 

Reporter: Vibhatha Lakmal Abeykoon / @vibhatha Assignee: Vibhatha Lakmal Abeykoon / @vibhatha

Related issues:

Note: This issue was originally created as ARROW-16211. Please see the migration documentation for further details.

asfimport commented 2 years ago

Weston Pace / @westonpace: Another task we should tackle as part of this PR that came up during the discussion of ARROW-15639: https://github.com/apache/arrow/pull/12590#discussion_r854069757

If python is finalized then any python UDFs that have been registered will be invalid. We should consider adding an atexit hook to pyarrow which unregisters all python UDFs.

This is something of a corner case, as finalization usually happens at process exit, but in some cases users may have multiple python interpreters or be restarting the python interpreter.

asfimport commented 2 years ago

Antoine Pitrou / @pitrou: Yes, an atexit hook is probably the way to go. It should be easy to do in Cython code.

asfimport commented 2 years ago

Yaron Gvili / @rtpsw: Another alternative to consider is registering Python UDFs to an extension registry instance that (1) is specific to the Python interpreter and (2) is linked to the default global one (so it can find both UDF and normal functions). This Python-specific registry would then be passed to be used by the execution engine. I think this way (only) the Python-specific registry would naturally get cleaned up on finalization of the Python interpreter.

asfimport commented 2 years ago

Yaron Gvili / @rtpsw: This second-layer-registry approach is good for another use case in which the user runs multiple execution engine invocations, either in sequence or in parallel, from the same Python interpreter and wants to keep separate the UDFs registered in each invocation.

asfimport commented 2 years ago

Yaron Gvili / @rtpsw: I created a PR for a nested extension-id-registry using the approach I proposed. Please let me know what you think, as I have my own use cases for this nested extension-registry-id.

asfimport commented 2 years ago

Yaron Gvili / @rtpsw: See this PR as a solution for this comment.

asfimport commented 2 years ago

Todd Farmer / @toddfarmer: This issue was last updated over 90 days ago, which may be an indication it is no longer being actively worked. To better reflect the current state, the issue is being unassigned per project policy. Please feel free to re-take assignment of the issue if it is being actively worked, or if you plan to start that work soon.

asfimport commented 2 years ago

Yaron Gvili / @rtpsw: I believe this issue is superseded by ARROW-16677 which allows creating a nested registry that can later be dropped lock-free along with all functions registered on it.

asfimport commented 2 years ago

Vibhatha Lakmal Abeykoon / @vibhatha: @rtpsw I still need to look into this and verify. Thanks for the note.

asfimport commented 2 years ago

Vibhatha Lakmal Abeykoon / @vibhatha: @rtpsw  definitely this would be helpful in cleaning up things per interpreter. I am curious about this case. 

Imagine there is a UDF created in Python which does a particular job, but next in the middle of the program we need that function to have a different behavior (meaning another function needs to replace this function). So we expect the same function name, but a different functionality. In such a case we still have to remove an existing function with the same name and function signature. We cannot check the logic of the already registered function and the new function (in my understanding). 

On top of my head what I imagine is an active interpreter, where we do a demo of UDFs. I create a function called "func" and write some functionality. Then I want to write another example in the same interpreter instance with the same function name but a different functionality itself. Would this encourage deleting the function explicitly?

Please correct me if I am wrong.

cc @westonpace  

asfimport commented 2 years ago

Yaron Gvili / @rtpsw: I think many PyArrow function execution code paths lead to Arrow C++ APIs with a configurable function registry. For example, the CallFunction API accepts a function registry via its ExecContext argument. So, one could create a first nested function registry holding a first implementation of a UDF, use this first registry for several function execution invocations, then create a second nested function registry holding a second implementation of the UDF, and switch over to using this second registry. The Python interpreter remains the same, and there's no need to remove the UDF - the first registry can just be dropped.

asfimport commented 2 years ago

Vibhatha Lakmal Abeykoon / @vibhatha: What if subset of functions registered in that particular registry is needed and we just need another subset of functions. Then we have to re-register them. What about that situation?

asfimport commented 2 years ago

Vibhatha Lakmal Abeykoon / @vibhatha: cc @rtpsw @westonpace  

asfimport commented 2 years ago

Yaron Gvili / @rtpsw: Consider a case where X, Y, and Z are disjoint sets of functions. One can register (the functions of) X in registry 1, then register Y in registry 2 whose parent is registry 1, and finally register Z in registry 3 whose parent is registry 2. With respect to the case described by Vibhata, I believe the default registry corresponds to registry 1, the first subset of functions corresponds to Y, and the second to Z. This works nicely when X, Y, and Z are known upfront; otherwise, one may need to register again. For example, suppose Y1 and Y2 are disjoint sets of functions whose union is Y, and one wants to register functions on top of those of X and Y1, then one would need to create a fresh registry whose parent is registry 1, and register Y1 again on this fresh registry. I think this is a reasonable result when not knowing upfront. In general, considering registries have extended scopes, I think it is better to create a fresh nested registry and keep it fixed while in use than to edit an existing one using removals while in use.

asfimport commented 2 years ago

Vibhatha Lakmal Abeykoon / @vibhatha: Yes this is a good point. I think if we go on this route, we must make sure we properly define these things in docs or provide examples of using the UDFs with a few diverse examples covering these angles. Eitherway, I personally like the nested registry idea, it gives the flexibility decide these things prior to writing the programme. In general, we could assume one would put required effort to clarify these and then write the programme. I think this is fair enough to proceed with this approach. 

Also, we may need to include a few test cases on behalf of this and test the idea and make sure if things are not exposed to Python to create the nested registries as well. I am still browsing through the said change (nested registries). Please correct me if I am wrong.

cc @westonpace  this approach sounds fair enough for to proceed. Appreciate your thoughts.

asfimport commented 2 years ago

Yaron Gvili / @rtpsw: Side note: I have developed PyArrow wrappers for the nested registries as part of a prototype for UDFs that I should be able to contribute when the time comes.

I agree documentation of nested registries can be improved, both inside the code and in doc pages. We do have a couple of test cases. Please create jiras for what you think is missing.

asfimport commented 2 years ago

Vibhatha Lakmal Abeykoon / @vibhatha: Yeah I went through these cases before. I am also trying to allocate some time on adding a few more cases while testing UDF for nested registries. Thanks for working on this. I will keep you in the loop if there are new Jiras coming up with this work.

asfimport commented 2 years ago

Joris Van den Bossche / @jorisvandenbossche: @rtpsw the test cases you link to are only about registering functions in nested registries, I think? But how would you use this in practice? You mentioned that in the CallFunction API, one can specify the registry to use through ExecContext. But that seems not very practical that you need to remember for each function in which registry it lives to ensure you pass the correct one when calling it. Also, that's only when you call it manually with CallFunction. But what if the UDF is part of a larger expression with multiple kernels that thus come from different registries. How do such things work? Does C++ automatically search the parent registry as well (that part is not really clear to me)?

In general having to keep track of registries and passing those, just to mimic the ability to re-register a UDF, seems quite complex. But I am wondering, for this specific case, wouldn't it be possible to allow to override an existing name when registering a kernel? It seems in C++ there is already a allow_overwrite option when adding functions, this could be exposed to the Python register function?

asfimport commented 2 years ago

Weston Pace / @westonpace: I think both use cases are probably useful and don't think either one precludes the other. For reference, when referring to UDFs, @rtpsw is (I'm pretty sure) referring to "embedded UDFs". In other words, UDFs whose code is embedded in a query plan (e.g. pickled python code). In these cases the UDF only really makes sense in the context of a single plan execution.

There is another case, where some set of UDFs are predefined and then referenced (e.g. by name) in incoming plans. In that scenario I think a nested registry is considerably less useful and the ability to unregister or override would be helpful.

How do such things work? Does C++ automatically search the parent registry as well (that part is not really clear to me)?

Yes, that is the current implementation.

asfimport commented 2 years ago

Joris Van den Bossche / @jorisvandenbossche:

There is another case, where some set of UDFs are predefined and then referenced (e.g. by name) in incoming plans. In that scenario I think a nested registry is considerably less useful and the ability to unregister or override would be helpful.

Yes, and that is the use case that I was talking about (since that is what the pyarrow register_scalar_function enabled you to do)

asfimport commented 2 years ago

Yaron Gvili / @rtpsw: Weston is correct that the main use case I designed nested registries for is embedded UDFs, or at least UDFs that are used in the context of a particular scope, such as a single plan's execution. However, see also the case I noted here.

But that seems not very practical that you need to remember for each function in which registry it lives to ensure you pass the correct one when calling it.

I believe you are referring to a burden on the user, but I don't think the user would actually need to do this. A normal case is when a first piece of code gets passed in a registry and wants to pass along a registry to a second piece of code. So, the first piece of code has two basic things it can do:

  1. Pass the registry unmodified; this keeps working in the same scope.
  2. Pass a nested registry wrapping it with some modifications; this creates a nested scope.

    In these cases, the registry being used at each piece of code is managed on stack, so when the second piece of code returns the first piece of code continues with its original registry, and the user need not manage registries manually. It is straightforward to extend this to other cases, like passing from a main thread to working threads.

    There is another case, where some set of UDFs are predefined and then referenced (e.g. by name) in incoming plans. In that scenario I think a nested registry is considerably less useful and the ability to unregister or override would be helpful. Yes, and that is the use case that I was talking about (since that is what the pyarrow register_scalar_function enabled you to do)

    Even in such a case where one must remove functions, I'd recommend creating a new registry instance with the desired functions removed (perhaps by initializing a builder from an existing registry instance) then to edit an existing registry instance that may be in use elsewhere, in order to make it much harder to (inadvertently) create side-effect or race conditions. Still, note that it is very easy for a nested registry to effectively support function removal, basically by registering a null function on a name, that overrides the same-named function of the parent registry; as usual, the nested registry can stay fixed once set up.

asfimport commented 2 years ago

Vibhatha Lakmal Abeykoon / @vibhatha: @rtpsw  I think what @jorisvandenbossche  referring to is the following process being a management for the user.

 I believe you are referring to a burden on the user, but I don't think the user would actually need to do this. A normal case  is when a first piece of code gets passed in a registry and wants to pass along a registry to a second piece of code. So, the first piece of code has two basic things it can do:

  1. Pass the registry unmodified; this keeps working in the same scope.
  2. Pass a nested registry wrapping it with some modifications; this creates a nested scope.

In the regular case it is just always pc.call_function(..), in this case we have to always make sure we do registry_x.call_function, isn't it? 

While with an approach of the ability to just drop what you don't need is way easier. I think just a usability piece considering one use case. 

May be we should also allow the ability to unregister/override functions. That would provide flexibility for the users to use the UDFs for the said scenarios.

asfimport commented 2 years ago

Yaron Gvili / @rtpsw: I'm not against a specific and simple solution for a simple use case, and you're welcome to pursue it. In this discussion, my main aim was to explain how all use cases discussed here are supported in a straightforward way using nested registries and without the need to modify a registry instance while in use.

In the regular case it is just always pc.call_function(..), in this case we have to always make sure we do registry_x.call_function, isn't it?

With this question, the discussion shifts from whether registry function removal is necessary (I argued it isn't) to how best to design a user API for calling registry functions in the context of at least this use case.

I argue we can design a user API that encapsulates the active registry, so that the function caller need not remember it, as follows. The execution context could manage a stack of nested registries, so that a call-function invocation would automatically lookup the registry at the top of the stack. When a piece of code wants to set up a nested registry for a second piece of code it intends to invoke, it does so by adding the nested registry to this stack, invoking the second piece of code, and popping the stack. This context stack management ensures the correct registry instance is always in scope.

Of course, that we can doesn't mean that we must. My aim in this point is to show that there is a well-designed alternative for registry function removal.

While with an approach of the ability to just drop what you don't need is way easier.

IMHO, it's a bit easier (e.g., removing a function from an existing registry instance vs creating a nested registry instance and removing from it) but less safe (potential side-effects and race conditions). A design tension between usability and safety is common, and calls for prioritization. My vote is to prioritize safety.

May be we should also allow the ability to unregister/override functions. That would provide flexibility for the users to use the UDFs for the said scenarios.

If I'm forced to accept this way of registry editing, I'd say that then the docs would need to be very clear about the safety issues this practice raises and to describe a safer alternative as discussed here. I think if the safer alternative is not implemented via an easy API (like the one I described) then users will surely practice the less-safe alternative. This is why I view that adding these docs is a bit better but still insufficient for safety.

asfimport commented 2 years ago

Weston Pace / @westonpace: I'm +1 on Yaron's points regarding safety.

Use cases:

asfimport commented 2 years ago

Vibhatha Lakmal Abeykoon / @vibhatha: I am also not against the safety features discussed. +1 on that. 

Only concern is about the usability. Not sure if that can be handled in this Jira. 

I like the idea about an API to make use the nested registries as a stack and make a caller which scans through the registries and finding the function being called. Should we create a separate ticket for this? 

Regarding the usecases, 

Probably the second use case can be added to this JIRA. But the third case seems to be a separate ticket.

cc @westonpace  

asfimport commented 2 years ago

Yaron Gvili / @rtpsw:

Only concern is about the usability. Not sure if that can be handled in this Jira. 

Makes sense to tackle usability of the nested registry approach in a separate jira, perhaps in the context of one of the use cases Weston listed.

I like the idea about an API to make use the nested registries as a stack and make a caller which scans through the registries and finding the function being called. Should we create a separate ticket for this? 

The caller would only need to pick up and query the nested registry instance at the top of the stack. This is because each nested registry instance except the bottom one is linked to a registry instance one step lower on the stack via a parent link and will refer to its parent automatically.

Yes, I think it is clearer to handle the implementation of a nested-registry-stack in one issue and any use of it (like in CallFunction or pc.call_function) in another issue.

asfimport commented 2 years ago

Vibhatha Lakmal Abeykoon / @vibhatha: +1 , I agree. @jorisvandenbossche  any thoughts? 

asfimport commented 2 years ago

Antoine Pitrou / @pitrou: I agree with the general idea of nested compute registries. For now we should keep those explicit.

asfimport commented 2 years ago

Vibhatha Lakmal Abeykoon / @vibhatha: Great, I will update this PR accordingly.

asfimport commented 1 year ago

Yaron Gvili / @rtpsw: Which PR would this be?

asfimport commented 1 year ago

Vibhatha Lakmal Abeykoon / @vibhatha: This is the PR

https://github.com/apache/arrow/pull/14308

– Vibhatha Abeykoon