Open cansjt opened 2 years ago
Grepping through the code on the main branch, I cannot find any place where the __init_kwargs attribute is actually ever read. It is set in different places, but not read:
$ git grep _init_kwargs
airflow/models/baseoperator.py:406: if not hasattr(self, '_BaseOperator__init_kwargs'):
airflow/models/baseoperator.py:407: self._BaseOperator__init_kwargs = {}
airflow/models/baseoperator.py:413: self._BaseOperator__init_kwargs.update(kwargs) # type: ignore
airflow/models/baseoperator.py:680: __init_kwargs: Dict[str, Any]
airflow/models/baseoperator.py:755: self.__init_kwargs = {}
airflow/models/baseoperator.py:1007: if key in self.__init_kwargs:
airflow/models/baseoperator.py:1008: self.__init_kwargs[key] = value
airflow/models/baseoperator.py:1168: result.__init_kwargs = init_kwargs = {}
airflow/models/baseoperator.py:1170: if k in ("_BaseOperator__instantiated", "_BaseOperator__init_kwargs"):
airflow/models/baseoperator.py:1178: for k, v in self.__dict__["_BaseOperator__init_kwargs"]:
airflow/models/baseoperator.py:1466: '_BaseOperator__init_kwargs',
What am I missing?
I think if you want to propose something, it's better to open a PR and discuss it there, adding your explanation over the code. This will make for a far more productive discussion than trying to wrap one's head around code copied and pasted from various places. Reading that takes a lot of time and energy from anyone looking at it, and making a draft PR with what you propose to do is far better IMHO. I have no big knowledge about this part, but I kinda dread having to take a look and try to understand what you want to do and why, because of all the copy & pasted code.
Just sayin'.
@cansjt do you plan to open a PR?
I'm facing the same problem in 2.5.0. Can I work on this?
sure. assigned
I have noticed that this issue also leads to slightly more severe problems when Apache Airflow is deployed on Kubernetes. The problem seems to always result in the first pod of a task being caught in the Error state while the task still finishes. I've been debugging this, but to no avail. Shall I raise a new issue? Perhaps a better description would make it easier for someone to pick this up.
Yes. If you can refer to that one and have a super-easy reproducible path, creating a new issue and marking it as "Related to:" is a good idea. We can then close that one as duplicate.
This issue has been automatically marked as stale because it has been open for 365 days without any activity. There has been several Airflow releases since last activity on this issue. Kindly asking to recheck the report against latest Airflow version and let us know if the issue is reproducible. The issue will be closed in next 30 days if no further activity occurs from the issue author.
Apache Airflow version
2.3.3 (latest released)
What happened
When implementing a custom operator I stumbled on the following issue:
- a field is declared as one that should be shallow copied (it is listed in the <OperatorClass>.shallow_copy_attrs class attribute);
- the value of that field is passed to the operator's __init__() method;
- the arguments passed to the __init__() method are captured in the instance's _BaseOperator__init_kwargs attribute;
- when the operator is copied, _BaseOperator__init_kwargs is also copied;
- if the value was passed to __init__() then it is deepcopied anyway, when copying the content of the _BaseOperator__init_kwargs dictionary, despite being declared as something that should not be deepcopied.
Note that there is an additional difficulty to this problem: the names of the kwargs do not necessarily match the names of the instance attributes. It can be easily worked around by adding both names to the class' shallow_copy_attrs list. That makes things a bit redundant, though.
Not sure what _BaseOperator__init_kwargs is used for, but if one can deepcopy an operator, I cannot help but wonder why it is needed?
What you think should happen instead
The arguments passed to the __init__() method should be shallow copied as expected / requested by the operator implementor, following the contract that attribute values listed in shallow_copy_attrs should be shallow copied.
How to reproduce
I discovered the issue because, in a custom operator, I was passing an object (from a third-party package) that has a mis-implemented __getattr__() and it was getting copied anyway, making my DAG fall into an infinite recursion when attempting to copy it.
Note that for brevity I also took a little shortcut: in the real case, the faulty instance is not directly attached to the operator instance but is an attribute of another object that is itself attached to the operator. That could make shallow_copy_attrs effective in the real case, whereas it is ineffective in the first example here. One can consider that the adapter class, in the examples below, takes the role of the intermediate object. Still, the intermediate object being deepcopied when it shouldn't be makes setting shallow_copy_attrs ineffective.
This first example shows the initial situation:
Running the code above fails with the exception:
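The failure mode itself can be sketched without Airflow. The snippet below is an illustrative stand-in, not the example from this report: `Misbehaved` and `Target` are hypothetical names, and the `__getattr__` shown is just one common way such a method ends up mis-implemented (it reads `self._target` without guarding against that attribute being absent). `copy.deepcopy()` first creates a blank instance and then probes it with `hasattr(..., "__setstate__")`, which is enough to start the recursion.

```python
import copy


class Target:
    """Hypothetical object being proxied."""

    value = 42


class Misbehaved:
    """Hypothetical third-party proxy with an unguarded __getattr__."""

    def __init__(self, target):
        self._target = target

    def __getattr__(self, name):
        # Bug: on a blank instance (as created by copy/pickle via __new__),
        # `_target` is missing, so `self._target` calls __getattr__ again,
        # and again, until the recursion limit is reached.
        return getattr(self._target, name)


def copy_fails_with_recursion(obj):
    """Return True if deep-copying `obj` raises RecursionError."""
    try:
        copy.deepcopy(obj)
    except RecursionError:
        return True
    return False


proxy = Misbehaved(Target())
print(proxy.value)                       # → 42, normal delegation still works
print(copy_fails_with_recursion(proxy))  # → True
```

Any code path that deep-copies such an object, directly or through a container holding it, hits the same error.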
Here is a way to work around the problem, assuming the adapter class below can somehow reconstruct the faulty instance. In my case it is possible (not shown here for brevity):
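One possible shape for such an adapter, as a sketch only. The adapter from this report is not reproduced here, so `RebuildingAdapter`, its `factory` argument and `FragileClient` are all hypothetical. The assumption is that the adapter keeps a recipe for building the third-party object and implements `__deepcopy__` so that copies carry the recipe rather than a copy of the fragile instance.

```python
import copy


class FragileClient:
    """Hypothetical stand-in for the third-party object that must not be copied."""

    def __init__(self, url):
        self.url = url

    def __deepcopy__(self, memo):
        # Stand-in for the real failure (an infinite recursion in the report).
        raise RuntimeError("FragileClient must not be deep-copied")


class RebuildingAdapter:
    """Holds a zero-argument factory and builds the wrapped object lazily."""

    def __init__(self, factory):
        self._factory = factory
        self._instance = None

    @property
    def instance(self):
        if self._instance is None:
            self._instance = self._factory()
        return self._instance

    def __deepcopy__(self, memo):
        # Share the factory; never touch the wrapped instance while copying.
        clone = RebuildingAdapter(self._factory)
        memo[id(self)] = clone
        return clone


adapter = RebuildingAdapter(lambda: FragileClient("https://example.invalid"))
_ = adapter.instance              # force creation of the fragile object
clone = copy.deepcopy(adapter)    # succeeds: the instance itself is not copied
print(clone.instance.url)                    # rebuilt from the factory
print(clone.instance is adapter.instance)    # → False
```

The copy rebuilds its own instance on first use, so this only fits objects that are cheap and safe to recreate.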
It's a bit of work but until the third party library is fixed, that works for me.
Now if I uncomment the misbehavedparam kwarg in the above example:
The infinite recursion is back again, for the reasons explained above (copy of the _BaseOperator__init_kwargs attribute, which has captured a reference to the faulty instance).
So for now, to work around the problem, I have to set the instance attribute outside of the constructor. I added a setter (:cry:) to wrap it with the adapter:
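Why moving the assignment out of the constructor helps can be shown with a stand-in, not Airflow itself. `MiniOperator` is hypothetical and only mimics two behaviours described in this report: constructor kwargs are captured in a dict, and `__deepcopy__` honours `shallow_copy_attrs` for instance attributes but deep-copies everything else, including the captured kwargs. `Probe` is also hypothetical; it counts deep copies instead of recursing.

```python
import copy


class Probe:
    """Hypothetical object that records how often it is deep-copied."""

    deep_copies = 0

    def __deepcopy__(self, memo):
        Probe.deep_copies += 1
        return Probe()


class MiniOperator:
    """Simplified stand-in for an operator; not Airflow's BaseOperator."""

    shallow_copy_attrs = ("client",)

    def __init__(self, **kwargs):
        self._init_kwargs = dict(kwargs)   # mimics the capture of init kwargs
        self.client = kwargs.get("client")

    def set_client(self, client):
        # Used after construction, so nothing lands in _init_kwargs.
        self.client = client

    def __deepcopy__(self, memo):
        result = type(self).__new__(type(self))
        memo[id(self)] = result
        for name, value in self.__dict__.items():
            if name in self.shallow_copy_attrs:
                setattr(result, name, copy.copy(value))
            else:
                setattr(result, name, copy.deepcopy(value, memo))
        return result


def deep_copies_when(build):
    """Deep-copy the operator returned by `build`; return Probe deep-copy count."""
    Probe.deep_copies = 0
    copy.deepcopy(build())
    return Probe.deep_copies


def via_constructor():
    return MiniOperator(client=Probe())


def via_setter():
    op = MiniOperator()
    op.set_client(Probe())
    return op


print(deep_copies_when(via_constructor))  # → 1, reached through _init_kwargs
print(deep_copies_when(via_setter))       # → 0, only the shallow copy happens
```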
Operating System
Debian
Versions of Apache Airflow Providers
not relevant.
Deployment
Other
Deployment details
not relevant
Anything else
We could make a special case of the copy of that attribute. There is already one for the copy of _BaseOperator__instantiated. The remaining question is how we want to handle that special case:
- Have users list kwargs as part of the shallow_copy_attrs list? I find that this exposes implementation details to users and burdens them with things they should not care about.
- Do something a bit more complex, but also probably slower? E.g. I was thinking of doing something like:
    shallow_map = {}
    for k, v in self.__dict__.items():
-        if k == "_BaseOperator__instantiated":
+        if k in ("_BaseOperator__instantiated", "_BaseOperator__init_kwargs"):
            # Don't set this until the _end_, as it changes behaviour of __setattr__
            continue
        # (the existing deep/shallow copy of the attribute onto result stays as it is)
+        # Remember which copy was stored on result for this value
+        shallow_map[id(v)] = getattr(result, k)
+    result.__init_kwargs = init_kwargs = {}
+    for k, v in self.__dict__["_BaseOperator__init_kwargs"].items():
+        id_ = id(v)
+        if id_ in shallow_map:
+            init_kwargs[k] = shallow_map[id_]
+        elif id_ in memo:
+            init_kwargs[k] = memo[id_]
+        else:
+            init_kwargs[k] = copy.deepcopy(v, memo)
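To make the idea concrete, here is a self-contained sketch of the same approach on a stand-in class. `MiniOperator` and `Probe` are hypothetical, and the real `BaseOperator.__deepcopy__` does more than this (for instance the `__instantiated` handling), which is left out. The explicit `memo` lookup is also dropped, since `copy.deepcopy(value, memo)` already consults `memo` itself.

```python
import copy


class Probe:
    """Hypothetical object that records how often it is deep-copied."""

    deep_copies = 0

    def __deepcopy__(self, memo):
        Probe.deep_copies += 1
        return Probe()


class MiniOperator:
    """Simplified stand-in for an operator; not Airflow's BaseOperator."""

    shallow_copy_attrs = ("client",)

    def __init__(self, **kwargs):
        self._init_kwargs = dict(kwargs)   # mimics the capture of init kwargs
        self.client = kwargs.get("client")
        self.retries = kwargs.get("retries", 0)

    def __deepcopy__(self, memo):
        result = type(self).__new__(type(self))
        memo[id(self)] = result
        shallow_map = {}  # id(original value) -> shallow copy stored on result
        for name, value in self.__dict__.items():
            if name == "_init_kwargs":
                continue  # rebuilt separately below
            if name in self.shallow_copy_attrs:
                copied = copy.copy(value)
                shallow_map[id(value)] = copied
            else:
                copied = copy.deepcopy(value, memo)
            setattr(result, name, copied)
        init_kwargs = {}
        for key, value in self._init_kwargs.items():
            if id(value) in shallow_map:
                # Reuse the shallow copy instead of deep-copying the value.
                init_kwargs[key] = shallow_map[id(value)]
            else:
                init_kwargs[key] = copy.deepcopy(value, memo)
        result._init_kwargs = init_kwargs
        return result


op = MiniOperator(client=Probe(), retries=3)
clone = copy.deepcopy(op)
print(Probe.deep_copies)                             # → 0
print(clone._init_kwargs["client"] is clone.client)  # → True
print(clone._init_kwargs["retries"])                 # → 3
```

Matching on `id(value)` rather than on the kwarg name sidesteps the case where the kwarg and the attribute are named differently.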
Shall we use the memo dict instead of the additional shallow_map dict? It could prevent the same issue from happening further down the line, if the same value is somehow referenced elsewhere.
We should also note that keeping the copy of this __init_kwargs dict means you leave the burden of "fencing" against misbehaved objects outside of the operator's __init__(). Meaning you force users to do something like:
Instead of letting the operator deal with it internally:
Are you willing to submit PR?
Code of Conduct