abelcheung / types-lxml

Complete lxml external type annotation
Apache License 2.0
39 stars 5 forks source link

`Self` narrows with custom element class #51

Closed scanny closed 5 months ago

scanny commented 6 months ago

element: Self is used in multiple places in the stub for etree._Element. Here's a typical example: https://github.com/abelcheung/types-lxml/blob/main/lxml-stubs/etree/_element.pyi#L97

This produces over-narrowing of the type when used with custom element classes, like this:

class Frombus(etree._Element):
    def add_child(self):
        self.insert(0, element_that_is_not_a_frombus)
# --------------- ^^^ Argument of type "NotAFrombus" cannot be assigned to parameter
#                     "element" of type "Self@Frombus" in function insert

This is because Self takes the type of self (Frombus in this case) and only wants to allow elements of the same type to be inserted.

I think the solution is to use "_Element" in most (or maybe all) places where Self appears within the _Element stub, which I expect was the original intent (that element can be any subtype of _Element). If you add from __future__ import annotations at the top of the file I believe you can leave off the double-quotes.

This works as intended when I patch in these changes locally.

abelcheung commented 6 months ago

@scanny It looks like you want to construct a heterogeneous tree (please correct me otherwise). Current implementation is geared towards creating homogeneous tree. With the changes you suggested, it is indeed smoother for manipulation of heterogeneous tree. Actually types-lxml was working in this way before 2023.3.28 version.

The reason was that using _Element can cause a loss of context in methods returning elements for homogeneous tree, such as .getparent() or .iter(). I'm aware of conflict of interest in homogeneous and heterogeneous tree users, and don't have a solution satisfying both with single code base, so opted for (arguably) the larger use case.

I'll try to think through it in these few days. Please feel free to suggest any idea alleviating the situation if you have any.

scanny commented 6 months ago

@abelcheung I think it's fair to say that the general case is that XML trees are heterogenous, at least as far as element-types ("tags") are concerned. Consider a garden-variety XML structure:

<person>
  <name>Bob</name>
  <street_address apt=false>123 Main Street</street_address>
  ...
</person>

as opposed to a homogenous:

<thing>
  <thing>abc</thing>
  <thing>
    def
    <thing>ghi</thing>
  </thing>
</thing>

I'm not sure what a "homogenous" tree would model.

While it is possible to use the same custom element class to model all element types (xsd:complexType definitions in XML Schema), this a degenerate case, even if it is more common (I've seen no evidence to suggest this btw). The general case is that you can use an arbitrary custom element class for any element, and the only thing those elements are guaranteed to have in common is isinstance(_, _Element).

So specializing these stubs to Self is an over-narrowing. Also note that it does not benefit these stubs. The methods described are perfectly able to do .insert() or whatever else with any _Element subtype, not just with whatever Self turns out to be. So you are adding a constraint that is not there in the library code.

Let me know what you decide. This would be a deal-breaker for me and I think relegate this project to special-case use.

scanny commented 6 months ago

@abelcheung Regarding the "loss of context" problem, I definitely know what you mean, I have certainly encountered this in my own code (I'm the author of python-docx and python-pptx, both of which rely heavily on lxml because DOCX and PPTX are XML file formats).

In my view, this is unavoidable and I remedy it by casting (there are one or two other ways).

As a contrived example, a run is a child of a paragraph, so I would need something like this when using .getparent():

def fn(r: CT_R):
    """CT_R is the custom element-class for a Run in python-docx.

    Its semantics are described by the XML Schema complex-type of the same name.
    """
    p = cast(CT_P, r.getparent())
    # ... do things with CT_P element, representing a paragraph 

This sort of thing occasionally happens in all the code bases I work in. There is nothing in the type system that can tell me the specific type that .getparent() is going to return to me in this case. It is based on knowledge of the XML Schema that cannot be embodied in the types.

This could be mitigated by adding a property CT_R.enclosing_p property like so, which at least encapsulates the cast:

class CT_R:
    @property
    def enclosing_p(self) -> CT_P:
        return cast(CT_P, self.getparent()

And I do have mechanisms for that sort of thing in the docx and pptx projects. But lxml cannot help me with this because it cannot embody the schema of the XML I'm working with.

I would see using Self as trying to get lxml to help some users with something it inherently is unable to do, and at the expense of disabling other users.

As far as an underlying principle is concerned, I'd say the governing principle is that the type annotations should be truthful (faithfully represent the facts), before they are convenient. There is some leeway for convenience, but promising more than the annotated code can deliver will generally exclude use-cases and therefore users.

abelcheung commented 6 months ago

When you provided this absurd example:

<thing>
  <thing>abc</thing>
  <thing>
    def
    <thing>ghi</thing>
  </thing>
</thing>

I immediately realized we have discrepancy in terms and understanding of situation. There are lots of opinions I want to rebuke, but it's a pure waste of time. Let me only rephrase what I intended to express originally. In my understand there are 3 major situation about element class lookup usage:

1) No lookup at all, just simple parsing of xml document into built-in lxml elements, or construction of tree from vanilla lxml element builder. 2) Single subclassing of element, so the whole tree consists of elements of a single PYTHON type. 3) Multiple subclassing so that different tags map to different python classes; or a mixture of custom subclasses and vanilla class. In this situation input argument and returned elements should use their common superclass.

I tried to combine (1) and (2), and you convinced me I haven't been paying attention to (3). The solution I come up so far, is to create 2 packages in every release, one being (1+2) as originally have been doing, and a new package of (1+3), so that you can pick the latter one.

If you have no objection to that, I'll finish that by next release, hopefully on June or July. It's definitely after May, as I have personal matters that need immediate attention during the whole month.

scanny commented 6 months ago

Ah, yes, I see, apologies, I misunderstood what you meant by "homogenous".

I agree with your case-analysis. I'm sure using a single custom element-class is a common case.

I'm okay with a separate package. As long as there is a package that does the needful it doesn't matter much to me what it is named.

One thought occurs to me that might be worth considering, as a way to avoid a "branched" library:

In the single custom-element-class case, one way to avoid casting every return value from methods like .insert() and .getparent() would be for the user to override whichever of those methods they use in the custom element class like so:

class CustomElementClass:
    def getparent(self) -> CustomElementClass:
        return super().getparent()

In my own experience, while lxml.etree offers a great many methods, in my code I invariably only use a handful of them. So an approach like this would not seem onerous to me for my own code.

For my purposes, I need to cast from etree._Element to a more specific type on almost every lxml._Element method call, and that doesn't bother me. Each case is different and relies on knowledge of the schema so there's generally no avoiding it in my case.

Anyway, something to consider :)

abelcheung commented 5 months ago

@scanny Do you have any idea for new tarball name? I'm going for something like types-lxml-hybrid or types-lxml-multi-subclass, if you have better name, there's still chance to suggest.

Next batches of changes will conflict with main branch so I'm merging sooner than normal.

scanny commented 5 months ago

@abelcheung I think the latter name types-lxml-multi-subclass is more descriptive, so I'd have a mild preference for that :)

abelcheung commented 3 months ago

@scanny Sorry for the long wait, package is available for testing now.

scanny commented 2 months ago

Thanks for this @abelcheung it seems to work great! Thanks for all your work on this :)

Just having a browse through, this is definitely a "rich" interface :) There's a lot more to it than I've had occasion to use in my projects :)