dapper91 / pydantic-xml

python xml for humans
https://pydantic-xml.readthedocs.io
The Unlicense

wrapped? #175

CholoTook closed this issue 3 months ago

CholoTook commented 3 months ago

Hi, thank you for the great product and the good documentation. After a lot of reading, I'm finally starting to get to grips with the code ;-)

The examples pass things like default_factory=list to element(), and for some reason I'm having trouble getting lists of elements nested under wrapper ("placeholder") elements to work.

I'm wondering what I'm missing, and so I'm looking for docs about the default_factory...

https://pydantic-xml.readthedocs.io/en/latest/search.html?q=default_factory&check_keywords=yes&area=default

^^ A bit beyond me...

Here is an example of the XML I'd like to model:

<?xml version="1.0" encoding="UTF-8"?><?xml-stylesheet type="text/xsl" href="/3.2/style/exchange.xsl"?>
<ops:world-patent-data xmlns="http://www.epo.org/exchange" xmlns:ops="http://ops.epo.org" xmlns:xlink="http://www.w3.org/1999/xlink">
    <exchange-documents>
        <exchange-document system="ops.epo.org" family-id="90141366" country="WO" doc-number="2024054343" kind="A2">
            <bibliographic-data>
                <publication-reference>
                    <document-id document-id-type="docdb">
                        <country>WO</country>
                        <doc-number>2024054343</doc-number>
                        <kind>A2</kind>
                        <date>20240314</date>
                    </document-id>
                    <document-id document-id-type="epodoc">
                        <doc-number>WO2024054343</doc-number>
                        <date>20240314</date>
                    </document-id>
                </publication-reference>
                <classifications-ipcr>
                    <classification-ipcr sequence="1">
                        <text>G06Q  20/    38            A I                    </text>
                    </classification-ipcr>
                    <classification-ipcr sequence="2">
                        <text>H04L   9/    36            A I                    </text>
                    </classification-ipcr>
                </classifications-ipcr>
                <patent-classifications>
                    <patent-classification sequence="1">
                        <classification-scheme office="EP" scheme="CPCI"/>
                        <section>G</section>
                        <class>06</class>
                        <subclass>Q</subclass>
                        <main-group>20</main-group>
                        <subgroup>065</subgroup>
                        <classification-value>I</classification-value>
                        <generating-office>US</generating-office>
                    </patent-classification>
                </patent-classifications>
                <application-reference doc-id="607944040">
                    <document-id document-id-type="docdb">
                        <country>US</country>
                        <doc-number>2023030506</doc-number>
                        <kind>W</kind>
                    </document-id>
                    <document-id document-id-type="epodoc">
                        <doc-number>WO2023US30506</doc-number>
                        <date>20230817</date>
                    </document-id>
                    <document-id document-id-type="original">
                        <doc-number>US2023/030506</doc-number>
                    </document-id>
                </application-reference>
                <priority-claims>
                    <priority-claim sequence="1" kind="national">
                        <document-id document-id-type="epodoc">
                            <doc-number>US202217941850</doc-number>
                            <date>20220909</date>
                        </document-id>
                        <document-id document-id-type="original">
                            <doc-number>17/941,850</doc-number>
                        </document-id>
                    </priority-claim>
                </priority-claims>
                <parties>
                    <applicants>
                        <applicant sequence="1" data-format="epodoc">
                            <applicant-name>
                                <name>PAYPAL INC [US]</name>
                            </applicant-name>
                        </applicant>
                        <applicant sequence="1" data-format="original">
                            <applicant-name>
                                <name>PAYPAL, INC</name>
                            </applicant-name>
                        </applicant>
                    </applicants>
                    <inventors>
                        <inventor sequence="1" data-format="epodoc">
                            <inventor-name>
                                <name>RIVA BEN [US]</name>
                            </inventor-name>
                        </inventor>
                        <inventor sequence="2" data-format="epodoc">
                            <inventor-name>
                                <name> PURANDARE SUJAY VIJAY [US]</name>
                            </inventor-name>
                        </inventor>
                        <inventor sequence="1" data-format="original">
                            <inventor-name>
                                <name>RIVA, Ben, </name>
                            </inventor-name>
                        </inventor>
                        <inventor sequence="2" data-format="original">
                            <inventor-name>
                                <name>PURANDARE, Sujay Vijay</name>
                            </inventor-name>
                        </inventor>
                    </inventors>
                </parties>
                <invention-title lang="fr">PROCÉDÉS ET SYSTÈMES POUR FACILITER LE PARTAGE DE JETONS DANS DES TRANSACTIONS DE CHAÎNE DE BLOCS</invention-title>
                <invention-title lang="en">METHODS AND SYSTEMS FOR FACILITATING SHARING OF TOKENS IN BLOCKCHAIN TRANSACTIONS</invention-title>
            </bibliographic-data>
            <abstract lang="en">
                <p>A framework ...</p>
            </abstract>
            <abstract lang="fr">
                <p>Un cadre ...</p>
            </abstract>
        </exchange-document>
    </exchange-documents>
</ops:world-patent-data>

Here is what I have so far (still not working, but getting closer...):

from pydantic import field_validator
from pydantic_xml import BaseXmlModel, attr, element, wrapped
from typing import List, Dict, Literal, Optional

# For debugging
from lxml.etree import _Element as Element

NSMAP = {
    # Don't forget the default namespace!
    # You wouldn't want to spend 12 hours debugging this, would you?
    "": "http://www.epo.org/exchange",
    "ops": "http://ops.epo.org",
    "xlink": "http://www.w3.org/1999/xlink",
}

class DocumentId(BaseXmlModel, nsmap=NSMAP):
    """The document-id element. This is used in both search results and published data.

    TODO:
    * If type is 'docdb', there is country, num, kind.
    * If type is 'epodoc', there is no country, but there is a date.
    * If type is 'original', there is only a num (with a specific format to validate).

    """

    document_id_type: Literal["docdb", "epodoc", "original"] = attr(
        name="document-id-type"
    )
    country: Optional[str] = element(default=None)
    doc_number: str = element(tag="doc-number")
    kind: Optional[Literal["A", "A1", "A2", "B1", "W"]] = element(default=None)
    date: Optional[str] = element(default=None)

    @field_validator("country")
    def validate_country(cls, value: str) -> str:
        if len(value) > 2:
            raise ValueError("country must be of 2 characters")
        return value

# TODO: Created for both search results and published data...
class PublicationReference(BaseXmlModel):
    system: Optional[str] = attr(default=None)
    family_id: Optional[str] = attr(name="family-id", default=None)
    document_id: DocumentId = element(tag="document-id")

class SearchResult(BaseXmlModel, ns="ops", nsmap=NSMAP):
    publication_reference: List[PublicationReference] = element(
        tag="publication-reference"
    )

class BiblioSearch(BaseXmlModel, ns="ops", nsmap=NSMAP):
    total_results_count: int = attr(name="total-result-count")
    # TODO: Query text?
    query: Dict[str, str] = element()
    range: Dict[str, int] = element()
    search_result: SearchResult = element(tag="search-result")

class WorldPatentData(BaseXmlModel, tag="world-patent-data", ns="ops", nsmap=NSMAP):
    biblio_search: BiblioSearch = element(tag="biblio-search")

    @property
    def publications(self):
        return self.biblio_search.search_result.publication_reference

#
# THIS IS THE published_datas/publication model...
#

class ClassificationIpcr(BaseXmlModel, nsmap=NSMAP):
    sequence: int = attr()
    text: str = element()

class ClassificationsIpcr(BaseXmlModel, nsmap=NSMAP):
    classification_ipcr: List[ClassificationIpcr] = element(tag="classification-ipcr")

class PatentClassification(BaseXmlModel, nsmap=NSMAP):
    sequence: int = attr()
    classification_scheme: Dict[str, str] = element(tag="classification-scheme")
    section: str = element()
    class_: str = element(tag="class")
    subclass: str = element(tag="subclass")
    main_group: str = element(tag="main-group")
    subgroup: str = element(tag="subgroup")
    classification_value: str = element(tag="classification-value")
    generating_office: str = element(tag="generating-office")

class PatentClassifications(BaseXmlModel, nsmap=NSMAP):
    patent_classification: List[PatentClassification] = element(
        tag="patent-classification"
    )

class Name(BaseXmlModel, nsmap=NSMAP):
    name: str = element()

class TextWithLang(BaseXmlModel, nsmap=NSMAP):
    lang: str = attr()
    text: str = element()

class Applicant(BaseXmlModel, nsmap=NSMAP, arbitrary_types_allowed=True):
    # applicant: Element = element(exclude=True)

    sequence: str = attr(name="sequence")
    # data_format: str = attr(name="data-format")
    # applicant_name: Name = element(tag="applicant-name")

class Inventor(BaseXmlModel, nsmap=NSMAP):
    sequence: str = attr()
    data_format: str = attr(name="data-format")
    inventor_name: Name = element(tag="inventor-name")

class Parties(BaseXmlModel, nsmap=NSMAP):
    applicants: List[Applicant] = element()
    # inventors: List[Inventor] = element()

class ApplicationReference(BaseXmlModel, nsmap=NSMAP):
    doc_id: str = attr(name="doc-id")
    document_id: List[DocumentId] = element(tag="document-id")

class PriorityClaim(BaseXmlModel, nsmap=NSMAP):
    sequence: str = attr()
    kind: str = attr()
    document_id: List[DocumentId] = element(tag="document-id")

class BibliographicData(BaseXmlModel, nsmap=NSMAP):
    publication_reference: List[PublicationReference] = element(
        tag="publication-reference"
    )
    classifications_ipcr: ClassificationsIpcr = element(tag="classifications-ipcr")

    patent_classifications: PatentClassifications = element(
        tag="patent-classifications"
    )

    application_reference: ApplicationReference = element(tag="application-reference")
    priority_claims: List[PriorityClaim] = wrapped(
        "priority-claims", element(tag="priority-claim")
    )
    parties: Parties = element()
    # invention_title: List[TextWithLang] = element(tag="invention-title")
    # abstract: List[TextWithLang] = element(tag="abstract")

class ExchangeDocument(BaseXmlModel, nsmap=NSMAP):
    system: str = attr()
    family_id: str = attr(name="family-id")
    country: str = attr()
    doc_number: str = attr(name="doc-number")
    kind: str = attr()
    bibliographic_data: BibliographicData = element(tag="bibliographic-data")

class ExchangeDocuments(BaseXmlModel, ns="", nsmap=NSMAP):
    exchange_document: List[ExchangeDocument] = element(tag="exchange-document")

# TODO: The base model should allow different types of documents
class WorldPatentDataGAH(BaseXmlModel, tag="world-patent-data", ns="ops", nsmap=NSMAP):
    exchange_documents: ExchangeDocuments = element(tag="exchange-documents")

CholoTook commented 3 months ago

I kept banging my head on this and finally solved all my problems :-)

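What confused me about wrapped was the 'implied' element matching: the first argument to wrapped() is the tag of the wrapper element, and an element() with no explicit tag is matched using the pydantic field name.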

e.g. I was trying to write things like:

class Applicant(BaseXmlModel, nsmap=NSMAP):
    sequence: str = attr(name="sequence")
    data_format: str = attr(name="data-format")
    applicant_name: Name = element(tag="applicant-name")

class Parties(BaseXmlModel, nsmap=NSMAP):
    applicants: List[Applicant] = wrapped("applicant", element())

but the correct code is:

class Applicant(BaseXmlModel, nsmap=NSMAP):
    sequence: str = attr(name="sequence")
    data_format: str = attr(name="data-format")
    applicant_name: Name = element(tag="applicant-name")

class Parties(BaseXmlModel, nsmap=NSMAP):
    applicant: List[Applicant] = wrapped("applicants", element())

Many thanks again for the great project!