Rogdham / bigxml

Parse big xml files and streams with ease
https://bigxml.rogdham.net/
MIT License

Combining different dataclasses for nested XML-structures? #5

Closed sskagemo closed 1 year ago

sskagemo commented 1 year ago

Thank you for a great tool!

I am trying to write a more efficient way of extracting data from this file: https://github.com/Skatteetaten/saf-t/blob/master/Example%20Files/ExampleFile%20SAF-T%20Financial_999999999_20161125213512.xml

Starting on line 314 are the Transaction-elements I'm interested in. But each Transaction has a set of Line-elements, and I need to flatten the output: one record for each Line, with each record repeating all the details of the Transaction it is part of.

I tried making a dataclass for Transaction, and managed to get that working. But I haven't found a way of solving the Line-bit. I tried defining a Line-dataclass, and a Line-handler like this in the Transaction-dataclass:

@xml_handle_element("Line")
def handle_Line(self, node):
  lines = [item for item in node.iter_from(Line)]
  print(lines)

Instead of getting two or three Line-elements, I get 12, and most of the attributes have the default values from the dataclass (N/A or 0):

Line(RecordID='N/A', AccountID='N/A', ValueDate='N/A', SourceDocumentID='N/A', SupplierID='N/A', Description='N/A', DebitAmount=0.0, CreditAmount=0.0, ReferenceNumber='N/A', TaxType='N/A', TaxCode='N/A', TaxPercentage=0, TaxBase=0.0, TaxAmount=0.0)

I've read the documentation thoroughly, including trying to understand if there is some way to benefit from the "syntactic sugar"-part, but I have to admit that I don't really understand it ... sorry!

I am not very experienced in Python, so apologies for asking a stupid question here. For more context, I am more or less trying to achieve what is described in this post, but without the MS-tools: https://blogs.sap.com/2022/09/30/big-xml-file-flattening-with-excel-power-query-for-saf-t-and-other-requirements/

sskagemo commented 1 year ago

I've cut and pasted the code from the files I'm working in, and I've tried a lot of changes, so I can't be 100% sure that this is the code that actually ran, but I guess you get the idea ...


```python
from dataclasses import dataclass, asdict
from bigxml import Parser, xml_handle_element, xml_handle_text


@xml_handle_element("AuditFile", "GeneralLedgerEntries", "Journal", "Transaction")
@dataclass
class Transaction:
    TransactionID: str = "N/A"
    Period: str = "N/A"
    PeriodYear: str = "N/A"
    TransactionDate: str = "N/A"
    TransactionType: str = "N/A"
    Description: str = "N/A"
    SystemEntryDate: str = "N/A"
    GLPostingDate: str = "N/A"
    lines: list = None

    @xml_handle_element("TransactionID")
    def handle_TransactionID(self, node):
        self.TransactionID = node.text

    @xml_handle_element("Period")
    def handle_Period(self, node):
        self.Period = node.text

    @xml_handle_element("PeriodYear")
    def handle_PeriodYear(self, node):
        self.PeriodYear = node.text

    @xml_handle_element("TransactionDate")
    def handle_TransactionDate(self, node):
        self.TransactionDate = node.text

    @xml_handle_element("TransactionType")
    def handle_TransactionType(self, node):
        self.TransactionType = node.text

    @xml_handle_element("Description")
    def handle_Description(self, node):
        self.Description = node.text

    @xml_handle_element("SystemEntryDate")
    def handle_SystemEntryDate(self, node):
        self.SystemEntryDate = node.text

    @xml_handle_element("GLPostingDate")
    def handle_GLPostingDate(self, node):
        self.GLPostingDate = node.text

    @xml_handle_element("Line")
    def handle_Line(self, node):
        line = [item for item in node.iter_from(Line)]
        self.lines = [line] if self.lines == None else self.lines.append(line)


@xml_handle_element("Line")
@dataclass
class Line:  # Ignores the Analysis-elements for now
    RecordID: str = "N/A"
    AccountID: str = "N/A"
    ValueDate: str = "N/A"
    SourceDocumentID: str = "N/A"
    SupplierID: str = "N/A"
    Description: str = "N/A"
    DebitAmount: float = 0.0
    CreditAmount: float = 0.0
    ReferenceNumber: str = "N/A"
    TaxType: str = "N/A"  # Assumption that there is only one TaxInformation-element pr line
    TaxCode: str = "N/A"
    TaxPercentage: int = 0
    TaxBase: float = 0.0
    TaxAmount: float = 0.0

    @xml_handle_element("RecordID")
    def handle_RecordID(self, node):
        self.RecordID = node.text

    @xml_handle_element("AccountID")
    def handle_AccountID(self, node):
        self.AccountID = node.text

    @xml_handle_element("ValueDate")
    def handle_ValueDate(self, node):
        self.ValueDate = node.text

    @xml_handle_element("SourceDocumentID")
    def handle_SourceDocumentID(self, node):
        self.SourceDocumentID = node.text

    @xml_handle_element("SupplierID")
    def handle_SupplierID(self, node):
        self.SupplierID = node.text

    @xml_handle_element("Description")
    def handle_Description(self, node):
        self.Description = node.text

    @xml_handle_element("DebitAmount")
    def handle_DebitAmount(self, node):
        self.DebitAmount = float(node.text)  # Will automatically go one level deeper to get the value

    @xml_handle_element("CreditAmount")
    def handle_CreditAmount(self, node):
        self.CreditAmount = float(node.text)  # Will automatically go one level deeper to get the value

    @xml_handle_element("ReferenceNumber")
    def handle_ReferenceNumber(self, node):
        self.ReferenceNumber = node.text

    @xml_handle_element("TaxInformation")
    def handle_TaxInformation(self, node):
        yield from node.iter_from(
            self.handle_TaxType,
            self.handle_TaxCode,
            self.handle_TaxPercentage,
            self.handle_TaxBase,
            self.handle_TaxAmount,
        )

    @xml_handle_element("TaxType")
    def handle_TaxType(self, node):
        self.TaxType = node.text

    @xml_handle_element("TaxCode")
    def handle_TaxCode(self, node):
        self.TaxCode = node.text

    @xml_handle_element("TaxPercentage")
    def handle_TaxPercentage(self, node):
        self.TaxPercentage = int(node.text)

    @xml_handle_element("TaxBase")
    def handle_TaxBase(self, node):
        self.TaxBase = float(node.text)

    @xml_handle_element("TaxAmount")
    def handle_TaxAmount(self, node):
        self.TaxAmount = float(node.text)


if __name__ == '__main__':
    with open("../../testdata/ExampleFile_SAF-T_Financial_888888888_20180228235959.xml", "rb") as f:
        for item in Parser(f).iter_from(Transaction):
            print(item)
            break  # To avoid too much output ...
```
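A side note on the `self.lines = ... else self.lines.append(line)` expression in `handle_Line`: independently of the parsing question, `list.append` returns `None`, so from the second call on that assignment would overwrite the collected list with `None`. The sketch below isolates this with two hypothetical classes, `Holder` and `FixedHolder`, which are illustrations only and not part of the code above or of bigxml.

```python
from dataclasses import dataclass, field


@dataclass
class Holder:
    """Hypothetical stand-in for Transaction, only to show the list handling."""

    lines: list = None

    def add_buggy(self, line):
        # list.append() returns None, so on the second call this assigns
        # None to self.lines and the collected items are lost.
        self.lines = [line] if self.lines is None else self.lines.append(line)


@dataclass
class FixedHolder:
    """Hypothetical variant where each instance starts with its own empty list."""

    lines: list = field(default_factory=list)

    def add(self, line):
        # Mutate the list in place; do not reassign the result of append().
        self.lines.append(line)


buggy = Holder()
buggy.add_buggy("a")
buggy.add_buggy("b")
print(buggy.lines)  # None

fixed = FixedHolder()
fixed.add("a")
fixed.add("b")
print(fixed.lines)  # ['a', 'b']
```

Using `field(default_factory=list)` also avoids the `None` check entirely, since a plain `lines: list = []` default is rejected by `dataclass` as a mutable default.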

Rogdham commented 1 year ago

Hello @sskagemo, glad you took a look into the library and its documentation! I agree your usecase is quite difficult to handle right now, but it will be taken into consideration for a future change of the library.

For now you will need to do the following trick:

You will find below your code modified so that you have a better idea of what I mean. I also took this opportunity to improve the following points:


```python
from dataclasses import dataclass, asdict, field
from bigxml import Parser, xml_handle_element, xml_handle_text


@xml_handle_element("Line")
@dataclass
class Line:  # Ignores the Analysis-elements for now
    RecordID: str = "N/A"
    AccountID: str = "N/A"
    ValueDate: str = "N/A"
    SourceDocumentID: str = "N/A"
    SupplierID: str = "N/A"
    Description: str = "N/A"
    DebitAmount: float = 0.0
    CreditAmount: float = 0.0
    ReferenceNumber: str = "N/A"
    TaxType: str = "N/A"  # Assumption that there is only one TaxInformation-element pr line
    TaxCode: str = "N/A"
    TaxPercentage: int = 0
    TaxBase: float = 0.0
    TaxAmount: float = 0.0

    @xml_handle_element("RecordID")
    def handle_RecordID(self, node):
        self.RecordID = node.text

    @xml_handle_element("AccountID")
    def handle_AccountID(self, node):
        self.AccountID = node.text

    @xml_handle_element("ValueDate")
    def handle_ValueDate(self, node):
        self.ValueDate = node.text

    @xml_handle_element("SourceDocumentID")
    def handle_SourceDocumentID(self, node):
        self.SourceDocumentID = node.text

    @xml_handle_element("SupplierID")
    def handle_SupplierID(self, node):
        self.SupplierID = node.text

    @xml_handle_element("Description")
    def handle_Description(self, node):
        self.Description = node.text

    @xml_handle_element("DebitAmount")
    def handle_DebitAmount(self, node):
        self.DebitAmount = float(node.text)  # Will automatically go one level deeper to get the value

    @xml_handle_element("CreditAmount")
    def handle_CreditAmount(self, node):
        self.CreditAmount = float(node.text)  # Will automatically go one level deeper to get the value

    @xml_handle_element("ReferenceNumber")
    def handle_ReferenceNumber(self, node):
        self.ReferenceNumber = node.text

    @xml_handle_element("TaxInformation", "TaxType")
    def handle_TaxType(self, node):
        self.TaxType = node.text

    @xml_handle_element("TaxInformation", "TaxCode")
    def handle_TaxCode(self, node):
        self.TaxCode = node.text

    @xml_handle_element("TaxInformation", "TaxPercentage")
    def handle_TaxPercentage(self, node):
        self.TaxPercentage = int(node.text)

    @xml_handle_element("TaxInformation", "TaxBase")
    def handle_TaxBase(self, node):
        self.TaxBase = float(node.text)

    @xml_handle_element("TaxInformation", "TaxAmount")
    def handle_TaxAmount(self, node):
        self.TaxAmount = float(node.text)


@xml_handle_element("AuditFile", "GeneralLedgerEntries", "Journal", "Transaction")
@dataclass
class Transaction:
    TransactionID: str = "N/A"
    Period: str = "N/A"
    PeriodYear: str = "N/A"
    TransactionDate: str = "N/A"
    TransactionType: str = "N/A"
    Description: str = "N/A"
    SystemEntryDate: str = "N/A"
    GLPostingDate: str = "N/A"
    lines: list = field(default_factory=list)

    @xml_handle_element("TransactionID")
    def handle_TransactionID(self, node):
        self.TransactionID = node.text

    @xml_handle_element("Period")
    def handle_Period(self, node):
        self.Period = node.text

    @xml_handle_element("PeriodYear")
    def handle_PeriodYear(self, node):
        self.PeriodYear = node.text

    @xml_handle_element("TransactionDate")
    def handle_TransactionDate(self, node):
        self.TransactionDate = node.text

    @xml_handle_element("TransactionType")
    def handle_TransactionType(self, node):
        self.TransactionType = node.text

    @xml_handle_element("Description")
    def handle_Description(self, node):
        self.Description = node.text

    @xml_handle_element("SystemEntryDate")
    def handle_SystemEntryDate(self, node):
        self.SystemEntryDate = node.text

    @xml_handle_element("GLPostingDate")
    def handle_GLPostingDate(self, node):
        self.GLPostingDate = node.text

    handle_line = Line

    def xml_handler(self, iterator):
        for item in iterator:
            if isinstance(item, Line):
                self.lines.append(item)
            else:
                raise NotImplementedError  # should not happen
        yield self


if __name__ == '__main__':
    with open("../../testdata/ExampleFile_SAF-T_Financial_888888888_20180228235959.xml", "rb") as f:
        for item in Parser(f).iter_from(Transaction):
            print(item)
            break  # To avoid too much output ...
```
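Once each `Transaction` carries its `Line` objects in `lines`, the flattening asked for in the original question (one record per Line, with the Transaction details repeated) could be done in plain Python with `dataclasses.asdict`. This is only a sketch: it uses cut-down versions of the two dataclasses, and the `flatten` helper and the `Line_` key prefix are hypothetical names chosen here, not something provided by bigxml.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class Line:
    # Cut-down version of the Line dataclass, for illustration only
    RecordID: str = "N/A"
    DebitAmount: float = 0.0


@dataclass
class Transaction:
    # Cut-down version of the Transaction dataclass, for illustration only
    TransactionID: str = "N/A"
    Description: str = "N/A"
    lines: list = field(default_factory=list)


def flatten(transaction):
    """Yield one dict per Line, repeating the Transaction fields in each."""
    header = asdict(transaction)  # asdict() also converts the nested Lines
    line_dicts = header.pop("lines")
    for line in line_dicts:
        record = dict(header)
        # Prefix the Line fields: both dataclasses in the full code have a
        # Description field, and the prefix keeps them from colliding.
        record.update({f"Line_{key}": value for key, value in line.items()})
        yield record


transaction = Transaction("T1", "Invoice", [Line("1", 100.0), Line("2", 0.0)])
records = list(flatten(transaction))
for record in records:
    print(record)
```

With the full code, the same helper could be applied to each item from `Parser(f).iter_from(Transaction)`, and the resulting dicts passed to something like `csv.DictWriter`.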

Tell me if that works for you!

sskagemo commented 1 year ago

It worked! Thank you very much for helping me! And maybe most importantly, for not making me feel totally useless, by your kind comment:

I agree your usecase is quite difficult to handle right now,

:-)

Rogdham commented 1 year ago

Very good! I'm closing this issue now since it seems to be solved, but I will try to remember to ping you whenever an easier way to do it is released.