Hochfrequenz / ebdamame

Python library to scrape .docx files with "Entscheidungsbaumdiagramm" tables into a truly machine-readable structure
GNU General Public License v3.0

ebddocx2table skips 2 pages, E_0406 #133

Open zmsMarc opened 1 year ago

zmsMarc commented 1 year ago

When using ebddocx2table to convert the E_0406 EBD to JSON, it skips exactly two pages of the input file (104 and 105). As a result, steps 653 to 676 are missing from the output.

File used: Entscheidungsbaum-Diagramme und Codelisten - informatorische Lesefassung 3.5

https://www.edi-energy.de/index.php?id=38&tx_bdew_bdew%5Buid%5D=2110&tx_bdew_bdew%5Baction%5D=download&tx_bdew_bdew%5Bcontroller%5D=Dokument&cHash=cb73398426723e4ee1f49cc8a71eb3a6

Settings in code:


converter = DocxTableConverter(
    docx_tables,
    ebd_key="E_0406",
    chapter="GPKE",
    sub_chapter="6.7: AD: Netznutzungsabrechnung",
)

hf-kklein commented 11 months ago

@zmsMarc thanks for reporting the issue and sorry for the long wait so far; I somehow didn't notice the ticket.

@OLILHR as discussed: please first write a unit test that reproduces the issue; then we should work out what exactly causes the scraper to miss those pages. Several similar tests are available already, e.g. here. You don't need to write a full equality check for this very long EBD; something like "check that those specific steps are contained in the result" is enough.
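
A minimal sketch of such a "are those steps contained" check, written without the real ebdamame API. `missing_steps` is a hypothetical helper, and it assumes the step numbers have already been read from the converted table as strings; how to obtain them from the actual result object is left open here.

```python
from typing import Iterable, List


def missing_steps(found_step_numbers: Iterable[str], expected_step_numbers: Iterable[str]) -> List[str]:
    """Return the expected step numbers that do not occur in the scraped result."""
    found = {step.strip() for step in found_step_numbers}
    return [step for step in expected_step_numbers if step.strip() not in found]


# Hypothetical data: steps as they might come out of a converted table
# in which the rows from the skipped pages are absent.
scraped = ["650", "652", "677"]
print(missing_steps(scraped, ["652", "653", "676"]))
```

In a unit test, the assertion would simply be that the returned list is empty for the steps reported as missing in this issue.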

marcsst commented 8 months ago

Hey @hf-kklein,

Just a quick update: with the latest release (Entscheidungsbaum-Diagramme und Codelisten - informatorische Lesefassung 3.5 Konsolidierte Lesefassung mit Fehlerkorrekturen Stand: 27.03.2024) and ebdamame v0.1.1, it now skips all checks between 427 and 818 for E_0406 and E_0407.

hf-kklein commented 7 months ago

I don't know how deep you are in the code @marcsst, but internally ebdamame (formerly known as ebddocx2table) relies on python-docx, which parses the Office/OpenXML of the .docx files published by edi@energy. This Office/OpenXML is usually super bloated and harder to read than any XML I have ever seen before. I still wonder how any program is able to display anything meaningful from it 😅

For the EBD documents specifically, each EBD description that spans more than 1 MS Word page usually consists of multiple OpenXML tables. This means: Although e.g. E_0406 looks like 1 giant MS Word table spanning multiple pages, there are >1 OpenXML tables <tbl> (of which 22 are detected as of now). Sometimes it's 1 tbl per page, sometimes it's different. We need to collect them all and then put them together into one large EbdTable in the end.

This happens roughly here: https://github.com/Hochfrequenz/ebdamame/blob/672cd495bea22209d47a99829edee27bc100d3fb/src/ebdamame/__init__.py#L118
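
To picture the "collect them all and put them together" idea, here is a minimal sketch. It is not the ebdamame implementation: `merge_table_fragments` is a hypothetical helper, rows are modelled as plain lists of cell texts, and it assumes that each per-page fragment may repeat the header row.

```python
from typing import List, Sequence

Row = List[str]


def merge_table_fragments(fragments: Sequence[Sequence[Row]], header: Row) -> List[Row]:
    """Concatenate per-page table fragments into one list of body rows."""
    merged: List[Row] = []
    for fragment in fragments:
        for row in fragment:
            if row == header:
                continue  # assumption: fragments may repeat the header row
            merged.append(row)
    return merged


# Hypothetical data: two fragments that look like one table in Word.
header = ["Nr.", "Prüfende Rolle", "Ergebnis"]
page_1 = [header, ["420", "...", "ja"]]
page_2 = [header, ["430", "...", "nein"]]
print(merge_table_fragments([page_1, page_2], header))
```

If a fragment is never handed to a function like this, for example because the parser already dropped it, its rows silently disappear from the merged result, which matches the symptom described in this issue.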

But somehow the pages with the rows >= 430 are not included. I'm afraid the problem already lies in python-docx, but I'm not sure whether it's a bug there or whether the .docx is simply malformed.

The OpenXML tables contain a header and a body. If we check the XML of the table that starts at position 420, its header looks like this:

<tbl>
    <tblPr>
        <tblStyle w:val="Tabellenraster" />
        <tblW w:w="14265" w:type="dxa" />
        <tblLayout w:type="fixed" />
        <tblLook w:val="04A0" w:firstRow="1" w:lastRow="0" w:firstColumn="1" w:lastColumn="0" w:noHBand="0" w:noVBand="1" />
    </tblPr>
    <tblGrid>
        <gridCol w:w="703" />
        <gridCol w:w="6058" />
        <gridCol w:w="1581" />
        <gridCol w:w="854" />
        <gridCol w:w="5069" />
    </tblGrid>
    <!-- 420 Ist die Artikel-ID für diesen Rechnungstypen für diesen Positionszeitraum zulässig? follows inside this tbl -->

Note that there are 5 columns (<gridCol />) defined. This is what we expect and also what we see when we open the file in Word: [screenshot]

But if we go to the next page (the first page from which the <tbl>s are no longer included in the result), the table still looks the same in Word and has 5 columns ([screenshot]), yet there are way too many columns defined in the XML:

<tbl>
    <tblPr>
        <tblStyle w:val="Tabellenraster" />
        <tblW w:w="14327" w:type="dxa" />
        <tblLayout w:type="fixed" />
        <tblLook w:val="04A0" w:firstRow="1" w:lastRow="0" w:firstColumn="1" w:lastColumn="0" w:noHBand="0" w:noVBand="1" />
    </tblPr>
    <tblGrid>
        <gridCol w:w="704" />
        <gridCol w:w="6035" />
        <gridCol w:w="21" />
        <gridCol w:w="7" />
        <gridCol w:w="9" />
        <gridCol w:w="17" />
        <gridCol w:w="1399" />
        <gridCol w:w="126" />
        <gridCol w:w="26" />
        <gridCol w:w="10" />
        <gridCol w:w="819" />
        <gridCol w:w="41" />
        <gridCol w:w="7" />
        <gridCol w:w="5042" />
        <gridCol w:w="47" />
        <gridCol w:w="17" />
    </tblGrid>
    <!-- 430 Gibt es mehr als eine Position mit dieser Artikel-ID? follows somewhere here -->

python-docx refuses to parse this, probably because there are more <gridCol/>s than cells (<tc>) inside a single <tr> row.
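
To find such tables without going through python-docx, the raw XML can be inspected directly. The following is a rough diagnostic sketch using only the standard library; `grid_mismatches` is a hypothetical helper, not part of ebdamame or python-docx. It only counts direct <w:tc> children per row and ignores `w:gridSpan`, so a mismatch is a hint rather than proof of a malformed table.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

# WordprocessingML main namespace (the "w:" prefix in document.xml)
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = f"{{{W_NS}}}"


def grid_mismatches(document_xml: str) -> List[Tuple[int, int, List[int]]]:
    """List (table index, number of gridCols, cells per row) for suspicious tables."""
    root = ET.fromstring(document_xml)
    suspicious = []
    for index, tbl in enumerate(root.iter(f"{W}tbl")):
        grid = tbl.find(f"{W}tblGrid")
        n_grid_cols = len(grid.findall(f"{W}gridCol")) if grid is not None else 0
        cells_per_row = [len(tr.findall(f"{W}tc")) for tr in tbl.findall(f"{W}tr")]
        # Flag the table if any row has a different cell count than the grid.
        if any(n != n_grid_cols for n in cells_per_row):
            suspicious.append((index, n_grid_cols, cells_per_row))
    return suspicious


# Tiny hand-written example: the second table declares 3 gridCols but has 2 cells.
sample = (
    f'<w:document xmlns:w="{W_NS}"><w:body>'
    "<w:tbl><w:tblGrid><w:gridCol/><w:gridCol/></w:tblGrid>"
    "<w:tr><w:tc/><w:tc/></w:tr></w:tbl>"
    "<w:tbl><w:tblGrid><w:gridCol/><w:gridCol/><w:gridCol/></w:tblGrid>"
    "<w:tr><w:tc/><w:tc/></w:tr></w:tbl>"
    "</w:body></w:document>"
)
print(grid_mismatches(sample))
```

For a real file, the XML can be read from the archive member `word/document.xml`, since a .docx is a zip container; the result then shows which <tbl> indices deserve a closer look.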

There is probably a workaround, but so far I haven't been able to find it. The problem is reproducible in a simple test: https://github.com/Hochfrequenz/ebdamame/pull/186