Open zmsMarc opened 1 year ago
@zmsMarc thanks for reporting the issue and sorry for the long wait so far; I somehow didn't notice the ticket.
@OLILHR as discussed: please first write a unit test to reproduce the issue; then we should understand what exactly causes the scraper to miss those pages. There are several similar tests available already, e.g. here. You don't need to write a full equality check for this overly long EBD; something like "check if those specific steps are contained in the result" is better.
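A containment check like the one suggested above could look roughly like this. This is only a sketch: `missing_step_numbers` is a hypothetical helper, the step numbers are made-up sample data, and how the step numbers are pulled out of the actual scraper result is deliberately left out because it depends on the ebdamame API.

```python
def missing_step_numbers(expected, actual):
    """Return the expected step numbers that are absent from the parsed result."""
    # Step numbers are handled as strings here (an assumption); sort them numerically.
    return sorted(set(expected) - set(actual), key=int)


# Hypothetical step numbers as they might come out of the scraper.
parsed_steps = ["410", "420", "427", "818", "820"]
# Steps we want to be sure about; no full equality check on the whole EBD.
expected_steps = ["420", "427", "430", "818"]

missing = missing_step_numbers(expected_steps, parsed_steps)
print(missing)  # a non-empty list means the scraper skipped something
```

In a real test the last line would be an `assert not missing` so that the failure message names the skipped steps.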
Hey @hf-kklein,
just wanted to update that with the latest release (Entscheidungsbaum-Diagramme und Codelisten - informatorische Lesefassung 3.5 Konsolidierte Lesefassung mit Fehlerkorrekturen Stand: 27.03.2024) and ebdamame v.0.1.1 it now skips all checks between 427 and 818 for E_0406 and E_0407.
I don't know how deep you're into the code @marcsst, but internally ebdamame (formerly known as ebddocx2table) relies on python-docx, which parses the Office/OpenXML of the .docx files published by edi@energy. This Office/OpenXML is usually super bloated and worse to read than any XML I ever saw before. I still wonder how any program is able to display anything meaningful from it 😅
For the EBD documents specifically, each EBD description that spans more than 1 MS Word page usually consists of multiple OpenXML tables. This means: although e.g. E_0406 looks like 1 giant MS Word table spanning multiple pages, there are >1 OpenXML tables <tbl> (of which 22 are detected as of now). Sometimes it's 1 tbl per page, sometimes it's different. We need to collect them all and then put them together into one large EbdTable in the end.
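To illustrate the "collect them all and put them together" idea, here is a minimal sketch that uses only the Python standard library instead of python-docx. It is not how ebdamame does it; `cell_text`, `rows_of`, `merge_tables` and the tiny sample body are all hypothetical and only show the principle of concatenating the rows of several `<w:tbl>` fragments.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by .docx documents.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def cell_text(tc):
    """Concatenate all text runs inside one table cell."""
    return "".join(t.text or "" for t in tc.iter("{%s}t" % W))


def rows_of(tbl):
    """Return the rows of one <w:tbl> as lists of cell texts."""
    return [[cell_text(tc) for tc in tr.findall("w:tc", NS)]
            for tr in tbl.findall("w:tr", NS)]


def merge_tables(body):
    """Concatenate the rows of all top-level tables, in document order."""
    merged = []
    for tbl in body.findall("w:tbl", NS):
        merged.extend(rows_of(tbl))
    return merged


def _tbl(*rows):
    """Build a tiny <w:tbl> fragment for the demo (sample data only)."""
    cell = "<w:tc><w:p><w:r><w:t>%s</w:t></w:r></w:p></w:tc>"
    trs = "".join("<w:tr>" + "".join(cell % c for c in row) + "</w:tr>"
                  for row in rows)
    return "<w:tbl>" + trs + "</w:tbl>"


# Two table fragments separated by a paragraph, like one table per page.
sample = ('<w:body xmlns:w="%s">' % W
          + _tbl(("420", "first page")) + "<w:p/>"
          + _tbl(("430", "second page")) + "</w:body>")

merged_rows = merge_tables(ET.fromstring(sample))
print(merged_rows)
```

The real documents also repeat header rows and contain unrelated tables between EBDs, which this sketch ignores on purpose.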
This happens roughly here: https://github.com/Hochfrequenz/ebdamame/blob/672cd495bea22209d47a99829edee27bc100d3fb/src/ebdamame/__init__.py#L118
But somehow, the pages with the rows >=430 are not included and I'm afraid the problem is in python-docx already but I'm not sure if it's a bug or if the docx is just malformed.
The OpenXML tables contain a header and a body. If we check the XML of the table that starts at position 420, its header looks like this:
<tbl>
<tblPr>
<tblStyle w:val="Tabellenraster" />
<tblW w:w="14265" w:type="dxa" />
<tblLayout w:type="fixed" />
<tblLook w:val="04A0" w:firstRow="1" w:lastRow="0" w:firstColumn="1" w:lastColumn="0" w:noHBand="0" w:noVBand="1" />
</tblPr>
<tblGrid>
<gridCol w:w="703" />
<gridCol w:w="6058" />
<gridCol w:w="1581" />
<gridCol w:w="854" />
<gridCol w:w="5069" />
</tblGrid>
<!-- 420 Ist die Artikel-ID für diesen Rechnungstypen für diesen Positionszeitraum zulässig? follows inside this tbl -->
Note that there are 5 columns (<gridCol />) defined. This is what we expect and also what we see when we open the file in Word:
But if we go to the next page (the first page from which the <tbl>s are not included in the result anymore), although the table still looks the same (has 5 columns), there are way too many columns defined in the XML:
<tbl>
<tblPr>
<tblStyle w:val="Tabellenraster" />
<tblW w:w="14327" w:type="dxa" />
<tblLayout w:type="fixed" />
<tblLook w:val="04A0" w:firstRow="1" w:lastRow="0" w:firstColumn="1" w:lastColumn="0" w:noHBand="0" w:noVBand="1" />
</tblPr>
<tblGrid>
<gridCol w:w="704" />
<gridCol w:w="6035" />
<gridCol w:w="21" />
<gridCol w:w="7" />
<gridCol w:w="9" />
<gridCol w:w="17" />
<gridCol w:w="1399" />
<gridCol w:w="126" />
<gridCol w:w="26" />
<gridCol w:w="10" />
<gridCol w:w="819" />
<gridCol w:w="41" />
<gridCol w:w="7" />
<gridCol w:w="5042" />
<gridCol w:w="47" />
<gridCol w:w="17" />
</tblGrid>
<!-- 430 Gibt es mehr als eine Position mit dieser Artikel-ID? follows somewhere here -->
And python-docx refuses to parse this, probably because there are more <gridCol/>s than columns <tc> inside 1 <tr> row.
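To check that suspicion on the raw XML, something like the following could flag rows whose cells do not add up to the number of <gridCol/> entries. It is a standard-library sketch, not python-docx code; `grid_mismatches` and the sample table are hypothetical. It counts `w:gridSpan` (the number of grid columns a cell spans) but ignores `w:gridBefore`/`w:gridAfter` on the row, so a flagged row is a hint, not proof of a malformed file.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by .docx documents.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def grid_mismatches(tbl):
    """Return (row_index, covered_columns, grid_columns) for suspicious rows."""
    grid_cols = len(tbl.findall("w:tblGrid/w:gridCol", NS))
    result = []
    for idx, tr in enumerate(tbl.findall("w:tr", NS)):
        covered = 0
        for tc in tr.findall("w:tc", NS):
            span = tc.find("w:tcPr/w:gridSpan", NS)
            # A cell without gridSpan covers exactly one grid column.
            covered += int(span.get("{%s}val" % W)) if span is not None else 1
        if covered != grid_cols:
            result.append((idx, covered, grid_cols))
    return result


# Sample data only: 3 grid columns, the second row covers just 2 of them.
sample = (
    '<w:tbl xmlns:w="%s">' % W
    + "<w:tblGrid><w:gridCol/><w:gridCol/><w:gridCol/></w:tblGrid>"
    + "<w:tr><w:tc/><w:tc/><w:tc/></w:tr>"
    + "<w:tr><w:tc/><w:tc/></w:tr>"
    + '<w:tr><w:tc><w:tcPr><w:gridSpan w:val="2"/></w:tcPr></w:tc><w:tc/></w:tr>'
    + "</w:tbl>"
)

mismatches = grid_mismatches(ET.fromstring(sample))
print(mismatches)
```

Running this over every <tbl> of the E_0406 section would at least show whether the skipped tables are exactly the ones with the inflated grid.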
There is probably a workaround, but so far I wasn't able to find it. The problem is reproducible in a simple test: https://github.com/Hochfrequenz/ebdamame/pull/186
When using ebddocx2table to convert the E_0406 EBD to json it skips exactly 2 pages from the input file (104 & 105). This leads to steps 653 to 676 missing from the output.
File used: Entscheidungsbaum-Diagramme und Codelisten - informatorische Lesefassung 3.5
https://www.edi-energy.de/index.php?id=38&tx_bdew_bdew%5Buid%5D=2110&tx_bdew_bdew%5Baction%5D=download&tx_bdew_bdew%5Bcontroller%5D=Dokument&cHash=cb73398426723e4ee1f49cc8a71eb3a6
Settings in code: