PHPOffice / PHPWord

A pure PHP library for reading and writing word processing documents
https://phpoffice.github.io/PHPWord/

Reading table content from docx gives half or previous text #2050

Open sotirelisc opened 3 years ago

sotirelisc commented 3 years ago

I'm using the following code to read the content of a table in a docx file, but some columns return only half of their text, or text that was previously there. Track Changes is off by default.

use PhpOffice\PhpWord\IOFactory;

$phpWord = IOFactory::createReader('Word2007')->load($file);

$index = 0;
$rows = [];

$sections = $phpWord->getSections();

foreach ($sections[0]->getElements() as $el) {
    if ($el instanceof PhpOffice\PhpWord\Element\Table) {
        foreach ($el->getRows() as $row) {
            $columns = [];
            foreach ($row->getCells() as $cell) {
                foreach ($cell->getElements() as $cEl) {
                    if ($cEl instanceof PhpOffice\PhpWord\Element\Text) {
                        $columns[] = $cEl->getText();
                    } elseif ($cEl instanceof PhpOffice\PhpWord\Element\TextRun) {
                        if (count($cEl->getElements()) > 0 && $cEl->getElements()[0] instanceof PhpOffice\PhpWord\Element\Text) {
                            $columns[] = $cEl->getElements()[0]->getText();
                        }
                    } else {
                        $columns[] = "";
                    }
                }
            }

            $rows[] = $columns;

            $index++;
        }
    }
}

Is there a better way to read the table or is this an issue with the lib?

melino commented 8 months ago

I have the same issue. Looking at the docx, the cell contains the text "5.507,63". When I unpack the docx and look at the document.xml, I see that it's actually

<w:r><w:rPr><w:rFonts w:ascii="Arial"/><w:b/><w:sz w:val="20"/></w:rPr><w:t>5.</w:t></w:r><w:r w:rsidR="00114EBE"><w:rPr><w:rFonts w:ascii="Arial"/><w:b/><w:sz w:val="20"/></w:rPr><w:t>507,63</w:t></w:r>

I don't know how this happened when the docx was created, but in my case it's obviously not an issue with PHPWord. Maybe Word interpreted the dot as "end of sentence" and put some "dirt" after it and before the 507,63.
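
Since the cell text is split across two runs, reading only the first child of a TextRun (as the code in the original post does) would return just "5.". A minimal sketch of one way around that is to concatenate the text of every child instead. Note that collectText() is a hypothetical helper, not part of PHPWord, and it assumes that containers (cells, text runs) expose getElements() and that text elements expose getText(), as in the code quoted above. It checks for those methods by name rather than with instanceof, purely so the sketch can be tried without the library installed.

```php
<?php
/**
 * Recursively concatenate the text of an element and all of its children.
 *
 * Hypothetical helper (not part of PHPWord). Assumes containers expose
 * getElements() and text elements expose getText().
 */
function collectText(object $element): string
{
    // Containers: descend into every child, so text that Word split into
    // several runs ("5." + "507,63") is joined back together.
    if (method_exists($element, 'getElements')) {
        $text = '';
        foreach ($element->getElements() as $child) {
            $text .= collectText($child);
        }
        return $text;
    }

    // Leaf text elements: getText() may return null, so guard against it.
    if (method_exists($element, 'getText')) {
        $value = $element->getText();
        return is_string($value) ? $value : '';
    }

    // Anything else (line breaks, images, nested tables) adds no text here.
    return '';
}
```

In the loop from the original post, the whole inner `foreach ($cell->getElements() as $cEl)` block could then be replaced with `$columns[] = collectText($cell);`, which also gives exactly one entry per cell. Paragraphs inside a cell are joined without a separator in this sketch; add one in the container branch if that matters for your data.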

Progi1984 commented 3 months ago

@sotirelisc Hi, do you have a sample file, please?