PHPOffice / PHPWord

A pure PHP library for reading and writing word processing documents
https://phpoffice.github.io/PHPWord/

"MsDoc" reader fails to open and/or correctly process MS Word 97-2003 (*.doc) files #1318

Open voltel opened 6 years ago

voltel commented 6 years ago

Expected Behavior

1) An MS Word 97-2003 document (*.doc) should be opened and processed correctly by:

        $phpWord = IOFactory::load($c_file_name, 'MsDoc'); // this line causes error

2) Styles should be set internally in the generatePhpWord() method of MsDoc.php:

        foreach ($this->arraySections as $itmSection) {
            $oSection = $this->phpWord->addSection();
            $oSection->setStyle($itmSection->styleSection); // this line causes error
        ...

Current Behavior

Inconsistent errors, any of the following:

Notice: Uninitialized string offset: 327680 (or some other very large number), traced to getInt2d() and/or getInt1d() in vendor\phpoffice\phpword\src\PhpWord\Reader\MsDoc.php (line 2317)

or

Fatal error: Uncaught PhpOffice\PhpWord\Exception\Exception: Could not open resources/resources/n_466.doc for reading! File does not exist, or it is not readable. in D:\xxx\xxx\vendor\phpoffice\phpword\src\PhpWord\Shared\OLERead.php:78

or

Notice: Undefined property: stdClass::$styleSection traced to vendor\phpoffice\phpword\src\PhpWord\Reader\MsDoc.php generatePhpWord()

or, when it does manage to convert a test file, the layout is completely wrong: no styles, line breaks in the wrong places, parts of words missing, and the table not reproduced.

The elements recognized by the snippet below are all of type Text; paragraphs are not recognized, and a simple table is not recognized at all.
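The "Uninitialized string offset" notice above is what PHP emits when a string is indexed past its end, which is presumably what happens when the reader follows an offset it has misread from the file. The following is a minimal sketch of a bounds-checked 16-bit little-endian read. `readUInt16LE` is a hypothetical helper written for illustration; it is not part of PHPWord and is not a drop-in patch for getInt2d().

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of PHPWord): read an unsigned 16-bit
 * little-endian integer from $data at byte offset $pos, throwing instead
 * of emitting an "Uninitialized string offset" notice.
 */
function readUInt16LE(string $data, int $pos): int
{
    if ($pos < 0 || $pos + 2 > strlen($data)) {
        throw new OutOfBoundsException(
            sprintf('Offset %d is outside the %d-byte buffer', $pos, strlen($data))
        );
    }

    // 'v' = unsigned 16-bit integer, always little-endian.
    return unpack('v', $data, $pos)[1];
}
```

With a check like this, a bogus offset such as 327680 on a short buffer would surface as an exception naming the offset and the buffer size, which makes it easier to tell a corrupt or unsupported file from a reader bug.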

Failure Information

I tried every variant of MS Word 97-2003 document I could produce. I processed downloaded files (e.g. from here: n_466.doc or d466.doc), and I created new files manually in both versions of MS Word available to me (2007 and 365) and saved them as *.doc. The set-up provided below works fine with the same documents saved as .docx files (which use a different reader class).

test_documents.zip

Version, copied from composer.json: "phpoffice/phpword": "^0.14.0",

or from composer.lock:

        "name": "phpoffice/phpword",
        "version": "v0.14.0",
        "source": {
            "type": "git",
            "url": "https://github.com/PHPOffice/PHPWord.git",
            "reference": "b614497ae6dd44280be1c2dda56772198bcd25ae"
        },
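Because a .doc extension does not guarantee the binary Word 97-2003 format, it may help to confirm that a test file is readable and really is an OLE2 compound file before handing it to the 'MsDoc' reader. The sketch below checks the 8-byte compound-file signature. `looksLikeOleDoc` is a hypothetical helper, not a PHPWord function, and a matching signature only shows the container type, not that the reader can parse the contents.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical pre-flight check (not part of PHPWord): returns true when
 * $path is a readable file that starts with the 8-byte OLE2 compound-file
 * signature used by binary Word 97-2003 documents.
 */
function looksLikeOleDoc(string $path): bool
{
    if (!is_file($path) || !is_readable($path)) {
        return false;
    }

    // Read only the first 8 bytes of the file.
    $signature = file_get_contents($path, false, null, 0, 8);

    return $signature === "\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1";
}
```

A false result for a path that should exist also points at the "File does not exist, or it is not readable" case, for example a relative path resolved against an unexpected working directory.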

How to Reproduce

This is part of a Symfony 4 project.

Service class:

<?php
namespace App\Service\Parser;

use PhpOffice\PhpWord\Element\{
    Line,
    Section,
    Table,
    Text,
    TextBreak,
    TextRun
};

use PhpOffice\PhpWord\IOFactory;

class DecParser
{
    /**
     * @param string $c_file_name
     * @return array
     * @throws \Exception
     */
    public function get_doc_tables_array(string $c_file_name) : array
    {
        $a_tables = [];

        $readerName = null;
        if (preg_match('/\.(\w*)$/', $c_file_name, $a_matches)) {
            if ($a_matches[1] == 'docx') $readerName = 'Word2007';
            else if ($a_matches[1] == 'doc') $readerName = 'MsDoc';
        }//endif

        //dump('Reader name: ' . $readerName);
        $phpWord = IOFactory::load($c_file_name, $readerName);
        $a_sections = $phpWord->getSections();

        $table_index = 0;
        foreach ($a_sections as $this_section) {
            foreach ($this_section->getElements() as $el) {

                if ($el instanceof Table) {
                    foreach ($el->getRows() as $row_index => $row) {
                        $a_tables[$table_index][$row_index] = [];
                        foreach ($row->getCells() as $col_index => $cell) {
                            $a_tables[$table_index][$row_index][$col_index] = '';

                            foreach ($cell->getElements() as $cell_el) {
                                $a_tables[$table_index][$row_index][$col_index] .= self::extract_text_from_element($cell_el);
                            }//endforeach

                        }//endforeach
                    }//endforeach

                    $table_index++;
                }//endif
            }//endforeach

        }//endforeach
        return $a_tables;
    }//end of function

    /**
     * @param $el
     * @param int $depth
     * @return null|string
     * @throws \Exception
     */
    private static function extract_text_from_element($el, $depth = 0) : ?string
    {
        $c_text = null;

        if ($depth > 100) throw new \Exception("Depth of recursions is over the limit of 100 in " . __METHOD__);

        if ($el instanceof Line) {
            $c_text = "\n\n";

        } else if ($el instanceof TextBreak) {
            $c_text = "\n";

        } else if ($el instanceof Text) {
            $c_text = $el->getText();

        } else if ($el instanceof TextRun) {
            $depth++;
            $a_elements = $el->getElements();

            $c_text = '';
            foreach($a_elements as $this_el) {
                $c_text .= self::extract_text_from_element($this_el, $depth);
            }//endforeach

            if (count($a_elements) > 0 ) {
                $c_text .= "\n";
            }//endif
        }//endif

        return $c_text;
    }//end of function

}//end of class
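The reader selection in get_doc_tables_array() matches the extension case-sensitively, so a file named REPORT.DOC would fall through with no reader name. A minimal sketch of a more forgiving lookup follows. `pickReaderName` is a hypothetical helper; the reader names 'Word2007' and 'MsDoc' are the ones used in the service class above.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: map a file name to a PHPWord reader name by its
 * extension, ignoring case. Returns null for unsupported extensions so the
 * caller can decide how to fail.
 */
function pickReaderName(string $fileName): ?string
{
    $extension = strtolower(pathinfo($fileName, PATHINFO_EXTENSION));

    $readers = [
        'docx' => 'Word2007',
        'doc'  => 'MsDoc',
    ];

    return $readers[$extension] ?? null;
}
```

Returning null explicitly leaves it to the caller to throw a clear "unsupported file type" error instead of passing an empty reader name on to the loader.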

Controller class:

<?php
namespace App\Controller;

use App\Service\Parser\DecParser;

use Symfony\Bundle\FrameworkBundle\Controller\Controller;
use Symfony\Component\HttpFoundation\Request;
use Symfony\Component\HttpFoundation\Response;
use Symfony\Component\Routing\Annotation\Route;

/**
 * @Route("/parse")
 */
class ParserController extends Controller
{
    /**
     * @Route("/dec")
     * The DecParser argument is injected as a dependency when the controller is executed;
     * alternatively, you can create a new instance of the DecParser service class above.
     */
    public function show_parsed_doc(DecParser $parser) : Response
    {
        $doc_name = '../docs/temp/d466.docx'; // change this to real file location

        $a_tables = $parser->get_doc_tables_array($doc_name);
        $a_template_data = [
            'tables' => $a_tables
        ];

        // edit twig template to visualize the table data - the sample is provided below
        return $this->render('dec/dec_orders.html.twig', $a_template_data);
    }//end of function

}//end of class

Sample implementation of twig template

{% extends "base.html.twig" %}

{% block title %}Parsed tables{% endblock %}

{% block main %}
    {% if tables is defined %}
        {% for this_table in tables %}
            <h2>Table {{ loop.index }}</h2>
            <table class="table table-bordered table-light">
                <tbody>
                {% for row in this_table %}
                    <tr>
                        {% for cell in row %}
                            <td>{{ cell }} </td>
                        {% endfor %}
                    </tr>
                {% endfor %}
                </tbody>
            </table>
        {% endfor %}
    {% endif %}

{% endblock %}

Context

nickpoulos commented 5 years ago

Hey @Progi1984, any luck with this? I am seeing similar behavior. It seems unable to read a pretty standard Word97 doc with no special formatting. Instead I get broken, fragmented text, and/or other sections are missing entirely.

I was able to get much better results from a simple fread-style function. But that was only useful for plaintext extraction; unfortunately it gives no style or formatting data.

function readWord($filename) {
    // Open in binary mode so no byte translation happens on Windows.
    if (!is_file($filename) || ($fh = fopen($filename, 'rb')) === false) {
        return false;
    }

    // Read the first 0xA00 bytes; the text is then read from right after them.
    $headers = fread($fh, 0xA00);
    if ($headers === false || strlen($headers) < 0x220) {
        fclose($fh);
        return false;
    }

    // Bytes 0x21C-0x21F are combined, least significant byte first,
    // into the text length; the first two bytes are offset by 1 and 8.
    $n1 = ord($headers[0x21C]) - 1;
    $n2 = (ord($headers[0x21D]) - 8) * 256;
    $n3 = ord($headers[0x21E]) * 256 * 256;
    $n4 = ord($headers[0x21F]) * 256 * 256 * 256;

    // Total length of text in the document
    $textLength = $n1 + $n2 + $n3 + $n4;

    // fread() rejects a non-positive length, so return false in that case.
    $extracted_plaintext = $textLength > 0 ? fread($fh, $textLength) : false;
    fclose($fh);

    return $extracted_plaintext;
}
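The byte arithmetic in readWord() can be read as a single 32-bit little-endian integer minus a constant: $n1 + $n2 + $n3 + $n4 equals (b0 + 256·b1 + 65536·b2 + 16777216·b3) − 1 − 8·256, that is, the raw value minus 0x801. A minimal sketch of the same computation with unpack() follows. `textLengthFromHeader` is a hypothetical name, and the offsets are taken from the snippet above, not from a format specification.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: compute the same text length as the byte-by-byte
 * arithmetic in readWord(), using one 32-bit little-endian read.
 * Returns null when the header is too short to contain the field.
 */
function textLengthFromHeader(string $headers): ?int
{
    if (strlen($headers) < 0x220) {
        return null;
    }

    // 'V' = unsigned 32-bit integer, always little-endian.
    $raw = unpack('V', $headers, 0x21C)[1];

    // 0x801 = 1 + 8 * 256, the two adjustments applied in readWord().
    return $raw - 0x801;
}
```

This only restates the arithmetic; like the original function, it assumes the header layout of the files it was tried on.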

woaijiangjing commented 4 years ago

bad

ijohnson-TCR commented 1 year ago

Any updates on this? I am trying to convert a .doc file to PDF. It works, but part of the text is cut off in the PDF and the italics are gone.

  require 'vendor/autoload.php';

  use PhpOffice\PhpWord\IOFactory;
  use PhpOffice\PhpWord\Settings;

  Settings::setPdfRendererName(Settings::PDF_RENDERER_DOMPDF);
  Settings::setPdfRendererPath('.');

  $phpWord = IOFactory::load('TEST2.doc', 'MsDoc');
  $phpWord->save('word_doc.pdf', 'PDF');
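Since the original report shows the MsDoc reader emitting notices, not exceptions, a conversion script like this can silently produce truncated output. One way to make such problems visible is to promote PHP notices and warnings to exceptions while the document is loaded. The sketch below uses only core PHP; `callStrict` is a hypothetical helper name.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: run $fn with a temporary error handler that turns
 * PHP notices and warnings into ErrorException, then restore the previous
 * handler whether or not $fn succeeded.
 */
function callStrict(callable $fn)
{
    set_error_handler(
        static function (int $severity, string $message, string $file, int $line): bool {
            throw new ErrorException($message, 0, $severity, $file, $line);
        }
    );

    try {
        return $fn();
    } finally {
        restore_error_handler();
    }
}
```

In the script above, the load call could then be wrapped as callStrict(function () { return IOFactory::load('TEST2.doc', 'MsDoc'); }), so that a failed read stops the conversion with a stack trace instead of yielding a partial PDF.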

BarryBravo commented 2 months ago

Same problem here. Any solutions?

Progi1984 commented 2 months ago

@BarryBravo Hi, could you give us a sample file that produces this error, please?

AlexAndriets commented 22 hours ago

@Progi1984 I have a problem reading the Word 97-2003 format.

use Dompdf\Dompdf; // needed for "new Dompdf()" below, unless already imported

$pdf_uri = 'pdf.pdf';
$html_uri = 'html.html';
// Load the .doc file and write it out as HTML first
$word = \PhpOffice\PhpWord\IOFactory::load(storage_path($exdoc->filelocation), 'MsDoc');
$writer = \PhpOffice\PhpWord\IOFactory::createWriter($word, 'HTML');
$writer->save($html_uri);
// Render the intermediate HTML to PDF with Dompdf
$pdf = new Dompdf();
$pdf->loadHtml(file_get_contents($html_uri));
$pdf->setPaper('A4', 'portrait');
$pdf->render();
$output = $pdf->output();
file_put_contents($pdf_uri, $output);

[screenshot]

Document1_doc.zip

After saving a copy of this document with OpenOffice, I get the following result:

[screenshot]

Document2_doc.zip