Using an existing DOMDocument or subclass as target for parsed data

_PHPPowertools/DOM-Query is the first component of the **PHPPowertools_ framework that has been released to the public. It's purpose is similar to that of technosophos/querypath** but it's implementation is far more true to both jQuery's syntax and its semantics. For example, _PHPPowertools/DOM-Query_ lets you do stuff like this :

// Add a span tag with classes 'icon' and 'icon-printer' to all buttons
$H->select('body')->select('button')->add('span')->addClass('icon icon-printer');

// Use a lambda function to set the data-val attribute of all gallery images
$H->select('.gallery li img')->attr('data-val', function( $i, $val) {
    return $i . " - " . $val->attr('class') . " - photo by Kelly Clark";
});

What's lacking so far, is proper support for HTML5. I've been considering using _Masterminds/html5-php_ to do the DOM parsing.

The most elegant way to implement the feature, would be by adding a target option to the supported options for \Masterminds\HTML5\Parser\DOMTreeBuilder::__construct with support for following datatypes :

\DOMDocument or subclasses of \DomDocument
\DOMImplementation or subclasses of \DOMImplementation

I would like to use this feature as follows :

namespace PowerTools;

use \Symfony\Component\CssSelector\CssSelector as CssSelector;
use \Masterminds\HTML5 as HTML5;

class DOM_Document extends \DOMDocument {

    protected $_isHTML = false;

    public function __construct($data = false, $version = null, $encoding = null) {
        parent::__construct($version, $encoding);
        $data = trim($data);
        if ($data && $data != '') {
            if ($this->_isHTML) {
                $html5 = new HTML5();
                @$html5->loadHTML($data, array('target' => $this));
            } else {
                @$this->loadXML($data);
            }
        }
    }

   [ ... ]
}

I've tried adding a simple if(){}else{} statement to \Masterminds\HTML5\Parser\DOMTreeBuilder::__construct to replace $this->doc with $options['target'] if a value for $options['target'] has been set, but that doesn't seem to do it.

As an alternative, I've also considered reïmplementing \PowerTools\DOM_Document as a subclass of \DOMImplementation, but this is a far less elegant approach that introduces too many new issues to go any further in that area.

Any feedback would be appreciated!

Hi! In my opinion, it is always a bad idea trying to extend the dom document classes.

We should have learned this with prototype.js. To avoid all problems with dom extension, jQuery decided to wrap always in a class.

$("span") is just a shortcut to $("span", window.document). To port the same thing, the right approach would be new DOMQuery($domDocumentInstance).

the advantages of this approach are:

$domDocumentInstance can be obtained in a wide range of ways (using html5, php domdocument, etc)
just the dom manipulation logic is contained into DOMQuery class.
this approach separates problems, allowing a better integration between different tools.

@goetas :

Actually, the approach you're suggesting _is_ the approach that I'm taking.

The project makes use of the following classes :

\PowerTools\DOM_Document : This subclass of \DOMDocument merely adds the ECMAScript 5 methods querySelector and querySelectorAll to \DOMDocument (used for DOM selection) and improves the methods for saving to or loading from strings. It contains no DOM manipulation logic.
\PowerTools\DOM_HTML : This subclass of \PowerTools\DOM_Document is merely a convenience abstraction for \PowerTools\DOM_Document with no realy differences. So it also contains no DOM manipulation logic.
\PowerTools\DOM_XML : This subclass of \PowerTools\DOM_Document is merely a convenience abstraction for \PowerTools\DOM_Document. Right now, the only difference with \PowerTools\DOM_HTML is the value of the flag $_isHTML. So it also contains no DOM manipulation logic.
\PowerTools\DOM_Query : This class contains all the DOM manipulation logic. Running the constructor of of \PowerTools\DOM_Query adds an instance of \PowerTools\DOM_Document to public property $this->DOM. When a method of \PowerTools\DOM_Query returns another \PowerTools\DOM_Query, it merely copies the reference to $this->DOM and thus keeps using the same \PowerTools\DOM_Document to keep overhead as minimal as possible and flexibility as maximal as possible.
\PowerTools\ DOM_Helper : This class contains just a few static helper methods used by \PowerTools\DOM_Query.

1) putting the css query selector inside the document is a bad idea. even the domxpath is a separate class. 2) subclassing is different form wrapping 3) DOM_Query::_construct should take the domdocument as input, instead instantiating the new one

1) putting the css query selector inside the document is a bad idea. even the domxpath is a separate class.

In JavasScript, querySelectorAll and querySelector are methods of the DOMDocument and DOMElement objects. As the purpose of this project is to copy the behavior of jQuery, it makes sense to also copy this behavior.

Also, I see little benefit of not including them inside the document object. The actual CSS selection logic is taken case of by Symfony's CssSelector component, with the querySelectorAll method acting as an abstraction wrapper around it and the querySelector method as an abstraction wrapper around querySelectorAll.

The code of \PHPPowerTools\DOM_Query::querySelectorAll :

    public function querySelectorAll($selector, $contextnode = null) {
        if ($this->_isHTML) {
            CssSelector::enableHtmlExtension();
        } else {
            CssSelector::disableHtmlExtension();
        }
        $xpath = new \DOMXpath($this);
        return $xpath->query(CssSelector::toXPath($selector, 'descendant::'), $contextnode);
    }

The code of \PHPPowerTools\DOM_Query::querySelector :

    public function querySelector($selector, $contextnode = null) {
        $items = $this->querySelectorAll($selector, $contextnode);
        if ($items->length > 0) {
            return $items->item(0);
        }
        return null;
    }

2) subclassing is different form wrapping

Valid point. Where I say "wrapper", I actually mean to say "abstraction". I'll fix that in my previous post.

Note that I don't really distinguish between wrappers and subclasses in cases where these are used for abstracting and simplifying an interface. They're just different ways to achieve the same.

3) DOM_Query::_construct should take the domdocument as input, instead instantiating the new one

Actually, \PHPPowerTools\DOM_Query::_construct accepts both a string or an instance of \PHPPowerTools\DOM_Document as valid input.

The code of \PHPPowerTools\DOM_Query::_construct :

    public function __construct($source, $isHtml = true) {
        if (is_string($source)) {
            if ($isHtml) {
                $this->DOM = new DOM_HTML($source);
                $this->isHtml = true;
            } else {
                $this->DOM = new DOM_XML($source);
                $this->isHtml = false;
            }
        } else {
            $this->DOM = $source;
            $this->isHtml = $isHtml;
        }
        $this->nodes = array($this->DOM);
    }

I see no valid reason for only allowing an instance of \PHPPowerTools\DOM_Document. On the other hand, there are several reasons for allowing both :

It's more flexible
It's more similar to jQuery's way of doing things
It allows users to use DOM-Query without knowing anything about native DOM crawling and manipulation, much like jQuery does for a JavaScript environment

The jQuery way :

// Define your DOMCrawler
$ = jQuery;

// Passing a string (CSS selector)
$s = $( 'div.foo' );

// Passing an element object (DOM Element)
$s = $( document.body );

// Passing a jQuery object
$s = $( $('p + p') );

The DOM-Query way :

namespace PowerTools;

// Get file content
$htmlcode = file_get_contents( 'https://github.com' );

// Define your DOMCrawler based on file string
$H = new DOM_Query( $htmlcode );

// Define your DOMCrawler based on an existing DOM_Query instance
$H = new DOM_Query( $H->select('body') );

// Passing a string (CSS selector)
$s = $H->select( 'div.foo' );

// Passing an element object (DOM Element)
$s = $H->select( $documentBody );

// Passing a DOM Query object
$s = $H->select( $H->select('p + p') );

I fixed the problem.

For this, I needed to two new options.

These options are implemented only in \Masterminds\HTML5\Parser\DOMTreeBuilder and have the following purpose :

implicitHtmlNamespace = Allows the use of createElement instead of createElementNS for HTML elements. This is required for compatibility with \PHPPowertools\DOM-Query and for compatibility with \Symfony\Component\CssSelector\CssSelector.
target = allows an existing DOMDocument (or subclass thereof) to be passsed to the DOMTreeBuilder instead of creating a new one. This option is required for compatibility with \PHPPowertools\DOM-Query

I created a new pull request with the two changes -> https://github.com/Masterminds/html5-php/pull/69

Masterminds / html5-php