Masterminds / html5-php

An HTML5 parser and serializer for PHP.
http://masterminds.github.io/html5-php/
Other
1.58k stars 113 forks source link

Using an existing DOMDocument or subclass as target for parsed data #67

Closed jslegers closed 9 years ago

jslegers commented 9 years ago

_PHPPowertools/DOM-Query is the first component of the **PHPPowertools_ framework that has been released to the public. It's purpose is similar to that of technosophos/querypath** but it's implementation is far more true to both jQuery's syntax and its semantics. For example, _PHPPowertools/DOM-Query_ lets you do stuff like this :

// Add a span tag with classes 'icon' and 'icon-printer' to all buttons
$H->select('body')->select('button')->add('span')->addClass('icon icon-printer');

// Use a lambda function to set the data-val attribute of all gallery images
$H->select('.gallery li img')->attr('data-val', function( $i, $val) {
    return $i . " - " . $val->attr('class') . " - photo by Kelly Clark";
});

What's lacking so far, is proper support for HTML5. I've been considering using _Masterminds/html5-php_ to do the DOM parsing.

The most elegant way to implement the feature, would be by adding a target option to the supported options for \Masterminds\HTML5\Parser\DOMTreeBuilder::__construct with support for following datatypes :

I would like to use this feature as follows :

namespace PowerTools;

use \Symfony\Component\CssSelector\CssSelector as CssSelector;
use \Masterminds\HTML5 as HTML5;

class DOM_Document extends \DOMDocument {

    protected $_isHTML = false;

    public function __construct($data = false, $version = null, $encoding = null) {
        parent::__construct($version, $encoding);
        $data = trim($data);
        if ($data && $data != '') {
            if ($this->_isHTML) {
                $html5 = new HTML5();
                @$html5->loadHTML($data, array('target' => $this));
            } else {
                @$this->loadXML($data);
            }
        }
    }

   [ ... ]
}

I've tried adding a simple if(){}else{} statement to \Masterminds\HTML5\Parser\DOMTreeBuilder::__construct to replace $this->doc with $options['target'] if a value for $options['target'] has been set, but that doesn't seem to do it.

As an alternative, I've also considered reïmplementing \PowerTools\DOM_Document as a subclass of \DOMImplementation, but this is a far less elegant approach that introduces too many new issues to go any further in that area.

Any feedback would be appreciated!

See also https://github.com/PHPPowertools/DOM-Query/issues/1

goetas commented 9 years ago

Hi! In my opinion, it is always a bad idea trying to extend the dom document classes.

We should have learned this with prototype.js. To avoid all problems with dom extension, jQuery decided to wrap always in a class.

$("span") is just a shortcut to $("span", window.document). To port the same thing, the right approach would be new DOMQuery($domDocumentInstance).

the advantages of this approach are:

jslegers commented 9 years ago

@goetas :

Actually, the approach you're suggesting _is_ the approach that I'm taking.

The project makes use of the following classes :

goetas commented 9 years ago

1) putting the css query selector inside the document is a bad idea. even the domxpath is a separate class. 2) subclassing is different form wrapping 3) DOM_Query::_construct should take the domdocument as input, instead instantiating the new one

jslegers commented 9 years ago

1) putting the css query selector inside the document is a bad idea. even the domxpath is a separate class.

In JavasScript, querySelectorAll and querySelector are methods of the DOMDocument and DOMElement objects. As the purpose of this project is to copy the behavior of jQuery, it makes sense to also copy this behavior.

Also, I see little benefit of not including them inside the document object. The actual CSS selection logic is taken case of by Symfony's CssSelector component, with the querySelectorAll method acting as an abstraction wrapper around it and the querySelector method as an abstraction wrapper around querySelectorAll.

The code of \PHPPowerTools\DOM_Query::querySelectorAll :
    public function querySelectorAll($selector, $contextnode = null) {
        if ($this->_isHTML) {
            CssSelector::enableHtmlExtension();
        } else {
            CssSelector::disableHtmlExtension();
        }
        $xpath = new \DOMXpath($this);
        return $xpath->query(CssSelector::toXPath($selector, 'descendant::'), $contextnode);
    }
The code of \PHPPowerTools\DOM_Query::querySelector :
    public function querySelector($selector, $contextnode = null) {
        $items = $this->querySelectorAll($selector, $contextnode);
        if ($items->length > 0) {
            return $items->item(0);
        }
        return null;
    }

2) subclassing is different form wrapping

Valid point. Where I say "wrapper", I actually mean to say "abstraction". I'll fix that in my previous post.

Note that I don't really distinguish between wrappers and subclasses in cases where these are used for abstracting and simplifying an interface. They're just different ways to achieve the same.


3) DOM_Query::_construct should take the domdocument as input, instead instantiating the new one

Actually, \PHPPowerTools\DOM_Query::_construct accepts both a string or an instance of \PHPPowerTools\DOM_Document as valid input.

The code of \PHPPowerTools\DOM_Query::_construct :
    public function __construct($source, $isHtml = true) {
        if (is_string($source)) {
            if ($isHtml) {
                $this->DOM = new DOM_HTML($source);
                $this->isHtml = true;
            } else {
                $this->DOM = new DOM_XML($source);
                $this->isHtml = false;
            }
        } else {
            $this->DOM = $source;
            $this->isHtml = $isHtml;
        }
        $this->nodes = array($this->DOM);
    }

I see no valid reason for only allowing an instance of \PHPPowerTools\DOM_Document. On the other hand, there are several reasons for allowing both :

  1. It's more flexible
  2. It's more similar to jQuery's way of doing things
  3. It allows users to use DOM-Query without knowing anything about native DOM crawling and manipulation, much like jQuery does for a JavaScript environment
The jQuery way :
// Define your DOMCrawler
$ = jQuery;

// Passing a string (CSS selector)
$s = $( 'div.foo' );

// Passing an element object (DOM Element)
$s = $( document.body );

// Passing a jQuery object
$s = $( $('p + p') );
The DOM-Query way :
namespace PowerTools;

// Get file content
$htmlcode = file_get_contents( 'https://github.com' );

// Define your DOMCrawler based on file string
$H = new DOM_Query( $htmlcode );

// Define your DOMCrawler based on an existing DOM_Query instance
$H = new DOM_Query( $H->select('body') );

// Passing a string (CSS selector)
$s = $H->select( 'div.foo' );

// Passing an element object (DOM Element)
$s = $H->select( $documentBody );

// Passing a DOM Query object
$s = $H->select( $H->select('p + p') );
jslegers commented 9 years ago

I fixed the problem.

For this, I needed to two new options.

These options are implemented only in \Masterminds\HTML5\Parser\DOMTreeBuilder and have the following purpose :

I created a new pull request with the two changes -> https://github.com/Masterminds/html5-php/pull/69