duzun / hQuery.php

An extremely fast web scraper that parses megabytes of invalid HTML in a blink of an eye. PHP5.3+, no dependencies.
https://duzun.me/playground/hquery
MIT License
361 stars, 74 forks

Non-obvious behavior of the ->attr('href') method when a <base> tag is present #95

Closed PavelFil closed 1 month ago

PavelFil commented 2 months ago

This code:

use duzun\hQuery;
$html = '<html>
    <head>
        <base href="https://example.com/" />
    </head>
    <body>
        <a href="#hash">link</a>
    </body>
</html>';
$hQuery = hQuery::fromHTML($html);
$links = $hQuery->find('a');
foreach($links as $link) {
    var_dump($link->attr('href'));
    var_dump($link->href);
}

returns:

string(25) "https://example.com/#hash"
string(25) "https://example.com/#hash"

So there is no way to get the raw href attribute. We have to add $hQuery->baseURI(null); right after $hQuery = hQuery::fromHTML($html); to prevent the links from being modified.

I suspect this is some old functionality that can be removed.

duzun commented 2 months ago

Hmmm, this is a very good catch!

I've tried to align hQuery with the browser and jQuery behavior in #48ad78:

This is a potentially breaking change for some code out there.

I've incremented the minor version, though strictly speaking I should have increased the major version. On the other hand, it is not a big change, so I'm hesitant to bump the major version.
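
For readers unfamiliar with the browser and jQuery behavior referred to above: in a browser, getAttribute('href') and jQuery's .attr('href') return the attribute exactly as written in the markup, while the .href property returns the URL resolved against <base>. The sketch below illustrates that distinction using only PHP's bundled DOM extension, not hQuery. The resolveHref() helper is hypothetical and deliberately simplified (it assumes a scheme://host base and does not normalize "./" or "../" segments). It is not the code from the commit mentioned above.

```php
<?php
// Hypothetical helper (not part of hQuery): resolve $href against $base.
// Simplified on purpose: it does not normalize "./" or "../" segments.
function resolveHref(string $href, ?string $base): string
{
    // No usable base, or the href already has a scheme: return it untouched.
    if ($base === null
        || !preg_match('~^([a-z][a-z0-9+.-]*:)//[^/?#]*~i', $base, $m)
        || preg_match('~^[a-z][a-z0-9+.-]*:~i', $href)) {
        return $href;
    }
    $origin = $m[0];                                   // e.g. "https://example.com"
    $rest   = explode('#', substr($base, strlen($origin)), 2)[0]; // drop base fragment
    $parts  = explode('?', $rest, 2);
    $path   = $parts[0] === '' ? '/' : $parts[0];
    $query  = isset($parts[1]) ? '?' . $parts[1] : '';

    if ($href === '' || $href[0] === '#') {
        return $origin . $path . $query . $href;       // fragment-only reference
    }
    if (strncmp($href, '//', 2) === 0) {
        return $m[1] . $href;                          // protocol-relative reference
    }
    if ($href[0] === '/') {
        return $origin . $href;                        // root-relative reference
    }
    if ($href[0] === '?') {
        return $origin . $path . $href;                // query-only reference
    }
    // Path-relative reference: replace the last segment of the base path.
    return $origin . substr($path, 0, strrpos($path, '/') + 1) . $href;
}

$html = '<html><head><base href="https://example.com/" /></head>'
      . '<body><a href="#hash">link</a></body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);

$baseTag = $doc->getElementsByTagName('base')->item(0);
$base    = $baseTag ? $baseTag->getAttribute('href') : null;
$link    = $doc->getElementsByTagName('a')->item(0);

$raw      = $link->getAttribute('href');   // attribute as written in the markup
$resolved = resolveHref($raw, $base);      // what a browser's .href would report

var_dump($raw);       // string(5) "#hash"
var_dump($resolved);  // string(25) "https://example.com/#hash"
```

Keeping the raw value and the resolved value behind separate accessors lets callers choose which one they need, instead of having to clear the base URI to get the original attribute.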