FriendsOfPHP / Goutte

Goutte, a simple PHP Web Scraper
MIT License

Parsing error #388

Closed sachaw closed 4 years ago

sachaw commented 4 years ago

The following HTML does not seem to be parsed correctly:

<div id="titlebar"><div id="helpbutton">?</div><h2><a href="min-1246.html">Defernite</a> : <span style='font-size:smaller'>Ca<sub>6</sub>(CO<sub>3</sub>)<sub>2-x</sub>(SiO<sub>4</sub>)<sub>x</sub>(OH)<sub>7</sub>(Cl,OH)<sub>1-2x</sub>  (x<0.5)</span>, <a href="min-1856.html">Hematite</a> : <span style='font-size:smaller'>Fe<sub>2</sub>O<sub>3</sub></span><div class='titleloc'><a href="loc-2427.html">Kombat Mine, Kombat, Grootfontein, Otjozondjupa Region, Namibia</a></div></h2>   </div>

When I run:

$minerals = $crawler->filter('body > div > div#titlebar > h2')->each(function ($mineral) use (&$mineral_ids) {
    // Print the text content of each matched node
    print_r($mineral->text());
});

It returns only one element when there should be two. The problem seems to be that Goutte does not detect the closing tag of the first <span>.

Reference URL: https://www.mindat.org/photo-804.html

Thanks.

larowlan commented 4 years ago

The HTML is invalid: x<0.5 should be written as x&lt;0.5, because the bare < is seen as the start of a tag.

larowlan commented 4 years ago

Picked up via http://validator.w3.org/

sachaw commented 4 years ago

OK, thanks. I'll try to create a workaround.
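One possible workaround, sketched here rather than taken from the thread, is to escape stray < characters before the markup reaches the parser. The helper name escapeStrayLessThan is hypothetical and not part of Goutte, and the rule it applies is only a heuristic: it treats a < as text when it is not followed by a letter, "/", "!" or "?". That covers x<0.5, but it would miss a case such as x<y, where the < is followed by a letter.

```php
<?php
// Hypothetical helper (not part of Goutte): escape every "<" that cannot
// be the start of a tag, comment, doctype or processing instruction.
function escapeStrayLessThan(string $html): string
{
    // Markup starts with "<" followed by a letter, "/", "!" or "?".
    // Any other "<" is assumed to be literal text and becomes "&lt;".
    return preg_replace('/<(?![A-Za-z\/!?])/', '&lt;', $html);
}

// Shortened version of the markup from the report.
$html = "<h2><span>Ca<sub>6</sub> (x<0.5)</span>, <span>Fe<sub>2</sub>O<sub>3</sub></span></h2>";

$fixed = escapeStrayLessThan($html);
echo $fixed, PHP_EOL;
```

The cleaned string could then be handed to a Symfony DomCrawler instance instead of the raw response body, assuming the page is fetched first and its content is read from the response before parsing.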