FriendsOfPHP / Goutte

Goutte, a simple PHP Web Scraper
MIT License

Removing nodes #377

Closed duncanmcclean closed 5 years ago

duncanmcclean commented 5 years ago

Hi,

Is there any way using Goutte to remove nodes? I'm building a parser but I need to remove header and footer elements, along with others.

I've briefly looked at the source code for Symfony's DomCrawler and found the reduce function, but I'm unsure how to use it.

NinoSkopac commented 5 years ago

$node->parentNode->removeChild($node); will do it :)

Hopefully it makes ReadCast an even better software :)
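
The removeChild call above works on plain DOM objects, so `$node` has to be a DOMNode (or a subclass such as DOMElement). A minimal sketch using only PHP's built-in DOM extension, with a made-up HTML string and no Goutte or Symfony involved:

```php
<?php
// Plain ext-dom sketch; the HTML string is invented for illustration.
libxml_use_internal_errors(true); // keep libxml quiet about HTML5 tags such as <header>

$doc = new DOMDocument();
$doc->loadHTML('<html><body><header>Top</header><p>Keep me</p></body></html>');

// $node is a DOMElement here, which is what exposes ->parentNode.
$node = $doc->getElementsByTagName('header')->item(0);
$node->parentNode->removeChild($node);

// Print the document after the <header> element has been detached.
echo $doc->saveHTML(), "\n";
```

The node is detached from its parent in place, so the document no longer contains it when it is serialized.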

duncanmcclean commented 5 years ago

Do I put that inside a filter @NinoSkopac?

NinoSkopac commented 5 years ago

No, in the callback closure.

duncanmcclean commented 5 years ago

Sorry, I don't understand. Do you mind providing an example?

duncanmcclean commented 5 years ago

I have just attempted to do this using this code:

$crawler
    ->filter('nav, header, #nav, #header')
    ->each(function ($node) use (&$body) {
        $node->parentNode->removeChild($node);
    });

With that I get the error message, Undefined property: Symfony\Component\DomCrawler\Crawler::$parentNode

stof commented 5 years ago

The argument passed to the each() callback is a Crawler object, not a DOMElement node. You need to get the node out of it.

NinoSkopac commented 5 years ago

Exactly, typehint it like this: function (Crawler $node) { ...callback code... }

NinoSkopac commented 5 years ago

And you can get the node out of it thus: $node->getNode(0);
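
Putting the replies above together, here is a sketch of the each() callback with the node unwrapped first. It assumes symfony/dom-crawler and symfony/css-selector are installed through Composer, and it builds the Crawler from a made-up HTML string instead of a page fetched by Goutte. The names `$html` and `$match` are illustrative only.

```php
<?php
// Assumption: dependencies installed with Composer in the current directory.
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

$html = '<html><body><header>Site header</header><p>Article text</p><footer>Site footer</footer></body></html>';
$crawler = new Crawler($html);

$crawler->filter('header, footer')->each(function (Crawler $match) {
    // each() hands over a Crawler wrapping one match; unwrap the DOM node.
    $node = $match->getNode(0);

    if ($node !== null && $node->parentNode !== null) {
        $node->parentNode->removeChild($node);
    }
});

// The original $crawler now reflects the modified document.
echo $crawler->filter('body')->html(), "\n";
```

The type hint only documents what the callback receives. The actual fix is the getNode(0) call, which returns the underlying DOMNode that has a parentNode property.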

duncanmcclean commented 5 years ago

Thanks for your help! Managed to achieve it!

$crawler
    ->filter('head, script, nav, header, #nav, #header, .navbar, #navbar, footer, .footer, #footer, .sidebar, #sidebar, .comments, #comments, .pagination, button')
    ->each(function (Crawler $crawler) {
        foreach ($crawler as $node) {
            $node->parentNode->removeChild($node);
        }
    });
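
For reuse, the same idea can be wrapped in a small function. `removeNodes` below is a hypothetical helper, not part of Goutte or DomCrawler. The sketch assumes the same Composer setup (symfony/dom-crawler plus symfony/css-selector) and uses an invented HTML string. It iterates the filtered Crawler directly, which yields DOMNode objects, so each() is not needed.

```php
<?php
// Assumption: dependencies installed with Composer in the current directory.
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

/**
 * Hypothetical helper: detach every node matching $selector from the
 * document behind $crawler and return how many nodes were removed.
 */
function removeNodes(Crawler $crawler, string $selector): int
{
    $removed = 0;

    // Copy the matches into a plain array before modifying the DOM.
    foreach (iterator_to_array($crawler->filter($selector), false) as $node) {
        if ($node->parentNode !== null) {
            $node->parentNode->removeChild($node);
            $removed++;
        }
    }

    return $removed;
}

$crawler = new Crawler('<html><body><nav>Menu</nav><div id="content"><p>Text</p><button>Share</button></div></body></html>');

$count = removeNodes($crawler, 'nav, button');

echo $count, "\n";
echo $crawler->filter('#content')->html(), "\n";
```

Returning the count makes it easy to check that a selector matched anything at all, which helps catch typos in a long selector list.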