crwlrsoft / crawler

Library for Rapid (Web) Crawler and Scraper Development
https://www.crwlr.software/packages/crawler
MIT License

Fatal error: is not a valid URL #109

Closed flanderboy closed 1 year ago

flanderboy commented 1 year ago

Hello again, I'm trying to get articles from a website, but I receive this error:

PHP Fatal error:  Uncaught Crwlr\Url\Exceptions\InvalidUrlException: 2023-06-24T19:29:00+02:00 is not a valid URL. in /composer/vendor/crwlr/url/src/Url.php:771
Stack trace:
#0 /composer/vendor/crwlr/url/src/Url.php(80): Crwlr\Url\Url->validate()
#1 //composer/vendor/crwlr/url/src/Url.php(93): Crwlr\Url\Url->__construct()
#2 /composer/vendor/crwlr/url/src/Url.php(103): Crwlr\Url\Url::parse()
#3 /composer/vendor/crwlr/crawler/src/Steps/Step.php(180): Crwlr\Url\Url::parsePsr7()
#4 /composer/vendor/crwlr/crawler/src/Steps/Loading/Http.php(237): Crwlr\Crawler\Steps\Step->validateAndSanitizeToUriInterface()
#5 /composer/vendor/crwlr/crawler/src/Steps/Step.php(45): Crwlr\Crawler\Steps\Loading\Http->validateAndSanitizeInput()
#6 /composer/vendor/crwlr/crawler/src/Crawler.php(230): Crwlr\Crawler\Steps\Step->invokeStep()
#7 /composer/vendor/crwlr/crawler/src/Crawler.php(240): Crwlr\Crawler\Crawler->invokeStepsRecursive()
#8 /composer/vendor/crwlr/crawler/src/Crawler.php(240): Crwlr\Crawler\Crawler->invokeStepsRecursive()
#9 /composer/vendor/crwlr/crawler/src/Crawler.php(240): Crwlr\Crawler\Crawler->invokeStepsRecursive()
#10 /composer/vendor/crwlr/crawler/src/Crawler.php(240): Crwlr\Crawler\Crawler->invokeStepsRecursive()
#11 /composer/vendor/crwlr/crawler/src/Crawler.php(277): Crwlr\Crawler\Crawler->invokeStepsRecursive()
#12 /composer/vendor/crwlr/crawler/src/Crawler.php(263): Crwlr\Crawler\Crawler->storeAndReturnDefinedResults()
#13 /composer/vendor/crwlr/crawler/src/Crawler.php(187): Crwlr\Crawler\Crawler->storeAndReturnResults()
#14 /script/test/qdg.php(696): Crwlr\Crawler\Crawler->run()
#15 {main}
  thrown in /composer/vendor/crwlr/url/src/Url.php on line 771

I know that 2023-06-24T19:29:00+02:00 is not a valid URL, but I don't know where to catch it.

Any idea how to check whether a string is a valid URL before it is used?

This is my code:

$crawler->input('https://mywebsite.com')->addStep(
    Http::get()->paginate('[class="pagination"] a', 50)
)->addStep(
    Html::each('[class="thematic__row"] article header a')->extract([
        'url' => Dom::cssSelector('a')->attribute('href')
    ])
)->addStep(
    Http::get()->useInputKeyAsUrl('url')
)->addStep(
    Crawler::group()->addStep(
        Html::root()->extract([
            'title' => 'h1',
            'pubdate' => Dom::cssSelector('[pubdate="pubdate"]')->text(),
            'datetime' => Dom::cssSelector('[pubdate="pubdate"][itemprop="datePublished"]')->attribute('datetime'),
            'summary' => Dom::cssSelector('.summa')->text(),
            'content' => Dom::cssSelector('[class="the-article__content"] > div[class^="formatted-text"] > p')->text(),
            'people' => Dom::cssSelector('a[href^="/persone/"]')->text(),
        ])
    )->addStep(
        Html::metaData()->only(['og:url', 'og:image', 'article:section'])
    )->addToResult(['title', 'pubdate', 'datetime', 'summary', 'content', 'people', 'article:section', 'og:url', 'og:image'])
);

otsch commented 1 year ago

Hi again 😉,

🤔 So, one of the Http steps receives a date as its input. From the stack trace it looks like it must be the second one, because URLs found by the paginator don't go down that path. You can check all the outputs a certain step produces using the output hook, like this:

$crawler
    ->input('https://mywebsite.com')
    ->addStep(...)    // stepIndex 0
    ->addStep(...)    // stepIndex 1
    ->addStep(...)    // stepIndex 2
    ->outputHook(function (Output $output, int $stepIndex, StepInterface $step) {
        if ($stepIndex === 1) {
            var_dump($output->get());
        }
    });

The hook is called with any output that step produces. I must admit I'm a little confused, because in your code you're selecting the href attribute of a link with Dom::cssSelector('a')->attribute('href'). So this means there must be a link like this in the website's source: <a href="2023-06-24T19:29:00+02:00">...</a>. 🤔 Anyway, once you find the page where the selector gets that date, you can try to refine the selector further, so it only gets URLs.
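To make the offending value easier to spot while debugging, a check along these lines could be run on each candidate string. This is only a minimal sketch in plain PHP: `looksLikeHttpUrl()` is a hypothetical helper, not part of crwlr/crawler or crwlr/url, and the article URL in the example is made up.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not part of crwlr/crawler): returns true only for
 * absolute http(s) URLs that have a non-empty host.
 */
function looksLikeHttpUrl(string $value): bool
{
    $parts = parse_url(trim($value));

    if ($parts === false) {
        return false; // parse_url() gives up on seriously malformed input
    }

    $scheme = strtolower($parts['scheme'] ?? '');
    $host = $parts['host'] ?? '';

    return in_array($scheme, ['http', 'https'], true) && $host !== '';
}

// Example values: an assumed article URL and the date from the error message.
$candidates = [
    'https://mywebsite.com/some-article',
    '2023-06-24T19:29:00+02:00',
];

foreach ($candidates as $candidate) {
    echo $candidate . ' => ' . (looksLikeHttpUrl($candidate) ? 'URL' : 'not a URL') . PHP_EOL;
}
```

Note that relative links such as /persone/... would fail this check as well, so it is better suited to tracking down the bad value than to filtering results.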

Reading the code I saw a few things that you could probably improve:

otsch commented 1 year ago

And, maybe...if you're having a hard time finding the page where the selector gets that date as a URL, you could also use an output filter...something like:

$crawler
    // ...
    ->addStep(
        Html::getLinks('.thematic__row article header a')
            ->where(Filter::stringStartsWith('https://mywebsite.com/'))
    );

With that filter the step will just throw away any outputs not starting with https://mywebsite.com/.
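The effect of such a prefix filter can be pictured in plain PHP, independent of the library. The sketch below is an illustration only, not how the crawler implements it: `keepOutputsStartingWith()` is a hypothetical function, the sample article URLs are assumptions, and `str_starts_with()` needs PHP 8.0 or later.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical stand-in for the filter idea: keep only the strings that
 * start with the given prefix and drop everything else.
 *
 * @param string[] $outputs
 * @return string[]
 */
function keepOutputsStartingWith(array $outputs, string $prefix): array
{
    return array_values(
        array_filter(
            $outputs,
            fn (string $output): bool => str_starts_with($output, $prefix)
        )
    );
}

// Assumed sample outputs: two article URLs and the date from the error message.
$outputs = [
    'https://mywebsite.com/article-1',
    '2023-06-24T19:29:00+02:00',
    'https://mywebsite.com/article-2',
];

print_r(keepOutputsStartingWith($outputs, 'https://mywebsite.com/'));
```

The date string does not start with the prefix, so it is dropped and never reaches the next Http step.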

flanderboy commented 1 year ago

and thank you again 😉

otsch commented 1 year ago

And btw.: starring the repo, or even sponsoring me, is highly appreciated! 😉😅

flanderboy commented 1 year ago

of course...