Fatal error: is not a valid URL

flanderboy commented 1 year ago

Hello again, I'm trying to get articles from a website, but I receive this error:

PHP Fatal error:  Uncaught Crwlr\Url\Exceptions\InvalidUrlException: 2023-06-24T19:29:00+02:00 is not a valid URL. in /composer/vendor/crwlr/url/src/Url.php:771
Stack trace:
#0 /composer/vendor/crwlr/url/src/Url.php(80): Crwlr\Url\Url->validate()
#1 //composer/vendor/crwlr/url/src/Url.php(93): Crwlr\Url\Url->__construct()
#2 /composer/vendor/crwlr/url/src/Url.php(103): Crwlr\Url\Url::parse()
#3 /composer/vendor/crwlr/crawler/src/Steps/Step.php(180): Crwlr\Url\Url::parsePsr7()
#4 /composer/vendor/crwlr/crawler/src/Steps/Loading/Http.php(237): Crwlr\Crawler\Steps\Step->validateAndSanitizeToUriInterface()
#5 /composer/vendor/crwlr/crawler/src/Steps/Step.php(45): Crwlr\Crawler\Steps\Loading\Http->validateAndSanitizeInput()
#6 /composer/vendor/crwlr/crawler/src/Crawler.php(230): Crwlr\Crawler\Steps\Step->invokeStep()
#7 /composer/vendor/crwlr/crawler/src/Crawler.php(240): Crwlr\Crawler\Crawler->invokeStepsRecursive()
#8 /composer/vendor/crwlr/crawler/src/Crawler.php(240): Crwlr\Crawler\Crawler->invokeStepsRecursive()
#9 /composer/vendor/crwlr/crawler/src/Crawler.php(240): Crwlr\Crawler\Crawler->invokeStepsRecursive()
#10 /composer/vendor/crwlr/crawler/src/Crawler.php(240): Crwlr\Crawler\Crawler->invokeStepsRecursive()
#11 /composer/vendor/crwlr/crawler/src/Crawler.php(277): Crwlr\Crawler\Crawler->invokeStepsRecursive()
#12 /composer/vendor/crwlr/crawler/src/Crawler.php(263): Crwlr\Crawler\Crawler->storeAndReturnDefinedResults()
#13 /composer/vendor/crwlr/crawler/src/Crawler.php(187): Crwlr\Crawler\Crawler->storeAndReturnResults()
#14 /script/test/qdg.php(696): Crwlr\Crawler\Crawler->run()
#15 {main}
  thrown in /composer/vendor/crwlr/url/src/Url.php on line 771

I know that 2023-06-24T19:29:00+02:00 is not a valid URL, but I don't know where it is being picked up.

Any idea how to check whether a string is a valid URL before the exception is thrown?

This is my code:

$crawler->input('https://mywebsite.com')->addStep(
    Http::get()->paginate('[class="pagination"] a', 50)
)->addStep(
    Html::each('[class="thematic__row"] article header a')->extract([
        'url' => Dom::cssSelector('a')->attribute('href')
    ])
)->addStep(
    Http::get()->useInputKeyAsUrl('url')
)->addStep(
    Crawler::group()->addStep(
        Html::root()->extract([
            'title' => 'h1',
            'pubdate' => Dom::cssSelector('[pubdate="pubdate"]')->text(),
            'datetime' => Dom::cssSelector('[pubdate="pubdate"][itemprop="datePublished"]')->attribute('datetime'),
            'summary' => Dom::cssSelector('.summa')->text(),
            'content' => Dom::cssSelector('[class="the-article__content"] > div[class^="formatted-text"] > p')->text(),
            'people' => Dom::cssSelector('a[href^="/persone/"]')->text(),
        ])
    )->addStep(
        Html::metaData()->only(['og:url', 'og:image', 'article:section'])
    )->addToResult(['title', 'pubdate', 'datetime', 'summary', 'content', 'people', 'article:section', 'og:url', 'og:image'])
);

otsch commented 1 year ago

Hi again 😉,

🤔 so, one of the Http steps receives a date as its input. From the stack trace it looks like it must be the second one, because URLs found by the paginator don't take that path. You can inspect all the outputs a certain step produces using the output hook, like this:

$crawler
    ->input('https://mywebsite.com')
    ->addStep(...)    // stepIndex 0
    ->addStep(...)    // stepIndex 1
    ->addStep(...)    // stepIndex 2
    ->outputHook(function (Output $output, int $stepIndex, StepInterface $step) {
        if ($stepIndex === 1) {
            var_dump($output->get());
        }
    });

The hook is called with every output that step produces. I must admit I'm a little confused, because in your code you're selecting the href attribute of a link with Dom::cssSelector('a')->attribute('href'). So there must be a link like this in the website's source: <a href="2023-06-24T19:29:00+02:00">...</a>. 🤔 Anyway, once you find the page where the selector picks up that date, you can refine the selector so that it only matches URLs.

Reading the code I saw a few things that you could probably improve:

CSS selectors like [class="pagination"] a can be simplified to .pagination a.
In the second step you're selecting [class="thematic__row"] article header a and then, inside each a tag matching that selector, you're looking for another a. So, a link inside that link. If that's not what you intended, and the links inside .thematic__row article header are what you want, you can do: Html::getLinks('.thematic__row article header a'). Or, if you really want the link inside a link, it's still better to use that step: Html::getLinks('.thematic__row article header a a'). You'd only use the HTML extract step if you need more data from that page than just the URLs for the next step. With the Html::getLinks() step you also don't need the call to useInputKeyAsUrl() on the following step. And another tip for when you do need more data besides the URLs: Dom::cssSelector('a')->attribute('href') gets you the exact content of the link element's href attribute. That's fine if the page uses proper absolute links there. But it's safer to use Dom::cssSelector('a')->link(), which always gives you proper absolute URLs.
As far as I can see, in the last step you're calling addToResult([...]) with a list of all the properties extracted in the grouped steps. When you want to add a step's whole output to the final crawling result, you don't need to list them; just call addToResult() without a parameter. And since the last step is the only one where you call addToResult(), you can omit it altogether: if you don't compose the crawling results explicitly, the crawler just returns the outputs of the last step as the results.
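
Put together, the suggestions above might look roughly like the sketch below. It only rearranges calls already mentioned in this thread. It assumes $crawler is the same crawler instance as in the original post and that crwlr/crawler is installed via Composer. The extract() selectors are left untouched, and the result has not been run against the actual site.

```php
<?php

use Crwlr\Crawler\Crawler;
use Crwlr\Crawler\Steps\Dom;
use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;

// Assumption: $crawler already exists, as in the original post.
$crawler
    ->input('https://mywebsite.com')
    ->addStep(Http::get()->paginate('.pagination a', 50))
    // getLinks() outputs the link URLs directly, so the next Http step
    // no longer needs useInputKeyAsUrl().
    ->addStep(Html::getLinks('.thematic__row article header a'))
    ->addStep(Http::get())
    ->addStep(
        // Last step: its outputs become the results, so addToResult() is omitted.
        Crawler::group()
            ->addStep(
                Html::root()->extract([
                    'title' => 'h1',
                    'pubdate' => Dom::cssSelector('[pubdate="pubdate"]')->text(),
                    'datetime' => Dom::cssSelector('[pubdate="pubdate"][itemprop="datePublished"]')->attribute('datetime'),
                    'summary' => Dom::cssSelector('.summa')->text(),
                    'content' => Dom::cssSelector('[class="the-article__content"] > div[class^="formatted-text"] > p')->text(),
                    'people' => Dom::cssSelector('a[href^="/persone/"]')->text(),
                ])
            )
            ->addStep(Html::metaData()->only(['og:url', 'og:image', 'article:section']))
    );
```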

otsch commented 1 year ago

And maybe, if you're having a hard time finding the page where the selector gets that date as a URL, you could also use an output filter, something like:

$crawler
    // ...
    ->addStep(
        Html::getLinks('.thematic__row article header a')
            ->where(Filter::stringStartsWith('https://mywebsite.com/'))
    );

With that filter the step will just throw away any outputs not starting with https://mywebsite.com/.
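
As for the original question of checking whether a string is a URL before it reaches the Http step: a minimal plain-PHP sketch could look like the one below. isHttpUrl() is a hypothetical helper name, not part of the crwlr library, and it deliberately accepts only absolute http(s) URLs with a host, which is an assumption about what counts as valid here.

```php
<?php

/**
 * Hypothetical helper (not part of crwlr):
 * returns true only for absolute http or https URLs that have a host.
 */
function isHttpUrl(string $value): bool
{
    $parts = parse_url($value);

    // parse_url() returns false for seriously malformed input.
    if ($parts === false) {
        return false;
    }

    $scheme = strtolower($parts['scheme'] ?? '');

    return in_array($scheme, ['http', 'https'], true) && !empty($parts['host']);
}

var_dump(isHttpUrl('2023-06-24T19:29:00+02:00'));        // bool(false)
var_dump(isHttpUrl('https://mywebsite.com/an-article')); // bool(true)
var_dump(isHttpUrl('/persone/someone'));                 // bool(false)
```

A check like this could be called from the output hook shown earlier to spot which page yields the offending value. Relative links are rejected on purpose, so it fits best after the URLs have been made absolute.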

flanderboy commented 1 year ago

and thank you again 😉

otsch commented 1 year ago

And btw.: starring the repo, or even sponsoring me is highly appreciated! 😉😅

flanderboy commented 1 year ago

of course...

crwlrsoft / crawler

Fatal error: is not a valid URL #109