Closed flanderboy closed 1 year ago
Hi again,
So, one of the Http steps receives a date as an input. From the stack trace it looks like it must be the second one, because URLs found by the Paginator don't go down that path. You can check all the outputs a certain step produces using the output hook, like this:
$crawler
    ->input('https://mywebsite.com')
    ->addStep(...) // stepIndex 0
    ->addStep(...) // stepIndex 1
    ->addStep(...) // stepIndex 2
    ->outputHook(function (Output $output, int $stepIndex, StepInterface $step) {
        // Only dump the outputs of the second step (index 1).
        if ($stepIndex === 1) {
            var_dump($output->get());
        }
    });
The hook is called with every output that step produces. I must admit I'm a little confused, because in your code you're selecting the href attribute of a link with Dom::cssSelector('a')->attribute('href'). So this means there must be a link like this in the website's source: <a href="2023-06-24T19:29:00+02:00">...</a>. Anyway, when you find the page where the selector gets that date, you can try to further refine the selector, so it will only get URLs.
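If it helps while hunting for that page, here is a minimal sketch in plain PHP of a check that tells URLs apart from values like that date. It is not part of the crawler library; the function name isAbsoluteHttpUrl is made up for this example, and it assumes you only care about absolute http/https URLs with a host.

```php
<?php

// Hypothetical helper (not part of crwlr/crawler): returns true only for
// absolute URLs that have an http or https scheme and a host.
function isAbsoluteHttpUrl(string $value): bool
{
    $parts = parse_url($value);

    // parse_url() returns false for seriously malformed input.
    if ($parts === false) {
        return false;
    }

    $scheme = strtolower($parts['scheme'] ?? '');

    return in_array($scheme, ['http', 'https'], true) && !empty($parts['host']);
}

// Example values: a real-looking URL, the date from the error, a relative path.
var_dump(isAbsoluteHttpUrl('https://mywebsite.com/some-article'));
var_dump(isAbsoluteHttpUrl('2023-06-24T19:29:00+02:00'));
var_dump(isAbsoluteHttpUrl('/relative/path'));
```

Only the first value passes. You could call something like this on the dumped outputs to spot the offending value more quickly. Note that it also rejects relative links, so it is only a debugging aid, not a replacement for getting absolute URLs from the step itself.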
Reading the code, I saw a few things that you could probably improve:
The selector [class="pagination"] a can be changed to: .pagination a.
You select [class="thematic__row"] article header a and then, inside all the a tags matching that selector, you're looking for a. So, another link inside that link. If that's not what you wanted to do, and the links inside .thematic__row article header are what you want to get, you can do: Html::getLinks('.thematic__row article header a'). Or, if you really want to get the link inside a link, it's still better to use that step: Html::getLinks('.thematic__row article header a a'). You'd only use the HTML extract step if you need to get more data from that page, not only the URLs for the next step. Using the Html::getLinks() step, you then also don't need the call to useInputKeyAsUrl() on the following step.
Another tip, if you need to get more data besides the URLs: Dom::cssSelector('a')->attribute('href') gets you the exact content of the href attribute of the link element. That's fine if the page uses proper absolute links in there. But it's safer if you just do: Dom::cssSelector('a')->link(). This will in any case give you proper absolute URLs.
You call addToResult([...]) with a list of all properties extracted in the grouped steps. When you want to add the whole output of a step to the final crawling result, you don't need to list them. Just call addToResult() without a parameter. As the last step is the only one where you call the addToResult() method, you can also omit it altogether, because if you don't compose the crawling results explicitly, the crawler just returns the outputs of the last step as the results.
And, maybe, if you're having a hard time finding the page where the selector gets that date as a URL, you could also use an output filter, something like:
$crawler
    // ...
    ->addStep(
        Html::getLinks('.thematic__row article header a')
            ->where(Filter::stringStartsWith('https://mywebsite.com/'))
    );
With that filter, the step will just throw away any outputs not starting with https://mywebsite.com/.
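To illustrate the idea behind such a filter, here is a rough sketch in plain PHP. It is not how the library implements it; the function name keepUrlsStartingWith is invented for this example, and it assumes PHP 8 (for str_starts_with) and a simple list of string outputs.

```php
<?php

// Hypothetical helper, only to illustrate what a "starts with" filter does:
// keep the outputs that begin with the given prefix, drop everything else.
function keepUrlsStartingWith(array $outputs, string $prefix): array
{
    return array_values(array_filter(
        $outputs,
        fn ($output) => is_string($output) && str_starts_with($output, $prefix)
    ));
}

// Example outputs of a link step, including the date from the error.
$outputs = [
    'https://mywebsite.com/article-1',
    '2023-06-24T19:29:00+02:00',
    'https://other-site.example/page',
];

var_dump(keepUrlsStartingWith($outputs, 'https://mywebsite.com/'));
```

Only the first entry survives here, so the date never reaches the next step.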
and thank you again
And btw.: starring the repo, or even sponsoring me, is highly appreciated!
of course...
Hello again, I'm trying to get articles from a website, but I receive this error:
I know 2023-06-24T19:29:00+02:00 is not a valid URL, but I do not know where to catch it.
Any idea how to check if a string is a valid URL before catching it?
This is my code: