I suspect the process could be hanging in a loop:
We could test this hypothesis by calling setMaximumCrawlCount(100), which limits the number of URLs that will be crawled.
By default, the crawler continues until it has crawled every page of the supplied URL. If you want to limit the amount of urls the crawler should crawl you can use the setMaximumCrawlCount method.
Crawler::create()
->setMaximumCrawlCount(5)
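For context, here's roughly how that would slot into the crawler setup (the observer class and start URL below are placeholders, not the plugin's actual wiring):

```php
use Spatie\Crawler\Crawler;

// Cap the crawl so it can't run indefinitely while we test the hypothesis.
// LinkCheckObserver is a placeholder for whatever observer the plugin registers.
Crawler::create()
    ->setMaximumCrawlCount( 100 )                 // stop after 100 URLs have been crawled
    ->setCrawlObserver( new LinkCheckObserver() )
    ->startCrawling( 'https://tim.blog' );
```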
I can take a look at this later this week, but cc @nickpagz in case he has a chance to take a look sooner.
Hi @ecairol - Just wondering if you had any update on setting the maximum pages for the crawler.
I'm not sure where to add the setMaximumCrawlCount() method, but I could take a swing at it if you point me in the right direction.
Thanks,
@ErikSolveson sorry, this took longer than expected.
I've set the max crawl count to 3,000 links. That means the crawler won't keep running forever; as soon as it has evaluated 3,000 links, it stops and renders the report. This value is hard-coded, and it's not a final solution, but it confirms my initial guess about what was happening. The tool found 125 broken links.
While it's not a definitive solution, it makes the plugin functional for Tim's blog; however, an actual fix still needs to be developed:
What I suspect is happening is that the crawler gets stuck in a loop, probably in the pagination/category links, jumping from one link back to the previous one. A hard stop at 3,000 links is a workaround but doesn't address the root problem.
Another option we could evaluate is to start the crawler from a sitemap index and not go any levels deep (this is related to https://github.com/a8cteam51/team51-link-checker/issues/2)
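For illustration, a rough sketch of that sitemap approach (the function name and the use of the WordPress HTTP API here are assumptions, not plugin code):

```php
// Rough sketch: read URLs straight from a <urlset> sitemap and check each one
// directly, instead of discovering links by crawling. A sitemap *index* would
// need one extra loop over its child sitemaps. Function name is hypothetical.
function team51_check_sitemap_urls( string $sitemap_url ): array {
    $results = array();

    $body = wp_remote_retrieve_body( wp_remote_get( $sitemap_url ) );
    $xml  = simplexml_load_string( $body );

    if ( false === $xml ) {
        return $results; // not valid XML, bail out
    }

    foreach ( $xml->url as $entry ) {
        $url      = (string) $entry->loc;
        $response = wp_remote_get( $url, array( 'redirection' => 5 ) );

        $results[ $url ] = is_wp_error( $response )
            ? $response->get_error_message()
            : wp_remote_retrieve_response_code( $response );
    }

    return $results;
}
```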
Hi @ecairol - Thanks for the update!
I see the link report that was generated, and I downloaded the .csv in case you want to run another test using this site.
However, I'm seeing about 1050 broken links here ^
The previous link checker returned around 5000.
Hi @ecairol - I had some time available this afternoon (and very few functioning brain cells), so I went through and put together a report of all of the links with false positives from the most recent run on tim.blog.
This is just for your FYI!
https://docs.google.com/spreadsheets/d/1Y4PAN-Uq-bNH1gnGPAHaZucg4mdGXEaTLJjPYqT8UCg/edit#gid=0
HTTP Status Code: 301 (465 found) - Some broken 🤷
HTTP Status Code: 302 (447 found) - Some broken 🤷
HTTP Status Code: 303 (25 found) - 0/25 broken
HTTP Status Code: 307 (3 found) - 0/3 broken
HTTP Status Code: 308 (4 found) - 0/4 broken
HTTP Status Code: 403 (12 found) - 2/12 broken
HTTP Status Code: 404 (19 found) - 19/19 404ing 😄
HTTP Status Code: 406 (1 found) - 0/1 broken
HTTP Status Code: 429 (51 found) - 0/51 broken (rate limiting for Instagram)
HTTP Status Code: 503 (10 found) - 3/10 broken
HTTP Status Code: N/A (19 found) - 10/19 broken
Hi Esteban,
I was just going through and noticed quite a few of the 301s and 302s were working just fine. We were wondering if it would be possible to follow the redirects to their destination and then not include them if the final code is 2XX.
CC @jonesch
Filling in some more details here.
noticed quite a few of the 301s and 302s were working just fine.
We are seeing a number of redirect chains that ultimately land on a 200. So, for example, in a case where we have a 301 on http://www.tim.blog:
christopherjones@MacBook-Pro-2 ~/Sites $ curl -IL http://www.tim.blog/
HTTP/1.1 301 Moved Permanently
Server: nginx
Date: Wed, 16 Mar 2022 15:18:31 GMT
Content-Type: text/html
Content-Length: 162
Connection: keep-alive
Location: https://www.tim.blog/
X-ac: 4.dca _atomic_dca
HTTP/2 301
server: nginx
date: Wed, 16 Mar 2022 15:18:31 GMT
content-type: text/html
content-length: 162
location: https://tim.blog/
x-ac: 3.dca _atomic_dca
HTTP/2 200
server: nginx
date: Wed, 16 Mar 2022 15:18:31 GMT
content-type: text/html; charset=UTF-8
vary: Accept-Encoding
host-header: Pressable
vary: Cookie
link: <https://tim.blog/wp-json/>; rel="https://api.w.org/"
x-ac: 3.dca _atomic_dca
@ecairol - is there any way that we can follow 301/302s to see where they ultimately land? Was it a 200, a 400, or anything other than a 3XX?
According to this thread here, this is an option that can be configured in the crawler library very easily.
Let me give it a try and re-run it on Tim's blog.
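Something along these lines, assuming we pass Guzzle's redirect options straight through Crawler::create() (the exact values here are illustrative, not the final config):

```php
use GuzzleHttp\RequestOptions;
use Spatie\Crawler\Crawler;

// Follow 301/302 chains to their final response instead of reporting the first hop.
// spatie/crawler forwards these client options to Guzzle.
Crawler::create( [
    RequestOptions::ALLOW_REDIRECTS => [
        'max'             => 5,    // give up after 5 hops
        'track_redirects' => true, // keep the redirect history on the response
    ],
] )
    ->setMaximumCrawlCount( 3000 )
    ->startCrawling( 'https://tim.blog' ); // observer setup omitted for brevity
```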
@jonesch @ErikSolveson I've updated the code so the crawler follows redirects, and more than 900 records have now been removed from the report, meaning they were ending in a 200 response.
The number of reported 404s went from 19 to 40+, while the rest of the categories remained almost the same.
We still get one 301, but I suspect that's because it redirects too many times, and we only follow up to 5 redirects.
Please check it out and let me know your thoughts.
Thank you, @ecairol!
@ErikSolveson - can you export a new report and compare it to what you had sent over to the partner? And see if there is anything new we should make them aware of.
Hi @jonesch & @ecairol
I went through and compared the broken link reports. There were 22 new 404s; other than that, no change.
However, the 404s are all share=facebook or share=twitter links.
Spreadsheet for these 404s: https://docs.google.com/spreadsheets/d/1C2CRJHnos2wV9oLTsERiefgOr71rf5llWKqLfZtcWbQ/edit?usp=sharing
As these are all share links that work, I haven't updated the partner yet.
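If we wanted the report to skip those automatically, a filter along these lines could drop them before rendering (the helper name and row shape are made up, not what the plugin currently does):

```php
// Hypothetical helper: drop Jetpack-style share links (?share=facebook,
// ?share=twitter) from the broken-link rows, since they work in a browser
// but 404 for the crawler. Assumes each row is array( 'url' => ..., 'status' => ... ).
function team51_filter_share_links( array $rows ): array {
    return array_values( array_filter( $rows, function ( array $row ): bool {
        return false === strpos( $row['url'], 'share=facebook' )
            && false === strpos( $row['url'], 'share=twitter' );
    } ) );
}
```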
Closing issue. I'll re-open if we face similar problems again.
I have not been able to get any results back after running the checker on https://tim-blog-broken-link-checker.mystagingwebsite.com/
Attempt Details
I did notice a flash of what looked like an output, but instead of a number it read "unknown" and the status code was 301.
Sorry, I couldn't take an actual screenshot of the code as it only appeared for a moment. https://tim-blog-broken-link-checker.mystagingwebsite.com/wp-admin/admin.php?page=team51-link-checker
CC @ecairol