Automattic / team51-link-checker

Identify issues with links (400, 404, etc.) and surface linked domains like staging or development.
GNU General Public License v2.0

No results after multiple runs #9

Closed ErikSolveson closed 2 years ago

ErikSolveson commented 2 years ago

I have not been able to get any results back after running the checker on https://tim-blog-broken-link-checker.mystagingwebsite.com/

Attempt Details

  1. I tried running for 2 hours, came back, wasn't sure so I refreshed the page. (Feb 11th)
  2. I tried running for 4 hours, came back. (Feb 11th)
  3. I tried running over the weekend, at some point, my login timed out. (Feb 12-13th)

I did notice a flash of what looked like output, but instead of a number it read "unknown", and the status code was 301.

[Screenshot: Screen Shot 2022-02-14 at 8:45:25 AM]

Sorry, I couldn't take an actual screenshot of the code as it only appeared for a moment. https://tim-blog-broken-link-checker.mystagingwebsite.com/wp-admin/admin.php?page=team51-link-checker

CC @ecairol

ecairol commented 2 years ago

I suspect the process could be hung in a loop.

We could test this hypothesis by calling setMaximumCrawlCount(100), which limits the number of URLs the crawler will visit.

By default, the crawler continues until it has crawled every page of the supplied URL. If you want to limit the number of URLs the crawler should crawl, you can use the setMaximumCrawlCount method.

Crawler::create()
    ->setMaximumCrawlCount(5)
    ->startCrawling($url); // $url: the site being checked

I can take a look at this later this week, but cc @nickpagz in case he has a chance to take a look sooner.

ErikSolveson commented 2 years ago

Hi @ecairol - Just wondering if you had any update on setting the maximum pages for the crawler.

I'm not sure where to add the setMaximumCrawlCount() method, but I could take a swing if you point me in the right direction.

Thanks,

ecairol commented 2 years ago

@ErikSolveson sorry, this took longer than expected.

I've set the max crawl count to 3000 links. That means the crawler won't keep running forever: as soon as it has evaluated 3000 links, it will stop and render the report. This value is hard-coded, and it's not a final solution, but it confirms my initial guess about what was happening. The tool found 125 broken links.

While this isn't a definitive solution, it makes the plugin functional for the Tim blog. However, an actual fix still needs to be developed:

  1. The plugin should keep track of the evaluated URLs
  2. If a URL has already been evaluated, skip it

What I suspect is happening is that the crawler hangs in a loop, probably in the pagination/category links, jumping back and forth between pages. A hard stop at 3000 links is a workaround but doesn't address the root problem.
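The tracking described in steps 1 and 2 above could be sketched roughly like this; crawl_once, the normalization, and the callback are all hypothetical illustrations, not the plugin's actual code:

```php
<?php
// Hypothetical sketch of the dedupe idea: remember every URL we have
// already evaluated and skip repeats, so a pagination loop can't make
// the crawler revisit the same pages forever.

function crawl_once(array $links, callable $evaluate): array
{
    $visited = [];   // normalized URL => true
    $results = [];   // original URL => status returned by $evaluate

    foreach ($links as $url) {
        // Normalize lightly so "/page/" and "/page" count as one URL.
        $key = rtrim(strtolower($url), '/');
        if (isset($visited[$key])) {
            continue; // already evaluated: skip it
        }
        $visited[$key] = true;
        $results[$url] = $evaluate($url);
    }

    return $results;
}
```

Even if the page graph cycles, each URL is evaluated at most once, so the crawl terminates without needing a hard cap.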

Another option we could evaluate is starting the crawler from a sitemap index and not crawling any levels deeper (this is related to https://github.com/a8cteam51/team51-link-checker/issues/2)
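If we go that route, a rough sketch might look like the following, assuming the spatie/crawler API used elsewhere in this plugin; the sitemap URL is illustrative, the 3000 cap is carried over from the current workaround, and SitemapUrlParser is only available in newer spatie/crawler versions:

```php
<?php
// Sketch only: start from the sitemap index and don't follow links found
// on the crawled pages themselves. Assumes spatie/crawler; the sitemap
// URL below is a placeholder.

use Spatie\Crawler\Crawler;
use Spatie\Crawler\UrlParsers\SitemapUrlParser;

Crawler::create()
    ->setUrlParserClass(SitemapUrlParser::class) // parse <loc> entries, not <a> tags
    ->setMaximumDepth(1)          // check the URLs listed in the sitemap, go no deeper
    ->setMaximumCrawlCount(3000)  // keep the existing hard stop as a safety net
    ->startCrawling('https://tim.blog/sitemap.xml');
```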

ErikSolveson commented 2 years ago

Hi @ecairol - Thanks for the update!

I saw the link report that was generated, and I downloaded the .csv in case you want to run another test on this site.

https://docs.google.com/spreadsheets/d/1Htqh-3UXH_L7FZzwwPxGSvVbQ-nHS_gyyMKxvCbyS30/edit#gid=1939510776

However, I'm seeing about 1050 broken links here ^

The previous link checker returned around 5000.

ErikSolveson commented 2 years ago

Hi @ecairol - I had some time available this afternoon (and very few functioning brain cells), so I went through and put together a report of all of the links with false positives from the most recent run on tim.blog.

This is just for your FYI!

https://docs.google.com/spreadsheets/d/1Y4PAN-Uq-bNH1gnGPAHaZucg4mdGXEaTLJjPYqT8UCg/edit#gid=0

HTTP Status Code: 301 (465 found) - Some broken 🤷
HTTP Status Code: 302 (447 found) - Some broken 🤷
HTTP Status Code: 303 (25 found) - 0/25 broken
HTTP Status Code: 307 (3 found) - 0/3 broken
HTTP Status Code: 308 (4 found) - 0/4 broken
HTTP Status Code: 403 (12 found) - 2/12 broken
HTTP Status Code: 404 (19 found) - 19/19 404ing 😄
HTTP Status Code: 406 (1 found) - 0/1 broken
HTTP Status Code: 429 (51 found) - 0/51 broken (rate limiting for Instagram)
HTTP Status Code: 503 (10 found) - 3/10 broken
HTTP Status Code: N/A (19 found) - 10/19 broken

ErikSolveson commented 2 years ago

Hi Esteban,

I was just going through and noticed quite a few of the 301s and 302s were working just fine. We were wondering if it would be possible to follow the redirects to their destination and then not include them if the final code is 2XX.

CC @jonesch

jonesch commented 2 years ago

Filling in some more details here.

> noticed quite a few of the 301s and 302s were working just fine.

We are seeing a number of redirect chains that ultimately land on a 200. So, for example, here's the chain starting from the 301 on http://www.tim.blog:

$ curl -IL http://www.tim.blog/
HTTP/1.1 301 Moved Permanently
Server: nginx
Date: Wed, 16 Mar 2022 15:18:31 GMT
Content-Type: text/html
Content-Length: 162
Connection: keep-alive
Location: https://www.tim.blog/
X-ac: 4.dca _atomic_dca

HTTP/2 301
server: nginx
date: Wed, 16 Mar 2022 15:18:31 GMT
content-type: text/html
content-length: 162
location: https://tim.blog/
x-ac: 3.dca _atomic_dca

HTTP/2 200
server: nginx
date: Wed, 16 Mar 2022 15:18:31 GMT
content-type: text/html; charset=UTF-8
vary: Accept-Encoding
host-header: Pressable
vary: Cookie
link: <https://tim.blog/wp-json/>; rel="https://api.w.org/"
x-ac: 3.dca _atomic_dca

@ecairol - is there any way we can follow 301/302s to see where they ultimately land? Was it a 200, a 4XX, or anything other than a 3XX?

ecairol commented 2 years ago

According to this thread, this is an option that can be configured very easily on the crawler lib.

Let me give it a try and re-run it on the Tim blog.
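For reference, spatie/crawler passes client options straight through to Guzzle, so this is likely just a constructor option. A hedged sketch (the observer class and site URL below are placeholders, not the plugin's actual names):

```php
<?php
// Sketch: Crawler::create() accepts Guzzle client options, so enabling
// allow_redirects makes each request resolve its redirect chain. The
// crawl observer then sees the final response, meaning a 301 -> 301 -> 200
// chain is reported as 200 instead of 301.

use GuzzleHttp\RequestOptions;
use Spatie\Crawler\Crawler;

Crawler::create([
    RequestOptions::ALLOW_REDIRECTS => [
        'max'             => 5,    // give up after 5 hops
        'track_redirects' => true, // record the chain in X-Guzzle-Redirect-History headers
    ],
])
    ->setCrawlObserver(new Team51LinkObserver()) // placeholder observer class
    ->startCrawling('https://tim.blog/');        // placeholder site URL
```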

ecairol commented 2 years ago

@jonesch @ErikSolveson I've updated the code so the crawler follows redirects, and more than 900 records were removed from the report, meaning they were ending in a 200 response.

The number of reported 404s went from 19 to 40+, while the rest of the categories remained almost the same.

We do still get one 301, but I suspect that's because it redirects too many times; we only follow up to 5 redirects.

Please check it out and let me know your thoughts.

jonesch commented 2 years ago

Thank you, @ecairol!

@ErikSolveson - can you export a new report and compare it to what you had sent over to the partner? And see if there is anything new we should make them aware of.

ErikSolveson commented 2 years ago

Hi @jonesch & @ecairol

I went through and compared the broken link reports. There were 22 new 404s; other than that, no change.

However, the new 404s are all share=facebook or share=twitter links.

Spreadsheet for these 404s: https://docs.google.com/spreadsheets/d/1C2CRJHnos2wV9oLTsERiefgOr71rf5llWKqLfZtcWbQ/edit?usp=sharing

As these are all share links that work, I haven't updated the partner yet.

ErikSolveson commented 2 years ago

Closing issue. I'll re-open if we face similar problems again.