Mysterious 404 on a few sites

JoshOrndorff commented 4 years ago

I use linkcheck for the Substrate Recipes. Thank you for the excellent backend.

So far I've encountered two links that regularly cause the link checker to fail, despite loading fine in a normal web browser. You can see the more recent occurrence in this PR: https://github.com/substrate-developer-hub/recipes/pull/180. You can also see that I've worked around the issue by adding the URL to my exclude list.

Ultimately I'd prefer to properly diagnose the failure rather than excluding them.

Michael-F-Bryan commented 4 years ago

I don't think this is specific to the linkchecker. Running curl against the num-traits crate returns a 404 for the same URL.

$ curl -I https://crates.io/crates/num-traits
HTTP/2 404
content-type: application/json; charset=utf-8
content-length: 35
server: nginx
date: Tue, 24 Mar 2020 02:45:06 GMT
set-cookie: cargo_session=sJIiNcfM9yvCHoGNENQaO8JrPoTF1c7xuZ6xe/LTieY=; HttpOnly; Secure; Path=/
strict-transport-security: max-age=31536000
via: 1.1 vegur, 1.1 6e19875b14d906dfd0ef8e65e8726f1d.cloudfront.net (CloudFront)
x-cache: Error from cloudfront
x-amz-cf-pop: PER50-C1
x-amz-cf-id: yBCN032584y1tHHrOzh9Er41QMS01bZ4OZ1IeCBJHpjwwlyH7Y2n9A==
age: 63

I have a feeling this is because crates.io is built using a JavaScript framework like Ember or React. When you open it in your browser, it'll fall back to / and then the JS router will change the URL to /crates/num-traits. The linkchecker essentially calls reqwest::get(), so we don't run any JS and only ever see the raw 404 response.

This is probably related to rust-lang/crates.io#788 (see https://github.com/rust-lang/rustc-dev-guide/pull/184#issuecomment-421537610).

mark-i-m commented 4 years ago

Yes, this is true for any crates.io URL. We have explicitly blacklisted URLs to crates.io in the rustc-dev-guide.
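
For anyone wondering what such an exclusion amounts to, here is a minimal sketch in plain Rust. It is not how mdbook-linkcheck or the rustc-dev-guide implement it; `host_of` and `is_excluded` are hypothetical helpers made up for illustration, and the host parsing is deliberately simplified (no IPv6 literals, http/https only).

```rust
/// Extract the host part of an http(s) URL using only the standard library.
/// Hypothetical helper; simplified and does not handle IPv6 literals.
fn host_of(url: &str) -> Option<&str> {
    let rest = url
        .strip_prefix("https://")
        .or_else(|| url.strip_prefix("http://"))?;
    // The authority ends at the first '/', '?' or '#'.
    let authority = rest
        .split(|c: char| c == '/' || c == '?' || c == '#')
        .next()?;
    // Drop optional userinfo ("user@") and port (":443").
    let host = authority.rsplit('@').next()?;
    let host = host.split(':').next()?;
    if host.is_empty() {
        None
    } else {
        Some(host)
    }
}

/// True when the URL's host equals an excluded host or is a subdomain of it.
/// Hypothetical helper; a real checker may match on regexes instead.
fn is_excluded(url: &str, excluded_hosts: &[&str]) -> bool {
    match host_of(url) {
        Some(host) => excluded_hosts
            .iter()
            .any(|ex| host == *ex || host.ends_with(&format!(".{}", ex))),
        None => false,
    }
}
```

With `excluded_hosts = ["crates.io"]`, a link such as https://crates.io/crates/num-traits would be skipped, while docs.rs links would still be checked.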

JoshOrndorff commented 4 years ago

Okay, guess not much to do here then. Thanks for the explanation.

dogweather commented 10 months ago

I found a workaround for link-checking crates.io URLs. Check docs.rs instead:

Instead of

curl --head https://crates.io/crates/num-complex

Do:

curl --head https://docs.rs/num-complex/latest/num_complex/
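
If you have many such links, the rewrite could be automated. Below is a rough sketch, not part of any existing tool: `docs_rs_url` is a hypothetical helper, and it assumes the docs.rs path follows the pattern in the example above, with the crate's module name being the crate name with `-` replaced by `_`. That holds for num-complex but may not hold for every crate.

```rust
/// Map a crates.io crate page URL to a docs.rs "latest" URL.
/// Returns None when the input is not a crates.io crate URL.
/// Hypothetical helper for illustration only.
fn docs_rs_url(crates_io_url: &str) -> Option<String> {
    let rest = crates_io_url.strip_prefix("https://crates.io/crates/")?;
    // Keep only the crate name: drop any trailing path, query, or fragment.
    let name = rest
        .split(|c: char| c == '/' || c == '?' || c == '#')
        .next()?;
    if name.is_empty() {
        return None;
    }
    // Assumption: the top-level module is the crate name with '-' -> '_'.
    let module = name.replace('-', "_");
    Some(format!("https://docs.rs/{}/latest/{}/", name, module))
}
```

Calling `docs_rs_url("https://crates.io/crates/num-complex")` yields the docs.rs URL from the curl command above, which can then be checked in place of the crates.io link.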

dlaehnemann commented 6 months ago

I had the same problem with a couple of domains / websites and found a different GitHub Action that works for me for link checking: linkspector

It seems to do the checks by mocking up some kind of credible browser session, and with that, all the websites I currently have in there give a proper response. It also checks internal Markdown links correctly and offers to check links in other formats (like reStructuredText).

For the maintainer: maybe there are good ideas in there? Or maybe it also solves your needs in a more general way? In any case, many thanks for your efforts on this linkchecker, it was very useful!

Michael-F-Bryan / mdbook-linkcheck

Mysterious 404 on a few sites #30