filiph / linkcheck

Fast link checker
https://pub.dartlang.org/packages/linkcheck
MIT License

Servers often behave very differently than filesystems -- which and how? #58

Closed: untitaker closed this issue 3 years ago

untitaker commented 3 years ago

In your README you state:

Servers often behave very differently than file systems, so validating links on the file system often leads to both false positives and false negatives.

Surely linkchecker could make the same assumptions that static site generators already need to make to produce correct output? If that's not true, or if it would need to make more assumptions than that, could you elaborate on which changes in behavior you saw that make this impossible?

filiph commented 3 years ago

Hi, good question. I should write this up in the readme.

The short answer is that servers have configuration that is invisible to source-code / file-system validators. I've had way too many false positives and false negatives with file-system-based validators.

For example, a link that looks fine on the file system can break on the server because a rewrite or redirect rule is in effect, or because the server is configured to require trailing slashes. Of course, in theory, you can try reading that configuration and recreating all the rewrites, redirects and special cases in your link validator. But then you're basically implementing a non-trivial subset of a server. And also, if there's more than one server you want to support (GitHub Pages, nginx, Firebase Hosting, AWS), you have to implement multiple servers.
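To make the mismatch concrete, here is a minimal Python sketch, not taken from linkcheck itself. `fs_check` is a hypothetical file-system validator using the usual URL-to-file mapping, and `server_check` is a toy model of a server whose configuration (the made-up `REDIRECTS` table and `REQUIRE_TRAILING_SLASH` flag) is invisible to that validator. Real servers have far more rules than this.

```python
import tempfile
from pathlib import Path

# Hypothetical server configuration; a file-system validator never sees it.
REDIRECTS = {"/old-docs/": "/docs/"}
REQUIRE_TRAILING_SLASH = True


def fs_check(site_root: Path, url_path: str) -> bool:
    """Naive validator: a link is fine if a matching file exists,
    or a matching directory that contains index.html."""
    target = site_root / url_path.lstrip("/")
    if target.is_dir():
        return (target / "index.html").is_file()
    return target.is_file()


def server_check(site_root: Path, url_path: str) -> bool:
    """Toy server model: applies redirects first, then a strict
    trailing-slash rule for directories, then looks at the files."""
    url_path = REDIRECTS.get(url_path, url_path)
    target = site_root / url_path.lstrip("/")
    if target.is_dir():
        if REQUIRE_TRAILING_SLASH and not url_path.endswith("/"):
            return False  # this toy server answers 404 instead of redirecting
        return (target / "index.html").is_file()
    return target.is_file()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "docs").mkdir()
        (root / "docs" / "index.html").write_text("<h1>Docs</h1>")
        for link in ["/docs/", "/docs", "/old-docs/"]:
            print(link, "fs:", fs_check(root, link), "server:", server_check(root, link))
```

In this sketch, `/docs` passes the file-system check but fails on the modeled server, while `/old-docs/` fails the file-system check even though the server would redirect it to a working page. Those are the two kinds of wrong answer described above.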

For that reason, linkcheck just goes with the "run your localhost server and check that" approach. It's fast despite this. (Of course, a file system checker could be much, much faster than that. But again, it's not safe, at least for my use cases.)
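As a rough illustration of the "run a localhost server and check that" idea, the Python sketch below serves a directory with the standard library's `http.server` and asks for each link over HTTP. It is not how linkcheck is implemented; the helper names `serve_directory`, `link_status` and `check_site` are invented for this example, and in practice you would point the checker at the same server software you deploy with.

```python
import functools
import http.server
import threading
import urllib.error
import urllib.request
from pathlib import Path


class QuietHandler(http.server.SimpleHTTPRequestHandler):
    """Static file handler that does not log every request to stderr."""

    def log_message(self, format, *args):
        pass


def serve_directory(root: Path) -> http.server.ThreadingHTTPServer:
    """Serve `root` on 127.0.0.1 on a free port, in a background thread."""
    handler = functools.partial(QuietHandler, directory=str(root))
    server = http.server.ThreadingHTTPServer(("127.0.0.1", 0), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def link_status(base_url: str, path: str) -> int:
    """Return the final HTTP status code for base_url + path."""
    # An empty ProxyHandler keeps environment proxy settings out of the way.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(base_url + path, timeout=5) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code


def check_site(root: Path, paths: list[str]) -> dict[str, int]:
    """Start a server for `root`, check every path, then shut it down."""
    server = serve_directory(root)
    try:
        base_url = f"http://127.0.0.1:{server.server_port}"
        return {path: link_status(base_url, path) for path in paths}
    finally:
        server.shutdown()
        server.server_close()
```

For example, `check_site(Path("build"), ["/", "/about/", "/missing.html"])` returns a status code per path, where `build` stands in for whatever directory your site generator writes to. Whatever rewrites or redirects the server applies are exercised automatically, because the answers come from the server rather than from the file layout.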

Anyway, hope this helps. In short: checking links on the file system is not impossible, and in fact, some checkers work that way. But it does have downsides.

untitaker commented 3 years ago

And also, if there's more than one server you want to support (GitHub Pages, nginx, Firebase Hosting, AWS), you have to implement multiple servers.

The problem I see is that if there's some special behavior to your static hoster, you'd need to replicate that for local development anyway. So for example, if GitHub Pages decided to deviate from the de facto standard of URL-to-file-system mapping, e.g. by enforcing trailing slashes (without offering redirects), it would mean two things:

This seems like a significant burden to adoption of GitHub Pages. Therefore it must be treated as a bug in GitHub Pages, no? Like, even if one argues that requiring trailing slashes for directories is more correct behavior, fixing all other tools to work that way just can't happen.

Not trying to argue that your experience wasn't that way, just trying to figure out what exact incompatibilities you found, and particularly whether they were caused by the end user installing redirects or rather by upstream incompatibilities between static site generators and static site hosters. I'm rather interested in this because I'm implementing my own link checker and I want to know what's coming :) I know it's a question that takes a lot of time and archive digging to answer, so feel free to not respond.

Also asked @raviqqe the same question in raviqqe/liche#50.