filiph / linkcheck

Fast link checker
https://pub.dartlang.org/packages/linkcheck
MIT License

support link whitelisting #2

Closed chalin closed 7 years ago

chalin commented 7 years ago

One example of where this would be useful is running the checker over https://webdev.dart-lang.org. We do not yet have an Angular guide for the Router, but some Angular pages already link to the (soon to be created) Router page. It would be great if we could whitelist links to the router page.

As an example, broken-link-checker has an excludedKeywords option. We use it like this for angular.io (note the value of the exclude array variable):

gulp.task('link-checker', () => {
  var method = 'get'; // the default 'head' fails for some sites
  var exclude = [
    // Dart API docs aren't working yet; ignore them
    '*/dart/latest/api/*',
    // Somehow the link checker sees ng1 {{...}} in the resource page; ignore it
    'resources/%7B%7Bresource.url%7D%7D',
    // API docs have links directly into GitHub repo sources; these can
    // quickly become invalid, so ignore them for now:
    '*/angular/tree/*',
    // harp.json "bios" for "Ryan Schmukler", URL isn't valid:
    'http://slingingcode.com'
  ];
  var blcOptions = { requestMethod: method, excludedKeywords: exclude };
  return linkChecker({ blcOptions: blcOptions });
});

cc @kwalrath @kevmoo

chalin commented 7 years ago

Another example:

External link http://caniuse.com/#feat=shadowdom failed: http://caniuse.com/#feat=shadowdom exists, but the hash 'feat=shadowdom' does not

The link is valid, but the checker cannot make sense of this particular use of an anchor/fragment, so it is likely a good candidate for whitelisting.

filiph commented 7 years ago

Will you be invoking linkcheck from the command line (like in a shell script)? In that case, how would you prefer to give the excluded regexps? As a separate text file?

linkcheck :4001 -x exclude.txt

Does that seem reasonable? The other option is to provide the patterns inline on the command line, but that makes the invocation ugly and brittle.

If this configuration-by-file is okay with you, how would you prefer the exclude.txt file to look? Regexp per line, no comments? YAML? For example, have you ever wanted to have more structure in the exclude = [ ... ] option? Can you imagine needing something more than lines?

Also, does it need to be RegExp or should we use glob to make the writing of that file a bit easier?

Last but not least, should this feature be called whitelist or exclude or something else? Whitelist seems confusing to me, but so can exclude, I guess.
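To make the glob-versus-RegExp question more concrete, here is a rough sketch of how a one-pattern-per-line exclude file could be read. It is written in JavaScript only to match the gulp snippet above and is not linkcheck's implementation. The function names (globToRegExp, parseExcludeFile, isExcluded), the `#` comment syntax, and the rule that `*` is the only wildcard are all assumptions made for illustration.

```javascript
// Hypothetical helper: convert a simple glob (only `*` is special) to a RegExp.
function globToRegExp(glob) {
  // Escape every RegExp metacharacter except `*`, then map `*` to `.*`.
  const escaped = glob.replace(/[.+?^${}()|[\]\\]/g, '\\$&');
  return new RegExp('^' + escaped.replace(/\*/g, '.*') + '$');
}

// Hypothetical file format: one glob per line; blank lines and lines
// starting with `#` are ignored (the comment syntax is an assumption).
function parseExcludeFile(text) {
  return text
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line !== '' && !line.startsWith('#'))
    .map(globToRegExp);
}

// A URL is excluded if any pattern matches it in full.
function isExcluded(url, patterns) {
  return patterns.some((pattern) => pattern.test(url));
}

// Example input reusing two entries from the gulp task above.
const sample = [
  '# links into GitHub sources go stale quickly',
  '*/angular/tree/*',
  '',
  'http://slingingcode.com',
].join('\n');

const patterns = parseExcludeFile(sample);
console.log(isExcluded('https://github.com/angular/angular/tree/master/x', patterns));
console.log(isExcluded('https://angular.io/docs', patterns));
```

With globs, the author of the file does not have to escape the dots and slashes in URLs, which is the main convenience over raw regular expressions. The cost is that anything beyond `*` wildcards cannot be expressed.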

chalin commented 7 years ago

This is an example of linkcheck output that actually shows an error, despite the link being valid:

- http://localhost:4001/tools/dart2js
  *  External link https://developer.apple.com/library/safari/documentation/AppleApplications/Conceptual/Safari_Developer_Guide/Debugger/Debugger.html#//apple_ref/doc/uid/TP40007874-CH5-SW1 failed: response code 0 means something's wrong.
             It's possible libcurl couldn't connect to the server or perhaps the request timed out.
             Sometimes, making too many requests at once also breaks things.
             Either way, the return message (if any) from the server is: SSL connect error

filiph commented 7 years ago

I'm confused. Is this output from linkcheck? Or is it just an example of something you'd like to exclude?

chalin commented 7 years ago

This is output from linkcheck (I updated the comment to clarify that).

chalin commented 7 years ago

You ask valid questions. Here are some initial thoughts:

filiph commented 7 years ago

I will assume you want to (A) exclude the links as they are stated in href. The other approach (B) would be to exclude links by their final URL (after redirects). That would mean trying all links by default, just in case they end up being redirected to a non-skipped URL.

I'm implementing (A). Stop me if you'd prefer (B).
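As a minimal sketch of what (A) means in practice: the decision is made from the href text alone, before any request is sent. The snippet is JavaScript for illustration only and does not reflect how linkcheck is written. The partitionLinks name and the localhost router URL are hypothetical.

```javascript
// Approach (A): classify links by their href before making any request.
// `excludePatterns` is an array of RegExp objects without the `g` flag
// (a global RegExp keeps state between test() calls).
function partitionLinks(hrefs, excludePatterns) {
  const toCheck = [];
  const skipped = [];
  for (const href of hrefs) {
    const excluded = excludePatterns.some((re) => re.test(href));
    (excluded ? skipped : toCheck).push(href);
  }
  return { toCheck, skipped };
}

// Hypothetical input: one link from this thread, one made-up router page,
// and one link that should still be checked.
const hrefs = [
  'http://caniuse.com/#feat=shadowdom',
  'http://localhost:4001/angular/guide/router',
  'http://localhost:4001/tools/dart2js',
];
const result = partitionLinks(hrefs, [/caniuse\.com/, /\/guide\/router$/]);
console.log(result.toCheck);
console.log(result.skipped);
```

Under (B), the skipped list could not be computed up front: every link would have to be requested and its redirects followed before the final URL could be compared against the patterns.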

filiph commented 7 years ago

OK, done. Please see this section of the readme, and let me know whether this works for you.

filiph commented 7 years ago

I should add: run pub global activate linkcheck to get the newest version.

chalin commented 7 years ago

Very nice! It seems to be working like a charm!