citation-style-language / styles

Official repository for Citation Style Language (CSL) citation styles.
https://citationstyles.org/

Check Documentation Links #2042

Open zuphilip opened 8 years ago

zuphilip commented 8 years ago

A discussion about how to check documentation links started on Twitter: https://twitter.com/adam42smith/status/725749988702179329 CC @adam3smith @rmzelle @inukshuk

I am quite sure there are a lot of different ideas and solutions, but maybe we should first do some requirements engineering. What do we actually want?

adam3smith commented 8 years ago

When to check?

Which links?

Which header code?

What to do with fails?

rmzelle commented 8 years ago

We also don't want 404s to result in failing Travis builds, since such failures might be temporary (servers that are down) or affect styles other than the ones changed in PRs. Would easily end up being very confusing for contributors.

adam3smith commented 8 years ago

Right, in general we'd want CI errors and warnings to apply only to the current PR, which is why I was thinking a script might be better.

fbennett commented 8 years ago

Another possibility would be to use perma.cc:

https://perma.cc/

adam3smith commented 8 years ago

since in this case changing links may mean changing styles, I don't think perma.cc is what we want (or am I misremembering what that does?)

rmzelle commented 8 years ago

Yeah, permalinks don't seem very useful. They hide the destination in the style, and we'd have linkrot of permalink targets instead of our own links.

zuphilip commented 8 years ago

If a documentation link fails, something else may have happened as well, e.g. the publisher updated its style requirements. Thus, I guess we really want to capture these cases and then do something. What do we want to do then? If we check the documentation links of the whole repo, we might end up with hundreds of failed links. Can we update them all manually? Just deleting them doesn't seem helpful either...

adam3smith commented 8 years ago

Yes, I think we want a list of all failures and to go through them gradually, starting with 404s. Don't see an alternative. After the first pass, if we do this twice a year it should be pretty quick.

fbennett commented 8 years ago

A perma.cc page archives the page as written, a snapshot, and the original URL. It's an archival tool, so the saved page content doesn't reflect subsequent changes - but it does protect against complete loss of the style guide against which a style was prepared. It could be combined with a link-check script to detect gone-dead links.

On Apr 29, 2016 06:26, "Sebastian Karcher" notifications@github.com wrote:

since in this case, changing links may mean changing styles, so I don't think perma.cc is what we want (or am I misremembering what that does?)


zuphilip commented 8 years ago

Here is a first hack at a script (actually a one-liner) that you can run in bash:

$ grep -Poh '[^"]*(?=" rel="documentation")' *.csl | xargs curl -ILv

The output (only the first 100 URLs) is not pretty yet, and you still have to extract the information from it somehow. You can search, for example, for "404 Not Found". Is this what we want?
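For anyone who wants output that is easier to sort than the raw curl log, the same idea can be sketched in Python. This is only a minimal sketch under some assumptions: the function names `documentation_links` and `head_status` are made up for illustration, the links are read with an XML parser rather than grep, and, like `curl -L`, `urllib` follows redirects, so the code reported is that of the final response.

```python
import sys
import urllib.error
import urllib.request
import xml.etree.ElementTree as ET

# XML namespace used by CSL style files.
CSL_NS = "{http://purl.org/net/xbiblio/csl}"


def documentation_links(csl_text):
    """Return the href of every <link rel="documentation"> in one CSL style."""
    root = ET.fromstring(csl_text)
    return [
        link.get("href")
        for link in root.iter(CSL_NS + "link")
        if link.get("rel") == "documentation" and link.get("href")
    ]


def head_status(url, timeout=20):
    """Send a HEAD request; return the HTTP status code or a short error string."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx and 5xx responses arrive as exceptions; keep just the code.
        return err.code
    except OSError as err:
        # DNS failures, refused connections, timeouts, bad certificates.
        return "error: {}".format(err)


if __name__ == "__main__":
    # Prints one tab-separated line per link: status, URL, style file.
    for path in sys.argv[1:]:
        with open(path, "rb") as handle:
            for url in documentation_links(handle.read()):
                print(head_status(url), url, path, sep="\t")
```

Saved under a hypothetical name such as `check_links.py`, it could be run as `python check_links.py *.csl > report.tsv` and the report sorted by the first column.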

fbennett commented 8 years ago

If the script were to extract links and set them as anchors in an (ephemeral) local index page, you could run LinkChecker over the page to get a nicely formatted report on redirects and bad links.

https://wummel.github.io/linkchecker/
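A throwaway index page of that kind could be produced with a few lines of Python. This is a sketch only: `build_index_html` is a hypothetical helper name, and the list of URLs is assumed to have been extracted from the styles already.

```python
import html


def build_index_html(urls):
    """Build a minimal HTML page with one anchor per documentation URL."""
    items = []
    for url in urls:
        # Escape the URL for use in both the attribute and the link text.
        escaped = html.escape(url, quote=True)
        items.append('<li><a href="{0}">{0}</a></li>'.format(escaped))
    return (
        "<!DOCTYPE html>\n<html><head><meta charset=\"utf-8\">"
        "<title>CSL documentation links</title></head>\n<body><ul>\n"
        + "\n".join(items)
        + "\n</ul></body></html>\n"
    )


if __name__ == "__main__":
    # Example with a placeholder URL; write the result to a local file
    # and point a link checker at that file.
    print(build_index_html(["http://example.org/guide?x=1&y=2"]))
```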

fbennett commented 8 years ago

On second thought, linkchecker might not be at all good for this. I cobbled some code together and ran a full report against the independent styles. I'll attach the output in case it's of interest, but as you can see, the checker trips on lots of anomalies (bad certificates, mysterious server errors) that don't prevent a browser from accessing the page. linkcheck.zip

fbennett commented 8 years ago

The consistent errors from Wiley (500 Internal Server Error) are caused by their site configuration, which rejects HEAD requests. With curl and the -I option, you get the same result.

zuphilip commented 8 years ago

Here is an (incomplete) statistic from Frank's result:

I guess that we could deal with some of the technical barriers by taking a different approach. E.g. for Wiley, instead of HEAD requests we could use full GET requests, e.g. curl -v "http://onlinelibrary.wiley.com/journal/10.1111/(ISSN)1529-8817/homepage/ForAuthors.html".
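In a script, this could take the form of a HEAD request that falls back to GET when the server answers with an error. The following is a rough sketch, not a tested solution for Wiley specifically; `fetch_status` and `status_with_fallback` are hypothetical names, and treating every status of 400 or above as a reason to retry is an assumption.

```python
import urllib.error
import urllib.request


def fetch_status(url, method, timeout=20):
    """Return the HTTP status code for one request with the given method.

    Connection-level failures (DNS, timeouts) are not caught here and
    propagate to the caller.
    """
    request = urllib.request.Request(url, method=method)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code


def status_with_fallback(url, fetch=fetch_status):
    """Try HEAD first; if the server reports an error, retry once with GET."""
    status = fetch(url, "HEAD")
    if status >= 400:
        # Some servers reject HEAD but serve the page normally on GET.
        status = fetch(url, "GET")
    return status
```

The `fetch` parameter exists only so that the fallback logic can be tried out without network access, by passing in a stand-in function.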

However, the question for me is, what can we do with the result?

Let me give you a specific example: in ambio.csl we found a documentation link that returns 404: http://www.springer.com/cda/content/document/cda_downloaddocument/Instructions_for_authors_AMBIO_2013.pdf . Would we then look for an updated documentation link at Springer? Yes, a newer version of the style requirements exists: http://www.springer.com/cda/content/document/cda_downloaddocument/Instructions_for_authors_AMBIO_2015.pdf?SGWID=0-0-45-960937-p173951212 . But what can we do then? We cannot simply replace the link, because the style requirements changed. However, I guess it is also impossible to check all this documentation closely and update all the CSL styles. Or am I missing something here?

fbennett commented 8 years ago

A style could also change significantly without a change to its URL - or the URL could change but without any change to the style.

I'm not pushing legal tech solutions (really!), but these issues do bring to mind scotus-servo, a little tool built by David Zvenyach before he joined 18F. It has a narrow purpose, works only on PDFs, and I'm not sure how it does text-diffing, but to detect some change to a Supreme Court opinion, it uses an MD5 checksum (or similar - it pushes the PDF into git, and then reads back its blob hash).

Not to harp on perma.cc (really!), but if they provided a flag showing whether the doc at the live link differs from the doc at the time of archiving, it would solve half of these issues. They just received a large grant to expand their service, and might be open to suggestions for added functionality. Alternatively, you could check for changes in a script well enough and with very little effort by saving a checksum.

(Granted that this doesn't address Philip's concern about how to react to style and URL changes, though.)
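The checksum idea could look roughly like the sketch below. It assumes SHA-256 rather than MD5 or a git blob hash, keeps the known digests in a plain dictionary, and uses the hypothetical names `content_digest` and `record_or_compare`. One caveat: HTML pages with embedded timestamps or session tokens would probably be reported as changed on every run, so this is likely to work better for PDFs.

```python
import hashlib


def content_digest(data):
    """Return the SHA-256 hex digest of the downloaded document bytes."""
    return hashlib.sha256(data).hexdigest()


def record_or_compare(known, url, data):
    """Compare a fresh download against the stored digest for its URL.

    `known` maps URL -> digest and is updated in place. Returns "new" for a
    URL seen for the first time, otherwise "unchanged" or "changed".
    """
    digest = content_digest(data)
    previous = known.get(url)
    known[url] = digest
    if previous is None:
        return "new"
    return "unchanged" if previous == digest else "changed"


if __name__ == "__main__":
    digests = {}
    url = "http://example.org/guide.pdf"  # placeholder URL
    print(record_or_compare(digests, url, b"first version"))
    print(record_or_compare(digests, url, b"second version"))
```

In practice the dictionary would have to be saved between runs, for example as a JSON file.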

adam3smith commented 8 years ago

If this helps us to identify changing styles, I think that's a bonus feature, not a bug. So if Ambio has changed, what should happen is that we create an issue for that (and eventually work through it and then replace the link accordingly).

rmzelle commented 8 years ago

Regarding perma.cc, I'm not sure we could get an unlimited account. https://perma.cc/docs/faq#general says:

Anyone can sign up for a free Perma.cc account, which you can use to preserve up to 10 records per month. To preserve unlimited records, you have to be a member of an archiving organization sponsored by a registrar.

Anyway, I think that the primary function of the "documentation" URLs is to point to the relevant journal and/or style guide. If we can identify broken URLs and update them, we should, even if the style changed. It would still be an improvement.

zuphilip commented 8 years ago

Here are the 404 errors, if someone would like to start working on them: https://gist.github.com/zuphilip/58a4d391fc71d2530151eea6c8117fec , but maybe it is easier to start with the 301 redirects. IMO it can be really time-consuming to go through some 404 cases...

Looking ahead, we should think about a way to compare two versions of a style's requirements. I don't know much about perma.cc; saving a snapshot is good, but changing URLs is not what we are after. The Wayback Machine offers another way to save a copy of any page (if crawlers are allowed), and you don't have to register or anything. How about a webhook, triggered after merging/pulling commits, that calls this service for each documentation URL, i.e.

http://web.archive.org/save/{url}

? Then we can be sure that, at any later point, we will still find the documentation the style was built on in the Wayback Machine.
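The call itself would be small. The sketch below only builds the save URL from the pattern above and issues a plain GET; `wayback_save_url` and `request_snapshot` are hypothetical names, and whether the service accepts unauthenticated requests at this rate, or applies limits, is an assumption that would need checking.

```python
import urllib.request

# Prefix taken from the URL pattern quoted above.
SAVE_PREFIX = "http://web.archive.org/save/"


def wayback_save_url(url):
    """Return the address that asks the Wayback Machine to archive `url`."""
    return SAVE_PREFIX + url


def request_snapshot(url, timeout=60):
    """Request a snapshot and return the HTTP status code of the response.

    HTTP errors and connection failures are not caught and propagate
    to the caller.
    """
    with urllib.request.urlopen(wayback_save_url(url), timeout=timeout) as response:
        return response.status
```

A hook could then loop over the documentation URLs of the styles touched by a commit and call `request_snapshot` for each one.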

stale[bot] commented 5 years ago

This issue hasn't seen any activity in the past 30 days. It will be automatically closed if no further activity occurs in the next two weeks.