Open zuphilip opened 8 years ago
When to check?
Which links?
Which header code?
What to do with fails?
We also don't want 404s to result in failing Travis builds, since such failures might be temporary (servers that are down) or affect styles other than the ones changed in PRs. Would easily end up being very confusing for contributors.
right -- in general we'd want CI errors and warnings to only apply to the current PR, which is why I was thinking script might be better.
Another possibility would be to use perma.cc:
since in this case changing links may mean changing styles, I don't think perma.cc is what we want (or am I misremembering what that does?)
Yeah, permalinks don't seem very useful. They hide the destination in the style, and we'd have linkrot of permalink targets instead of our own links.
If a documentation link fails, then something else might also have changed, e.g. they may have updated their style requirements. Thus, I guess we really want to capture these cases and then do something about them. But what do we want to do then? If we check the documentation links of the whole repo, we might end up with hundreds of failed links. Can we update them all manually? Just deleting them doesn't seem helpful either...
Yes, I think we want a list of all failures and to go through them gradually, starting with 404s. Don't see an alternative. After the first pass, if we do this twice a year it should be pretty quick.
A perma.cc page archives the page as written, a snapshot, and the original URL. It's an archival tool, so the saved page content doesn't reflect subsequent changes, but it does protect against complete loss of the style guide against which a style was prepared. It could be combined with a link-check script to detect links that have gone dead.
Here is a first hack for a script (actually a one-liner) which you can run in bash:
$ grep -Poh '[^"]*(?=" rel="documentation")' *.csl | xargs curl -ILv
It extracts every URL from the rel="documentation" links in the CSL files and sends a HEAD request to each. The output (only the first 100 URLs) is not yet pretty, and you still have to extract the information from it somehow, e.g. by searching for "404 Not Found". Is this what we want?
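Building on that one-liner, here is a slightly more structured sketch (the function names are made up; it assumes GNU grep for -P and curl). It prints one status code per URL, which is much easier to filter than curl's verbose output:

```shell
# Extract every rel="documentation" URL from the given CSL files.
doc_urls() {
  grep -Poh '[^"]*(?=" rel="documentation")' "$@"
}

# Print "STATUS URL" for each documentation link, one per line.
check_links() {
  doc_urls "$@" | while read -r url; do
    # HEAD request; -s silences progress, -o discards the headers,
    # -w prints only the status code; give up after 10 seconds
    code=$(curl -sI -o /dev/null --max-time 10 -w '%{http_code}' "$url")
    printf '%s %s\n' "$code" "$url"
  done
}

# Usage, from the styles repository root:
#   check_links *.csl | sort | grep '^404'
```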
If the script were to extract links and set them as anchors in an (ephemeral) local index page, you could run LinkChecker over the page to get a nicely formatted report on redirects and bad links.
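That index-page idea could be sketched roughly like this (the function name and file layout are assumptions; LinkChecker itself is then run separately on the generated page):

```shell
# Collect every documentation URL from the given CSL files into a
# throwaway HTML page whose anchors LinkChecker can crawl.
make_index() {
  out=$1; shift
  {
    printf '<html><body>\n'
    grep -Poh '[^"]*(?=" rel="documentation")' "$@" | sort -u |
      while read -r url; do
        printf '<a href="%s">%s</a><br/>\n' "$url" "$url"
      done
    printf '</body></html>\n'
  } > "$out"
}

# Usage:
#   make_index index.html *.csl
#   linkchecker index.html
```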
On second thought, linkchecker might not be at all good for this. I cobbled some code together and ran a full report against the independent styles. I'll attach the output in case it's of interest, but as you can see, the checker trips on lots of anomalies (bad certificates, mysterious server errors) that don't prevent a browser from accessing the page.
linkcheck.zip
The consistent errors from Wiley (500 Internal Server Error) are caused by their site configuration, which rejects HEAD requests. With curl and the -I option, you get the same result.
Here are some (incomplete) statistics from Frank's results:
I guess that we could deal with some of these technical barriers by choosing a different approach. E.g., for Wiley we could use full GET requests instead of HEAD requests, e.g. curl -v "http://onlinelibrary.wiley.com/journal/10.1111/(ISSN)1529-8817/homepage/ForAuthors.html".
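A sketch of that fallback (the function name is a placeholder): issue a full GET but discard the body, so servers that reject HEAD requests still report their real status code.

```shell
# Full GET instead of HEAD, for servers (like Wiley's) that answer
# HEAD requests with 500. -L follows redirects, -o /dev/null discards
# the page body, -w prints only the final status code.
check_get() {
  curl -sL -o /dev/null --max-time 30 -w '%{http_code}' "$1"
}

# Usage:
#   check_get "http://onlinelibrary.wiley.com/journal/10.1111/(ISSN)1529-8817/homepage/ForAuthors.html"
```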
However, the question for me is, what can we do with the result?
Let me give you a specific example: in ambio.csl
we have found a 404 documentation link http://www.springer.com/cda/content/document/cda_downloaddocument/Instructions_for_authors_AMBIO_2013.pdf . Would we then look for an updated documentation link at Springer? Yes, there is a newer version of the style requirements: http://www.springer.com/cda/content/document/cda_downloaddocument/Instructions_for_authors_AMBIO_2015.pdf?SGWID=0-0-45-960937-p173951212 . But what can we do then? We cannot simply replace the link, because the style requirements changed. However, I guess it is also impossible to check all of this documentation closely and update all the CSL styles accordingly. Or am I missing something here?
A style could also change significantly without a change to its URL - or the URL could change but without any change to the style.
I'm not pushing legal-tech solutions (really!), but these issues do bring to mind scotus-servo, a little tool built by David Zvenyach before he joined 18F. It has a narrow purpose, works only on PDFs, and I'm not sure how it does text-diffing, but to detect changes to a Supreme Court opinion it uses an MD5 checksum (or similar - it pushes the PDF into git and then reads back its blob hash).
Not to harp on perma.cc (really!), but if they provided a flag showing whether the document at the live link differs from the document at the time of archiving, it would solve half of these issues. They just received a large grant to expand their service and might be open to suggestions for added functionality. Alternatively, a script could detect changes well enough, and with very little effort, by saving a checksum of each page.
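The checksum idea could be sketched like this (the function names and the one-line-per-URL store format are assumptions, not an existing tool): fetch each page, hash it, and compare against the hash recorded on the previous run.

```shell
# Hash the current content behind a URL.
fetch_checksum() {
  curl -sL --max-time 30 "$1" | sha256sum | cut -d' ' -f1
}

# Compare against the hash stored in $2 (format: "HASH  URL" per line).
# New URLs are recorded silently; changed pages are reported.
check_changed() {
  url=$1; store=$2
  new=$(fetch_checksum "$url")
  old=$(grep -F "  $url" "$store" 2>/dev/null | cut -d' ' -f1)
  if [ -z "$old" ]; then
    printf '%s  %s\n' "$new" "$url" >> "$store"
  elif [ "$new" != "$old" ]; then
    printf 'CHANGED %s\n' "$url"
  fi
}
```

A real implementation would need exact URL matching and some content normalization: pages with dynamic elements (timestamps, ads) would otherwise look "changed" on every fetch.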
(Granted that this doesn't address Philip's concern about how to react to style and URL changes, though.)
If this helps us to identify changing styles, I think that's a bonus feature, not a bug. So if Ambio has changed, what should happen is that we create an issue for that (and eventually work through it and then replace the link accordingly).
Regarding perma.cc, I'm not sure we could get an unlimited account. https://perma.cc/docs/faq#general says:
Anyone can sign up for a free Perma.cc account, which you can use to preserve up to 10 records per month. To preserve unlimited records, you have to be a member of an archiving organization sponsored by a registrar.
Anyway, I think that the primary function of the "documentation" URLs is to point to the relevant journal and/or style guide. If we can identify broken URLs and update them, we should, even if the style changed. It would still be an improvement.
Here are the 404 errors, if someone would like to start working on them: https://gist.github.com/zuphilip/58a4d391fc71d2530151eea6c8117fec , but maybe it is easier to start with the 301 redirects. IMO it can be really time-consuming to go through some of the 404 cases...
Looking ahead, we should think about a way to compare two versions of a style's requirements. I don't know much about perma.cc; saving a snapshot is good, but changing URLs is not what we are after. The Wayback Machine offers another way to save a copy of any page (if crawlers are allowed), and you don't have to register or anything. How about a web hook that, after merging/pulling commits, calls this service for each documentation URL, i.e.
http://web.archive.org/save/{url}
? Then we can be sure that, at any later point, we will still find the documentation the style was built on at the Wayback Machine.
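A minimal sketch of such a hook, assuming the public save endpoint above (the function name is made up):

```shell
# Ask the Wayback Machine to take a snapshot of one URL. We only care
# whether the request was accepted, so the response body is discarded
# and just the status code is printed.
archive_url() {
  curl -sL -o /dev/null --max-time 60 -w '%{http_code}' \
    "http://web.archive.org/save/$1"
}

# Usage, e.g. from a post-merge script:
#   grep -Poh '[^"]*(?=" rel="documentation")' *.csl |
#     while read -r url; do
#       printf '%s %s\n' "$(archive_url "$url")" "$url"
#     done
```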
This issue hasn't seen any activity in the past 30 days. It will be automatically closed if no further activity occurs in the next two weeks.
A discussion about how to check documentation links has started on Twitter: https://twitter.com/adam42smith/status/725749988702179329 CC @adam3smith @rmzelle @inukshuk
I am quite sure there are a lot of different ideas and solutions, but maybe we should first do some requirements engineering: what do we actually want?