citation-style-language / schema

Citation Style Language schema
https://citationstyles.org/
MIT License

citations: support for archival copies of online content (concrete: web.archive.org support) #387

Open slippycheeze opened 3 years ago

slippycheeze commented 3 years ago

Is your feature request related to a problem? Please describe.

My writing is mostly in a non-academic context, so many of my citations are "regular" web pages, not formal publications. Y'all are probably familiar, but these pages have a relatively short half-life before they succumb to "link-rot", or otherwise fall off the internet.

The rate of change of page content is also higher than that of published works, which means that content cited on one date may be impossible to locate afterwards, especially as most sites publish neither a change history nor even a note that a document has been revised.

https://web.archive.org/ was created to help address this problem. It captures and archives web content in a stable way, so the content can be retrieved as it was on the capture date for approximately forever. This makes it possible to give a stable citation, with the content as of a specific date, alongside the live citation.

I'd like to have content I cite added to the archive (not your problem), and then track both the original and the archival links. If I cite the content again in a later document I'd generally like to reuse the same record, but add a new snapshot to match the "accessed on" date for the new reading.

Describe the solution you'd like

I'd like to be able to record multiple archived URIs for a citation, in addition to the primary URL. The structure I imagine is roughly as follows, though I don't know the CSL-JSON input schema well, so the field names and date formats are almost certainly very wrong.

I hope it is illustrative of what I'm thinking, but in general anything that lets me attach archive snapshot URLs in a standard way would be great.

{
  "archive-urls": [
    {
      "url": "https://web.archive.org/web/20201101/https://example.com/",
      "access-date": "2020-11-01",
      "capture-date": "2020-11-12 13:54",
      "archive-name": "web.archive.org",
      "....": "whatever other metadata per connection"
    }
  ]
}

That is: a list of archive URLs, each record having the URL to retrieve the content and the date it was captured. Additional metadata like the various "archive" fields might make sense here, too, since web.archive.org is not the only possible way to capture the content.
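To make the reuse case concrete, here is a minimal Python sketch of adding a second snapshot to an existing record rather than duplicating the whole citation. It assumes the hypothetical "archive-urls" field from the structure above, which is not part of the published CSL-JSON schema; the helper name `add_snapshot` is likewise invented for illustration. The snapshot URLs use the Wayback Machine's `/web/<timestamp>/<original URL>` form.

```python
from datetime import date


def add_snapshot(item, url, access_date, capture_date=None,
                 archive_name="web.archive.org"):
    """Append an archive snapshot to a CSL-JSON item (a plain dict).

    "archive-urls" is the hypothetical field proposed in this issue,
    not a field defined by the CSL-JSON schema.
    """
    snapshot = {
        "url": url,
        "access-date": access_date.isoformat(),
        "archive-name": archive_name,
    }
    if capture_date is not None:
        snapshot["capture-date"] = capture_date
    snapshots = item.setdefault("archive-urls", [])
    # Skip snapshots already recorded, so repeated runs do not add duplicates.
    if all(existing["url"] != url for existing in snapshots):
        snapshots.append(snapshot)
    return item


# One citation record, reused for two readings on different dates.
item = {"id": "example", "type": "webpage", "URL": "https://example.com/"}
add_snapshot(
    item,
    "https://web.archive.org/web/20201101000000/https://example.com/",
    date(2020, 11, 1),
)
add_snapshot(
    item,
    "https://web.archive.org/web/20210315000000/https://example.com/",
    date(2021, 3, 15),
)
```

The original URL stays in the standard `URL` field, so the author and publication details live in one place while the snapshot list grows.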

Describe alternatives you've considered

The existing archive field, and associated location, could hold a single archive record. That feels ... possible, but ugly. It also means that I'd need a separate citation record for each capture, which means having to deal with denormalised data if, e.g., the author name or publication details need to be updated at some point.

I could simply cite using the web.archive.org link all the time, which would work, but would mean sending people to something other than the standard source. While that is possible, it isn't the experience I'd prefer. (I'd rather add a separate link for the appropriate web.archive.org version to the citation along with the original URL, while the original still worked.)

Custom fields work fine for this, of course. This request is about getting the concept standardised so that I'm not the only person in the universe who has tools that touch this stuff. Ideally it'll even end up in some of the standard CSL citation rules, so that stable archival versions of these citations become more common. (Even for more formally published papers, or their freely available versions on author websites, which have a much higher tendency to decay than places like arXiv.)

Additional context

I'm not expecting CSL to deal with any part of the archiving process, just to record data about it.

That said, a workflow that takes CSL-JSON as input, creates the web.archive.org snapshot, and emits CSL-JSON with the archive URL added is a very reasonable part of a pandoc-based publication pipeline. In my case that is markdown => pandoc => manubot pandoc filter => web.archive.org snapshot filter => pandoc output. Both filters use look-aside caching in CSL-JSON files to maintain their records.
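A rough Python sketch of what the snapshot step in such a pipeline could look like. This is an illustration under stated assumptions, not the actual filter: `capture` is a hypothetical callable standing in for the real archiving request, the in-memory `cache` dict stands in for the look-aside cache file, and "archive-urls" is the field proposed in this issue rather than anything in the CSL-JSON schema.

```python
import json
from datetime import date


def wayback_url(original_url, timestamp):
    # Wayback Machine replay URLs have the form /web/<timestamp>/<original URL>.
    return f"https://web.archive.org/web/{timestamp}/{original_url}"


def annotate(items, capture, cache, today):
    """Add one snapshot per item for `today`, reusing cached captures.

    `capture` is a hypothetical callable (original URL -> timestamp string).
    `cache` maps (original URL, ISO date) to a snapshot URL.
    """
    for item in items:
        original = item.get("URL")
        if not original:
            continue  # nothing to archive for this record
        key = (original, today.isoformat())
        if key not in cache:
            cache[key] = wayback_url(original, capture(original))
        snapshots = item.setdefault("archive-urls", [])
        if all(s["url"] != cache[key] for s in snapshots):
            snapshots.append(
                {"url": cache[key], "access-date": today.isoformat()}
            )
    return items


def fake_capture(url):
    # Stand-in for a real capture request; always returns a fixed timestamp.
    return "20201112135400"


items = [{"id": "example", "type": "webpage", "URL": "https://example.com/"}]
cache = {}
annotate(items, fake_capture, cache, date(2020, 11, 12))
print(json.dumps(items, indent=2))
```

Passing the capture step in as a function keeps the record-keeping separate from the network call, which matches the request here: CSL only needs to record the result, not perform the archiving.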

bwiernik commented 3 years ago

It's planned to have a new identifiers feature in CSL 1.1 or 1.2. I think archive_url could be one of those.