dkg opened this issue 3 years ago
I assume (naively hope?) that the behavior and API of `json.dumps` will stay the same between Python versions in the short term, but I concur that this is a problem, especially since the URL changes whenever the entry is changed at all.
Having a UID of some sort for each entry in the compendium would definitely be ideal. As a practical matter it's a bit tricky, since it would mean enforcing guidelines about what should be in the content of a compendium entry, which from past experience has been hard to achieve: the entry curation team has rotated a few times, and Zotero doesn't provide any easy way to enforce those guidelines. We've been getting much closer to implementing guidelines and ensuring that everyone abides by them, but it's still a WIP. :slightly_smiling_face:
I'll need to mull this over a bit more. I think it will probably be sufficient if we can at least guarantee that the URL doesn't change very often, perhaps by limiting the hash to fields that shouldn't change unless an error was made when the entry was added to the compendium, such as the title, author, and publication date.
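For what it's worth, a minimal sketch of what that restriction might look like; the field names and digest length here are assumptions, not what the project currently uses:

```python
import hashlib
import json

# Hypothetical sketch: only the fields that shouldn't change after an entry
# is first added feed into the digest (the exact field names are assumptions).
STABLE_FIELDS = ("title", "author", "year")

def stable_digest(entry: dict) -> str:
    subset = {field: entry.get(field, "") for field in STABLE_FIELDS}
    payload = json.dumps(subset, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:8]
```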
If you want the output of `json.dumps` to stay the same, the slug generation should at least be supplying `sort_keys=True` before serializing the dict into a digest: dict serialization is the most likely place where trivial non-reproducibility could slip in (see the reproducible-builds docs on stable outputs and stable inputs for more thinking about this kind of problem).
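A tiny illustration of the failure mode, in plain Python rather than the project's actual slug code:

```python
import json

a = {"title": "Example", "author": "Doe"}
b = {"author": "Doe", "title": "Example"}

# Without sort_keys, json.dumps follows insertion order, so two semantically
# identical dicts can serialize (and therefore hash) differently.
assert json.dumps(a) != json.dumps(b)

# With sort_keys=True the serialization is stable regardless of key order.
assert json.dumps(a, sort_keys=True) == json.dumps(b, sort_keys=True)
```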
The situation where you need to fix a bug in the title, author, or publication date is maybe the most important time when you want a URL that doesn't change. If someone has referenced a given article through its link in the encryption compendium, you want that link to keep working after the correction; otherwise the article (and the correction!) is lost.
You could try to solve this problem by adding aliases to the database, so that when you update the relevant text of an entry, you just store an alias of what its slug+digest used to be, and produce symlinks (or HTTP redirects) for each alias. But the problem there ends up looking pretty similar to the approach of storing a shortname for each article -- maybe worse! In particular, you still probably want to do a consistency/uniqueness check on the stored set of aliases plus the generated slug+digests.
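A rough sketch of the bookkeeping that the alias approach implies (the data structures and names here are hypothetical, not from the project):

```python
def check_alias_consistency(current_slugs: dict, aliases: dict) -> list:
    """current_slugs maps entry_id -> slug; aliases maps old_slug -> entry_id.

    Hypothetical sketch: returns a list of human-readable error strings.
    """
    errors = []
    owner_of = {}
    for entry_id, slug in current_slugs.items():
        if slug in owner_of:
            errors.append(f"slug collision: {slug!r} used by {owner_of[slug]} and {entry_id}")
        owner_of[slug] = entry_id
    for old_slug, entry_id in aliases.items():
        current_owner = owner_of.get(old_slug)
        if current_owner is not None and current_owner != entry_id:
            errors.append(f"alias {old_slug!r} shadows the current slug of {current_owner}")
    return errors
```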
I hear your frustration about the enforcement capabilities that Zotero doesn't offer! I don't really understand the project policy (or mechanism) for how curation responsibilities are shared or handed off, and I don't know how frequently you imagine updating the site with a new entry once the infrastructure is in place and the software has settled down. But it looks like you'll need a consistency check before publication anyway -- at least to ensure that the generated slug+digest entries themselves don't accidentally collide. Why not implement your own consistency check during the build that identifies which entries lack the associated shortname field? Such a check could report all the missing shortname fields as errors in a way that the release manager can easily fix and re-run (e.g. it could produce some sort of diff that the release manager could merge and then re-upload to Zotero).
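Such a check could be quite small. A sketch, assuming the database ends up as a `data.bib` file read with bibtexparser and that the field is literally called `shortname` (both assumptions on my part):

```python
import sys
import bibtexparser

# Sketch of a pre-publication check: report every entry that lacks a
# shortname field, so the release manager can fix them all in one pass.
with open("data.bib") as f:
    database = bibtexparser.load(f)

missing = [entry["ID"] for entry in database.entries if not entry.get("shortname")]
if missing:
    print("entries missing a shortname field:", file=sys.stderr)
    for entry_id in missing:
        print(f"  {entry_id}", file=sys.stderr)
    sys.exit(1)
```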
All of this would be more straightforward to do if `data.bib` were canonically in the same git repository, though, I think. Is there a reason to avoid including `data.bib` here?
It occurs to me that the way the current Python code is using bibtexparser, it basically assumes that each bibtex entry's ID is unique (the code accesses the dict object's values, which means that entries with identical IDs will collide). Given that, we could just use the raw bibtex ID (its uniqueness is already being relied on!) instead of involving slugify or hashing. It is available in each bibtex entry as `['ID']`, I think, in addition to being the key of the dict.

You'd probably want to start with a cleanup pass through the existing `data.bib` to assign more salient IDs, though, and I don't know how those IDs are exposed in Zotero. I've opened #54 to at least check for duplicates.
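A sketch of what both pieces might look like together: detecting duplicate IDs and then using the raw ID as the slug. The `/entries/` URL layout is just an illustration, not the project's actual routing:

```python
import collections
import bibtexparser

with open("data.bib") as f:
    database = bibtexparser.load(f)

# database.entries is a plain list, so duplicate IDs are still visible here,
# whereas a dict keyed by ID would silently collapse them.
counts = collections.Counter(entry["ID"] for entry in database.entries)
duplicates = [entry_id for entry_id, n in counts.items() if n > 1]
if duplicates:
    raise SystemExit(f"duplicate bibtex IDs: {', '.join(duplicates)}")

# Once uniqueness is checked, the raw ID can act as the URL slug directly.
urls = {entry["ID"]: f"/entries/{entry['ID']}/" for entry in database.entries}
```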
As noted over in #17, if `json.dumps` ends up dynamically re-ordering the keys of the dict, or changes the default for `sort_keys` to `True`, then the URLs for each entry will change. Likewise, if any entry changes or is updated at all, its URL will also change.
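To make that second point concrete, here is an illustrative stand-in for the current digest-based slug (not the project's actual code):

```python
import hashlib
import json

def url_digest(entry: dict) -> str:
    # Illustrative stand-in for the current slug scheme.
    return hashlib.sha256(json.dumps(entry, sort_keys=True).encode("utf-8")).hexdigest()[:8]

original = {"title": "Example Report", "author": "Doe, Jane", "year": "2020"}
corrected = dict(original, title="Example Report, 2nd edition")

# Any edit to the entry produces a different digest, so the published URL
# changes even though it is still "the same" compendium entry.
assert url_digest(original) != url_digest(corrected)
```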
It seems simplest to have an identifier associated with each entry, and a consistency check to ensure that this identifier is in fact unique. Then the URL for each entry will be effectively static, even if the entry is updated or if `json.dumps` ends up producing different output.

There would be a one-time cost of producing the identifier for each entry, but that can be done with a one-time pass over the input database. If I understood better how the database is produced (I don't use Zotero), I'd offer you code that does that one-time pass.
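In case it helps, a rough sketch of what such a one-time pass could look like, under the assumption that the database can be exported to (and re-imported from) a BibTeX file; the `shortname` field, the filenames, and the use of python-slugify are all guesses rather than known project conventions:

```python
import bibtexparser
from slugify import slugify  # python-slugify; the project may use a different slugifier

# Rough sketch of the one-time pass: give every entry lacking a stable
# identifier a "shortname" derived from its title and year, then write the
# result back out for re-import.
with open("data.bib") as f:
    database = bibtexparser.load(f)

seen = set()
for entry in database.entries:
    if not entry.get("shortname"):
        candidate = slugify(f"{entry.get('title', entry['ID'])} {entry.get('year', '')}")
        while candidate in seen:
            candidate += "-x"  # crude disambiguation, good enough for a sketch
        entry["shortname"] = candidate
    seen.add(entry["shortname"])

with open("data.with-shortnames.bib", "w") as f:
    bibtexparser.dump(database, f)
```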