EticaAI / HXL-Data-Science-file-formats

Common file formats used for Data Science and language localization exported from (and to) HXL (The Humanitarian Exchange Language)
https://hdp.etica.ai/
The Unlicense

[meta] HDP files strategies of integrity and authenticity (hash, digital signatures, ...) #17

Open fititnt opened 3 years ago

fititnt commented 3 years ago

Related:


First things first: one primary goal of HDP files themselves is to allow the exchange of both how to reference datasets and how data is allowed to be manipulated and, as a consequence, this means auditability. Also, HDP files (at least the ones used by end users) are meant to be usable if printed on paper (think of a judge attaching HDP instructions that, in the worst case, someone would have to type in again). HDP files should be human readable.

Note that the data themselves can be (and by default are!) considered sensitive. But the ideal (and this means what is being optimized for) is that people could exchange HDP files without fear if the files leak or need to be audited. This means that even if we could make it easier to embed passwords or direct access to private resources in the file, we're likely to make it intentionally hard, so that the average user simply doesn't know how to do it.

File-based encryption of (typically) HDP files is not a goal, but integrity (and in some cases, authenticity) is required.

1. So what's the point of integrity checks?

One core feature of HDP is having a common vocabulary to allow translation of HDP files between different human natural languages, and to do it in such a way that, whatever natural language the file was originally written in, it could ideally still be restored to its original form.

In other words: if the HDP file is translated on-the-fly when a user does not understand, say, Modern Standard Arabic, we could have multiple teams exchanging files (maybe even working on the same filesystem!) even if most people don't speak the same language.

But then some points of improvement come up:

  1. How to check that an on-the-fly translation was not changed?
  2. What if tools make some easy-to-catch mistakes and now the original file is not reversible on-the-fly?
  3. What if the tools that generate the hash receive an upgrade and the new hashes do not match? (Note that for this case, since HDP has much, MUCH more moving parts than a static file, users could upgrade old files or at least use an external tool, like a file-based one, to test integrity.)

Note: the HDP files themselves (as soon as, eventually, not just Latin is the reference language, but all other core languages are equally valid) may intentionally need changes. So some way to check can help humans avoid out-of-sync states.
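A minimal sketch of the reversibility check from point 2, in Python. The vocabulary below is a made-up toy mapping, not HDP's real vocabulary, and `translate_keys`, `invert` and `is_reversible` are hypothetical helper names. It only handles nested dictionaries (not lists), so treat it as an illustration of the idea rather than how HDP tooling does it.

```python
# Toy vocabulary: reference term -> localized term.
# Hypothetical example terms, NOT the real HDP vocabulary.
VOCAB = {"fontem": "fonte", "datum": "dados", "scriptum": "script"}


def translate_keys(doc: dict, vocab: dict) -> dict:
    """Recursively rename dictionary keys using vocab; unknown keys are kept."""
    out = {}
    for key, value in doc.items():
        new_key = vocab.get(key, key)
        out[new_key] = translate_keys(value, vocab) if isinstance(value, dict) else value
    return out


def invert(vocab: dict) -> dict:
    """Build the reverse mapping, refusing vocabularies that cannot be reversed."""
    inverse = {localized: reference for reference, localized in vocab.items()}
    if len(inverse) != len(vocab):
        raise ValueError("two reference terms share the same translation")
    return inverse


def is_reversible(doc: dict, vocab: dict) -> bool:
    """True if translating and translating back yields the original document."""
    translated = translate_keys(doc, vocab)
    restored = translate_keys(translated, invert(vocab))
    return restored == doc
```

A document that already contains a localized term next to its reference term (for example both `fonte` and `fontem` at the same level) fails this check, which is the kind of easy-to-catch mistake a tool could warn about before the original is lost.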

1.1 Some non-cryptographic hashing

Actually, to make it feasible to translate from and to other languages we need some integrity check. This is why we need to get it working as soon as possible.

It's not rocket science. Even an MD5-like hash would do. This is meant to catch non-intentional errors.

We may actually use some weak hashing integrity check (and explicitly say that it is weak) so the users don't have a false sense of security.
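As an illustration of a deliberately weak check, here is a minimal Python sketch using CRC32 from the standard library. The `weak-crc32:` prefix and the normalization rules (line endings and trailing whitespace are ignored) are assumptions made for this example, not something HDP defines.

```python
import zlib


def weak_checksum(text: str) -> str:
    """CRC32 of the text, ignoring line-ending style and trailing whitespace.

    The label in the result says out loud that this only catches accidents.
    """
    normalized = "\n".join(line.rstrip() for line in text.splitlines())
    crc = zlib.crc32(normalized.encode("utf-8"))
    return f"weak-crc32:{crc:08x}"
```

Because the algorithm name is part of the value, a user (or a later tool version) can see at a glance that the check protects against typos and broken copies, not against someone editing the file on purpose.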

1.2 Authenticated signatures

Authenticated signatures, maybe both with a secret (think of a password-like string) and with public key authentication, are still worth having. Note that it is always possible to just do this over entire source files (without using any HDP internal hashing to selectively ignore parts that don't matter), but at some point we may also release some way to allow authenticity/integrity checks that also consider the internals.

But the main point here is that if the default is not user friendly enough, or could actually make the users' experience miserable (like keeping track of several secrets just to check authenticity, which then encourages bad usage), we may not enforce it for everyone.

Also, we're aware that the average user base, instead of using Git (like private repositories on GitHub/GitLab/Gitee), is likely to share files via Google Drive, Dropbox, etc. Even without considering "state-sponsored attacks", someone could simply steal a collaborator's access to that cloud storage. So it may actually be desirable to use such features if the files themselves are saved outside a secure network.
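The shared-secret variant could look roughly like the sketch below, which signs the entire file content with HMAC-SHA256 from the Python standard library. The function names are hypothetical and this is not an HDP feature; the public key variant would need an external tool such as GPG and is not shown.

```python
import hashlib
import hmac


def sign(content: bytes, secret: bytes) -> str:
    """Return an HMAC-SHA256 tag for the whole file content."""
    return hmac.new(secret, content, hashlib.sha256).hexdigest()


def verify(content: bytes, secret: bytes, tag: str) -> bool:
    """Check a tag in constant time, so timing does not leak how close a guess was."""
    return hmac.compare_digest(sign(content, secret), tag)
```

Whoever does not know the secret cannot produce a valid tag for a modified file, but everyone who needs to verify must hold the same secret, which is exactly the "keep track of several secrets" burden mentioned above.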

2. Reflective quote "What's your threat model?" (Extra: memes added)

There are so many potential threat models that, at least in my personal opinion, we could either go for users' simplicity (while still operational) or go full military-grade authenticity, like the use of FIPS-compliant GPG smartcards ready to use on air-gapped networks.

On image: meme about threat models

[Image: screenshot from 2021-03-21 22-10-25]

Note that I'm very aware that (especially for potential users who create HDP files or process HDP files from others) the ideal usage (think of an information manager working as a data hub for MANY other working groups) is the extreme of an air-gapped network, but our point here is that HDP files themselves shouldn't require the same level of sensitivity as the data themselves. We may not be able to ship the most user-friendly implementations, but whoever processes the data or prepares HDP files to be exchanged should make sure that the consumers have some friendly way to check authenticity.

On image: meme about how we should not use ways to check authenticity (which is different from encryption) that the average end user could use wrong.

[Image: vault-no30-company-safe-acces-okay-so-the-password-is-2739121]

3. Opinionated idea about not using security by obscurity or "strong algorithms" used wrong.

This is directed at people who would think that AES 256 is 2 times stronger than AES 128. This talk is from 2009 but, for those who understand English, it can give an idea of how just using strong algorithms can still make things go wrong: https://www.youtube.com/watch?v=ySQl0NhW1J0

I also really like the idea that we try to focus on acceptable security that is less likely to be used wrong. Note that a good part of HDP itself, by allowing multiple natural languages, meets criterion 2, '2. Speak the user’s language!':

Source: https://www.usenix.org/sites/default/files/conference/protected-files/hotsec15_slides_green.pdf

[Image: screenshot from 2021-03-21 22-25-58]

In other words, in general HDP itself, as one way to exchange what is meant to be done, is likely to not implement features that are unsafe for the average user. When it is unavoidable to implement ones that can go wrong, we still keep simplicity by default while allowing those who have advanced threat models to fit HDP into their current workflow.

fititnt commented 3 years ago

See


Not so fun fact: both SHA-2 256 and SHA-512, while OK in the sense that it is impractical to create files with the same hash, in some very specific cases of how these algorithms are used (the length extension attack) make it possible to append more data while a tool trying to check if the hash is valid would still find it OK.

SHA-3, BLAKE and even SHA-2 224 (note, smaller size than SHA 256) are not vulnerable to this.

MD5, obviously, is vulnerable. But at least it does not give a false sense of security.
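The "very specific cases" are presumably constructions where a secret is simply prepended to the data before hashing. The Python sketch below contrasts that construction with BLAKE2's built-in keyed mode from `hashlib`. The function names are hypothetical, and the sketch only shows the two constructions; it does not perform the attack itself.

```python
import hashlib
import hmac


def naive_prefix_tag(secret: bytes, message: bytes) -> str:
    """sha256(secret + message): the construction exposed to length extension.

    Shown only to name the pattern to avoid; do not use it to authenticate files.
    """
    return hashlib.sha256(secret + message).hexdigest()


def keyed_blake2_tag(secret: bytes, message: bytes) -> str:
    """BLAKE2b in keyed mode; the key is an input of the hash itself (max 64 bytes)."""
    return hashlib.blake2b(message, key=secret, digest_size=32).hexdigest()


def verify(secret: bytes, message: bytes, tag: str) -> bool:
    """Constant-time comparison against the keyed BLAKE2b tag."""
    return hmac.compare_digest(keyed_blake2_tag(secret, message), tag)
```

With plain SHA-256 or SHA-512 the usual fix is to wrap the hash in HMAC instead of concatenating the secret by hand.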

Potential idea to give room for file changes

While obviously the idea when receiving data from others is to use a decent method of authenticity, we still have the case where average people may like to save the files on cloud storage, and these files may be changed there.

A person able to edit the file obviously can also simply edit the hash. (The cloud storage could warn about data changes or, only with much effort, help law enforcement discover the IP or more info about who changed it, but the average user would not really look at the dates.)

Another point is that, even if there is some way for a user to create an authenticated hash from a password that only the user knows, at the bare minimum someone could simply replace a new version with an old version (cloud storage even has versioned storage).

Note that I'm not against also allowing users to have some way to create their own authenticated hash, just that this alone doesn't solve this threat model.

Maybe also implement (as an external tool) something that saves the hashes of files to a file that is not on the shared storage?

With some shell kung fu, I think it is possible to just save hashes of the files that are in some folders. A user could, regardless of other functionalities, use this to at least know which files changed since the last time the user did a check.

This approach obviously only makes sense if the place that stores the file with checksums is not in the folder that is also shared online (or, if not online, it could be just a USB drive, some hard drive, or a computer that the user is not at all the time).
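A rough sketch of such an external tool, written in Python rather than shell to keep it portable. The function names and the JSON manifest format are assumptions made for this example; the only important design point is that the manifest path lives outside the shared folder.

```python
import hashlib
import json
from pathlib import Path


def hash_file(path: Path) -> str:
    """SHA-256 of a file, read in chunks so large files do not fill memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot(folder: Path) -> dict:
    """Map each file's path (relative to folder) to its current hash."""
    return {
        p.relative_to(folder).as_posix(): hash_file(p)
        for p in sorted(folder.rglob("*"))
        if p.is_file()
    }


def compare(old: dict, new: dict) -> dict:
    """List which files were added, removed, or changed between two snapshots."""
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "changed": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }


def check(folder: Path, manifest: Path) -> dict:
    """Report changes since the last check, then store the fresh snapshot.

    The manifest should be kept somewhere that is NOT synced to the shared storage.
    """
    old = json.loads(manifest.read_text(encoding="utf-8")) if manifest.exists() else {}
    new = snapshot(folder)
    manifest.write_text(json.dumps(new, indent=2, sort_keys=True), encoding="utf-8")
    return compare(old, new)
```

On the first run every file shows up as added. Since the manifest is overwritten on each check, a user who wants to investigate a reported change should copy the report before running the check again.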

fititnt commented 3 years ago

Relevant commits (I forgot to cite them):


I really liked the approach used by the authors, both the workflow to do the signing (allowing even smart cards) and the Firefox/Chrome extensions that do the check. But I'm not sure (except if we or someone else ship tools based on the hxlm-js subset to bootstrap desktop applications) whether we could manage to serve GPG-signed index pages without giving too many false negatives (at least it is not the opposite, but it would still not be user friendly compared to plain HTTPS). The hxlm-js/README.md has some source comments that may either be removed or reviewed. But at this moment the HTML PGP signatures generated for hxlm-js/index.html do not match (at least when using the Chrome extension).