EticaAI / HXL-Data-Science-file-formats

Common file formats used for Data Science and language localization exported from (and to) HXL (The Humanitarian Exchange Language)
https://hdp.etica.ai/
The Unlicense

[meta issue] hxlm #11

Closed. fititnt closed this issue 3 years ago

fititnt commented 3 years ago

This issue will be used to reference commits from this repository and others.

TODO: add more context.


Update 1 (2021-03-01):

Ok. I liked the idea of YAML-like projects! But it may be easier to do the full thing than to explain it upfront. (I'm obviously biased because of Ansible, but anyway: I know it is possible to even implement testinfra; still, it would be easier to create an "Ansible for datasets + (automated) compliance" than to reuse Ansible.)

Also, YAML, unlike JSON, is much more human-friendly (for example, it allows comments!), so this can help somewhat.

Being practical, at this moment I think this will mostly be a wrapper around libraries and APIs that already exist (i.e. syntactic sugar, not really new features). But as soon as the building blocks are ready, the YAML projects themselves become powerful!

fititnt commented 3 years ago

Ok. I think I will give up on the idea of trying to make the code generate a schema of what is on disk

[Screenshot from 2021-02-28 08-33-44]

but do the opposite: let the YAML describe what is on disk (or what the final state on disk should be).

[Screenshot from 2021-02-28 08-29-00]

It turns out this resembles Ansible playbooks a lot! But instead of an entire group of servers, it is a group of datasets on local disk. Even if in the next days each of these points in the YAML inventory (hdatasets, hfiles, etc.) is already mapped to action classes, the equivalent of ad-hoc Ansible tasks would still be missing. The Ansible ad-hoc tasks would be what HXL calls recipes:
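
Just to make the analogy concrete, here is a rough sketch (in Python, loading an invented YAML snippet) of how such an inventory could be mapped to action classes. The key names mirror the ones mentioned above (hsilo, hdatasets, hfiles, hrecipes), but the layout and the mapping are assumptions, not the actual hmeta.yml format:

```python
import yaml  # PyYAML

# Illustrative only: a guess at an hmeta.yml-style inventory, not the real format.
HMETA_EXAMPLE = """
hsilo:
  tags: [baseline]
hdatasets:
  - name: example-dataset
hfiles:
  - path: example.csv
hrecipes:
  - name: minimal-recipe
    recipe: []   # an HXL processing spec would go here
"""

# Hypothetical mapping from inventory sections to "action classes".
ACTION_CLASSES = {
    "hdatasets": "HDataset",
    "hfiles": "HFile",
    "hrecipes": "HRecipe",
}

inventory = yaml.safe_load(HMETA_EXAMPLE)
for section, items in inventory.items():
    handler = ACTION_CLASSES.get(section, "(no action class yet)")
    print(f"{section} -> {handler}: {items}")
```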

Why YAML over JSON

At this moment, I think the main difference here over just using ad-hoc HXL-proxy recipes is the fact that we start to have an inventory. Ansible separates what is inventory from what is a task (so the same tasks can be reused across several projects that are somewhat similar).

But the main reason I started to look at YAML was not even Ansible: I remembered that YAML is easier to deal with when you need comments, while still being powerful to process with tools.

Special attention to the concept of compliance (this is likely to take months)

Some building blocks exist to be abstracted, but one thing that deserves special attention, so that in the end we come up with a descriptive language that is easier to abstract, is the concept of compliance rules. So we're not only talking here about having one common way to express concepts: it also needs to be in the local language and needs to support spaces, accents, etc., even in the key terms (not only in the values).

Compliance rules are roughly the idea of computing a result of whether something is authorized or not, and what to do if it is not authorized. Compliance rules could also apply filters, or at least require that a human ask someone for permission when some specific filter does not apply to a case. In a scenario where people trust a computer more than each other (or actually do trust each other, but need some explanation to avoid breaking laws that would take weeks or months to get clearance for), if whoever approves feels safe about what is in a YAML compliance file, and there are people outside the organization who can vouch that it at least reduces human error, this eventually allows faster data exchange for more sensitive content.
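
As a very rough illustration of "compute a result of whether something is authorized, and what to do if it is not", here is a minimal sketch in Python. Every field name (allow_tags, on_violation) is hypothetical and not part of any real hcompliance vocabulary:

```python
# Minimal sketch of the "is this authorized, and what to do if not" idea.
from dataclasses import dataclass
from typing import List


@dataclass
class ComplianceRule:
    allow_tags: List[str]                     # HXL hashtags that may be shared as-is
    on_violation: str = "require-human-approval"


def evaluate(columns: List[str], rule: ComplianceRule) -> dict:
    """Return a verdict plus the suggested action when not authorized."""
    blocked = [c for c in columns if c not in rule.allow_tags]
    return {
        "authorized": not blocked,
        "blocked_columns": blocked,
        "action": None if not blocked else rule.on_violation,
    }


rule = ComplianceRule(allow_tags=["#adm1+code", "#population"])
print(evaluate(["#adm1+code", "#contact+email"], rule))
# -> {'authorized': False, 'blocked_columns': ['#contact+email'],
#     'action': 'require-human-approval'}
```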

fititnt commented 3 years ago

HFile is already able to reload files from remote sources (the first one that works will be downloaded, if there is no copy on disk yet).

HRecipe already has a draft using the HXL-proxy recipes, but if we manage to make it work also with libhxl-python, the hmeta.yml project file can be used to play around with multiple JSON recipes.

Before going to compliance rules, I think we need to abstract the JSON recipes. I'm not fully sure whether some features of HXL-proxy are only in HXL-proxy and not in libhxl. But one thing that is really necessary for compliance is some quick way to discover the headings of each file, since some compliance rules may need to allow/block (or at least require human review to force on the hmeta.yml that it is ok) based on the typical headings.
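
For the "quick way to discover the headings of each file" part, something like the sketch below may be enough, assuming libhxl-python's hxl.data() and the header/display_tag attributes on columns behave as I remember (adjust if the API differs); the URL is a placeholder:

```python
import hxl

DATASET_URL = "https://example.org/some-hxlated-file.csv"  # placeholder URL

source = hxl.data(DATASET_URL)
for column in source.columns:
    # column.header is the human text heading; column.display_tag the HXL hashtag
    print(column.header, "->", column.display_tag)

# A compliance pre-check could then allow/block based on the collected tags,
# or flag the file for human review in hmeta.yml.
```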

fititnt commented 3 years ago

I just realized that hxlm/data/baseline/hmeta.yml & hxlm/data/baseline/hcompliance.hmeta.yml in practice allow a sort of Declarative programming (https://en.wikipedia.org/wiki/Declarative_programming). [1]

Considering that the really complicated thing will be the compliance rules (and compliance rules ideally should, by design, be strictly translatable even between languages, since some countries/territories would take too much time to translate from the local language to a common language, and that would be unacceptable), I think that the parts that matter should already be enforced. Going all the way with declarative programming makes it easier for the humans who approve what is right and what is wrong, while offloading complexity to whoever actually implements the software.

I will make some tests.


1:

The (not implemented) htasks would break this, but it may be worth simply not implementing it (or allowing it, but only via command line operations).

fititnt commented 3 years ago

I was trying to keep cryptography dependencies out of the hxlm package (because this could be done in the compliance extension), but at least when using urnresolver for very sensitive content, I think that remote authentication alone is not good enough. I don't say this because authentication is technically bad (even basic auth over HTTPS can be ok if there is still no strong requirement for centralized authentication) but because of HOW people use it. Also, some sort of protection would still be needed if URN index files are allowed to be cached on local disk instead of always requiring remote access.

This (people using it wrong) may happen in particular:

  1. Under urgency to let everyone know which URNs are already available (just knowing that the URN exists, plus maybe some metadata, would be sufficient, while access to the data itself needs extra steps) when operating via command line tools, someone at a higher level could make mistakes for the sake of urgency. An example is relying too much on authentication while allowing access to an entire group, because the person who set up the server authentication may be off setting up another separate group. In such scenarios the ideal is, in addition to some minimal generic authentication, to also allow encrypting sub-items with additional encryption.
    1. In theory, for those who already use some sort of abstraction like GoogleDrive/iCloud/OneDrive, this is not such an issue, but note that urnresolver, when operating via command line, needs some extra abstraction, and encryption could add an extra layer.
  2. If for some reason access to a local computer is lost (and since I'm likely to make it work even with Android Termux, it's not just local computers or servers), users would still have some file-based encryption for the URN index files that allow local caching.
    1. Please note that not all URN listings are equal. For some, often even the contents (not just the URNs) should already be cached on local disk, since annoying users by requiring them to manually add an extra key (even one different from other groups of URNs) could make the user experience miserable. For example, "CODs", "FODs" and things that don't have personal information but are used to link/translate/convert other datasets: users could either use generic URN providers or some basic auth (or a locally mounted GoogleDrive/iCloud/OneDrive could already have all the common things); in my honest opinion these do not need more granular encryption.
    2. Also, a local cache (maybe by default) would not need to ask the user for an encryption strategy if the tool already knows that the URNs the human is trying to save locally are already public. This approach could allow the generic library to even postpone the need to install cryptography dependencies until they are really needed.

Note that in general (at least for URNs; encryption of file contents is out of scope, this is about how to find them) I believe that if I have to optimize something, it should be security, not persistence or backwards compatibility with users' old local caches. If we make it pretty trivial for whoever consumes others' URNs to rebuild from scratch (as long as they have valid decryption keys), the only main point here is that the individuals who create individual URN index files, at least that human (or a close contact, if some items are even more sensitive and, instead of splitting into several files, they decide to put everything in a single one), should know the unencrypted values.

Anyway, there are some reasons beyond just compatibility with the program. Since it is possible to add alternative locations for each URN, and the planned idea is to eventually allow even different decryption keys, different users requesting the same URN identifier, even from the same URN index files but with different sets of keys, may actually see different URLs for the same URN.
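
A sketch of that last point, assuming (only for illustration) Fernet from the cryptography package and an invented index-entry layout; users holding the right key see the extra URL, everyone else only sees the public one:

```python
from cryptography.fernet import Fernet, InvalidToken

# Hypothetical URN index entry: plain sources plus sources encrypted
# with keys that only some users hold. Not the actual urnresolver format.
entry = {
    "urn": "urn:data:xz:example:dataset1",
    "sources_plain": ["https://public.example.org/dataset1.csv"],
    "sources_encrypted": [],
}

key_group_a = Fernet.generate_key()
entry["sources_encrypted"].append(
    Fernet(key_group_a).encrypt(b"https://restricted.example.org/dataset1.csv")
)


def resolve(entry: dict, user_keys: list) -> list:
    """Return every URL this user can see for the URN."""
    urls = list(entry["sources_plain"])
    for token in entry["sources_encrypted"]:
        for key in user_keys:
            try:
                urls.append(Fernet(key).decrypt(token).decode("utf-8"))
                break
            except InvalidToken:
                continue
    return urls


print(resolve(entry, []))             # only the public URL
print(resolve(entry, [key_group_a]))  # public + restricted URL
```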

fititnt commented 3 years ago

The new hdpcli (the HDP class), the first to be implemented because of the conversion from YAML to JSON processing specs (#14), will actually allow fetching instructions from remote hosts, like test files from GitHub.
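
The conversion itself is mostly parse-and-dump; a minimal sketch, where the recipe body is a placeholder rather than a validated HXL processing spec:

```python
import json
import yaml

YAML_RECIPE = """
hrecipes:
  - name: example
    recipe:
      - filter: cache   # placeholder step, not a checked processing spec
"""

parsed = yaml.safe_load(YAML_RECIPE)
json_spec = json.dumps(parsed["hrecipes"][0]["recipe"], indent=2)
print(json_spec)
```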

hdpcli as "offline-first" usage

At this point there is no problem with this, but I think that, by design, it may be better to start "in offline mode" and either require an extra command from the user or interactively ask whether the user allows connecting to the host (at least if we detect that this is an interactive session, not running as part of some script).
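
A minimal sketch of that behavior, assuming a TTY check plus a hypothetical --online style flag; nothing here is actual hdpcli code:

```python
import sys


def allow_remote_fetch(host: str, force_online: bool = False) -> bool:
    if force_online:                       # e.g. an explicit --online CLI flag
        return True
    if sys.stdin.isatty() and sys.stdout.isatty():
        answer = input(f"Allow hdpcli to connect to {host}? [y/N] ")
        return answer.strip().lower() == "y"
    return False                           # non-interactive: stay offline by default


if allow_remote_fetch("raw.githubusercontent.com"):
    print("fetching remote HDP file...")
else:
    print("staying offline")
```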

Even with acceptable sandboxing-by-default (and an eventual way to guarantee that publicly shared HDP/URN files are signed for authenticity), there is still the privacy point.

For offline mode, for example, urnresolver #13 already has a draft of this (but the user needs to have files in some place on their own computer; we don't yet have a directory structure to discover the files). But as soon as hdpcli is implemented to allow loading files from remote sources, even if YAML (instead of plain shell or Python scripts) is already a form of sandbox, and even if hypothetically we de facto go as deep as implementing signed files (e.g. think as far as having every public URN list or HDP-style file signed), we would still be left with... the privacy.

For whoever reads this later: I'm not talking about the privacy of the humans that are referenced in the data managed by these tools, but of the humans that would use the command line tools to automate tasks. As happens with anything that accesses the internet, if hdpcli/urnresolver is allowed to fetch data from the internet, whatever the host is, it can learn the requester's IP address.

allow offline mode (structured cache) also as a way to mitigate overloading remote servers with requests

In particular, if urnresolver becomes ok to use as a standalone CLI or as part of other libraries (and not just as an internal tool here), depending on how CLI tools deal with caching, misbehaving tools could make a lot of requests just to learn the available URNs. (This is also why the URN index files are likely to allow simple text files, even if this means letting users encrypt just specific content and not care whether the file itself is publicly accessible. This approach is somewhat meant to mitigate server load while still keeping some way to find content.)

Things are even worse because, since YAML would be easier to use even on a local machine (which is often done with HXL-Proxy), the person often works with large files and doesn't download them locally first. While these files may not be requested as often, they can be much larger than what HXL-Proxy allows by default (which is already a lot! It can easily pass 500,000 rows of data).

So there do exist cases where a mix of allowing online and offline access (or an organized local cache) is actually useful beyond the privacy part. In fact, this is the biggest reason.
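
A sketch of such an organized local cache with a time-to-live; the cache directory and the 15 minute TTL are arbitrary choices for illustration:

```python
import time
import urllib.request
from pathlib import Path

CACHE_DIR = Path.home() / ".cache" / "urnresolver"   # hypothetical location
CACHE_TTL = 15 * 60                                  # 15 minutes, in seconds


def fetch_index(url: str, filename: str) -> bytes:
    """Return a URN index file, reusing a recent local copy when possible."""
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    cached = CACHE_DIR / filename
    if cached.exists() and (time.time() - cached.stat().st_mtime) < CACHE_TTL:
        return cached.read_bytes()                   # fresh enough: no request
    with urllib.request.urlopen(url) as response:    # otherwise hit the server
        data = response.read()
    cached.write_bytes(data)
    return data
```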

hdpcli may have conventions by which the data provider can enforce disabling offline/local cache for certain contents

Even if whoever gets access to data allowed by other organizations very likely has already explicitly passed some trust check, if we automate ways to make local caches I'm 100% sure this will conflict with specific usages. What I mean is that, ideally, at least the way URNs are documented could give a hint whether the provider doesn't mind being overloaded with requests because that is preferable to the user keeping a local copy of the index files. (Note that a user having access to a URN that lists available resources does not mean the human will download them, just that the person knows they exist and may already have a link to download them.)

The problem, if some URN index requests this feature (and this provider is very important, or at least not untrusted by default), is that it could conflict with the privacy of the human. Maybe there should be a way (even if the user, based on an already trusted level, would in fact have access to the resources without human intervention) to have some metadata saying that the URN provider allows users to keep a local cache (at least the URN keys, even with no other metadata at all). This is something that still needs some planning, because if at least some way existed to know the URN keys, the user could "fetch" the URN index file (which, again, does not mean the person will try to download the resource) only when the URN file would actually have that resource. Also, for performance reasons, features like command line autocompletion would need this.

And what if the dataset was already downloaded (especially if it was not an internal cache of the tool that is cleaned implicitly)?

If it would be a win-win situation (both keeping the flow of data sharing while solving common issues, like someone releasing something that should not have been released), it may actually make sense to allow something that can invalidate (i.e. kindly ask to delete) a dataset that was downloaded by another peer. A perfect example is the Twitter API (or at least the streaming API that I tested years ago): users can have access to tweet data of other users, but the API can also send requests with specific IDs which, when implementers see them, they are expected to immediately delete the respective tweet.

While I'm not sure (and this actually would need a lot of thought about all the bad things that could happen, like one organization, if this is implemented by default, actually asking to invalidate/delete a massive amount of files from a different organization), I think that, at least for individual, explicitly named datasets that were shared very recently, even if they were not marked as sensitive content initially (or as needing some extra authorization), if on a new check for updates we get info that such a dataset was released by mistake, it may be reasonable to delete the local dataset.

How to mark a dataset as "mistakenly" released may not be intuitive. For example: if one URN index file is accessed both by trusted personnel and by a broader public, a generic request could delete it even from peers that actually should have access to those files (or that maybe even already have something with the same URN). So, at least at this moment, I think one way is, at least for datasets announced on URN index files: the local client hdpcli assumes that the user made a local copy (or has it as local cache) when the new dataset had direct access via a URN, and if on the next refresh (maybe every 15 minutes?) the URL has become encrypted (maybe with some additional hint that could purposely be vague or redundant, but sufficient for hdpcli to know that if the current user can't decrypt the URL, someone made a mistake in the last hour), we could force deleting the local cache of the local user.
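
A sketch of that heuristic, reusing the (illustrative) Fernet-based entry layout from earlier in this thread; all field names are hypothetical:

```python
# If a URL that used to be plain is now only available encrypted, and none of
# the user's keys can decrypt it, assume a mistaken release and drop the copy.
from pathlib import Path
from cryptography.fernet import Fernet, InvalidToken


def can_decrypt(token: bytes, user_keys: list) -> bool:
    for key in user_keys:
        try:
            Fernet(key).decrypt(token)
            return True
        except InvalidToken:
            continue
    return False


def maybe_invalidate(local_copy: Path, refreshed_entry: dict, user_keys: list):
    was_plain_before = refreshed_entry.get("was_plain_on_last_refresh", False)
    now_only_encrypted = (not refreshed_entry.get("sources_plain")
                          and refreshed_entry.get("sources_encrypted"))
    if was_plain_before and now_only_encrypted and not any(
            can_decrypt(t, user_keys)
            for t in refreshed_entry["sources_encrypted"]):
        local_copy.unlink(missing_ok=True)   # kindly delete the local cache
```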

potential conventions on how to deal with URN data leaks

Like I said, I'm not fully sure at this moment how to deal with something like someone sharing the wrong resource, but this is likely to happen, especially when fast-paced data sharing is needed, so it makes sense for it not to be complicated to implement on the client side. Maybe the draft of how to resolve urn:data (#13) and the documentation of the URN index files could already cover this, if not as a user agreement like a company such as Twitter has, then as a moral code of conduct.

But just to say upfront: in general, optimizing the sharing of sensitive data is by nature complex. Very likely the people who would have access to this already have some level of trust, and while adhering to the best data sharing standards is desirable, there exist contexts where this is, not just morally but by law, secondary to other urgent needs.

fititnt commented 3 years ago

I just added the functionality of loading files by expected suffix per directory, and the HXL data processing specs (an array) already (as I expected sooner or later) no longer guarantee the exact order as when everything is in a single file.
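
A small sketch of the loading step: sorting the matched paths keeps runs reproducible even though per-directory discovery no longer guarantees the single-file order (the suffix is just an example):

```python
from pathlib import Path


def load_project_files(base_dir: str, suffix: str = ".hmeta.yml") -> list:
    # sorted() gives a stable, documented order (lexicographic by path),
    # instead of whatever order the filesystem happens to return.
    return sorted(p for p in Path(base_dir).rglob("*" + suffix))


for path in load_project_files("."):
    print(path)
```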

Based on my experience with Ansible (medium to big projects, including creating Ansible Roles), I will try to optimize for medium to large projects rather than average single-file usage. I do not have experience from the very early days of Ansible 1.0, so most of the things needed to run partial playbooks were very likely already there, but I think that the idea of selecting a recipe by array index is definitely so prone to go wrong that it should not even be in the documentation.

Analogies to Ansible

In some aspects the HDataset (and implicitly a potential result of each recipe) is similar to the Ansible inventory. (The playbooks/tasks, I think, should be as abstracted as possible to avoid the user being disappointed with the order of execution; or at least we try to delay the moment when a user has no option but to learn about the order of execution. The ideal would be that a user who is just consuming an already working project is able to reproduce it, and things keep working even if the user starts to merge more and more files.)

Tags to allow controlling selection (include/exclude by tag)?

Ansible lets users use tags a lot, both to select what they want and, by tag, what they do not want (in fact I personally overuse this beyond average Ansible users, but ok). Here it may be the case that we could, at a bare minimum, already have such tags (but I already know that this alone is not sufficient, especially when reusing projects from others, given the low likelihood of getting people to keep the same tagging conventions).

URNs, if used, would allow exact selection. But this may reduce reusability of inner parts and could also force users to make decisions too soon. Anyway, it could be a good idea that, if the user does not explicitly create exact URNs, we implicitly create them based on context. Maybe an implicit 2-letter ISO country code for "localhost" if the user did not select one or did not receive project files already well organized?

One thing that Ansible has for hosts is 'localhost', 'all' (which includes all hosts except localhost, because otherwise this could break things like installing/removing things from the user's own computer!) and 'ungrouped'. Maybe we could create another pseudo-concept, almost like tags, but instead of being applicable even to sub-items, it would be required to be more "top level", like hsilo?

TODO: maybe (if implemented), in addition to something like 'localhost', 'all' and 'ungrouped', there should be something that explicitly marks rules that are loaded from outside localhost. Like if the user explicitly allowed loading rules from outside the default, safer network? Also, different from Ansible, most things here would actually be localhost, so the interest is in protecting (or dealing differently with) what is not local.

Another thing I'm considering doing: if a user adds a tag directly at the scope of an hsilo, the tag also applies to anything that is in that silo (including hrecipes, hfiles, HDataset, maybe later even included_files).
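
A sketch of that propagation rule over an invented document; the layout is illustrative, not the final HDP format:

```python
import yaml

DOCUMENT = """
hsilo:
  tags: [baseline]
hdatasets:
  - name: dataset-a
    tags: [health]
hfiles:
  - path: file-a.csv
"""

CHILD_KEYS = ("hdatasets", "hfiles", "hrecipes")   # items that inherit silo tags


def propagate_silo_tags(doc: dict) -> dict:
    silo_tags = doc.get("hsilo", {}).get("tags", [])
    for key in CHILD_KEYS:
        for item in doc.get(key, []):
            merged = set(item.get("tags", [])) | set(silo_tags)
            item["tags"] = sorted(merged)
    return doc


print(propagate_silo_tags(yaml.safe_load(DOCUMENT)))
# dataset-a ends up with tags ['baseline', 'health']; file-a.csv with ['baseline']
```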

About hsilos and how to "select by them" (Ansible uses the concept of host groups)

I'm thinking about not actually forcing a single hsilo to have a unique name (like a unique ID), but tolerating (maybe strongly recommending, or at least making it easier so users tend toward this) that, by labeling a silo with a group, that group makes every hsilo with that label actually "part of the same silo".

This would work somewhat like tags, but groups (using this approach) would only apply at the top level. Anyway, if the user really wanted an exact id for a file, they could simply create a very unique group name (or force a URN, which in this case could act like a prefix... but if we document using a unique URN as base, then this would break the concept of hsilos as a single silo spread over different files, hmm...).

End comments

The JSON Schema (the file used to help validate the YAML files while using applications like VSCode) can actually help to enforce what can or cannot be in each file. So, whatever the implementations become, by using this helper the user can get feedback without waiting to run hdpcli and receive errors.
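
On the editor side the JSON Schema is consumed directly (e.g. by a YAML extension), but the same check can be run programmatically; the schema below is a tiny stand-in, not the real HDP schema:

```python
import yaml
from jsonschema import validate, ValidationError

SCHEMA = {
    "type": "object",
    "properties": {"hsilo": {"type": "object"}},
    "required": ["hsilo"],
}

document = yaml.safe_load("hsilo:\n  tags: [baseline]\n")

try:
    validate(instance=document, schema=SCHEMA)
    print("hmeta document looks structurally valid")
except ValidationError as error:
    print("schema problem:", error.message)
```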

fititnt commented 3 years ago

Before

[Screenshot from 2021-03-15 16-46-52]

Now

[Screenshot from 2021-03-15 16-47-12]

fititnt commented 3 years ago

Just created the issue [meta] HDP Declarative Programming (working draft) #16.

Considering what could be production-ready in the short to medium term, even if an abstraction using only YAML, with this domain-specific language HDP, would not be as powerful as plugins written directly in Python code itself, this may be more realistic: it does not require people not only to start using HXL in these contexts, but also to allocate individuals who would be able to build it in Python, without those people being scared that the data themselves are very sensitive.

1. Part of the auditing functionality could be moved to filters instead of requiring custom Python code (needs testing)

Some extra points do exist (they are still relevant when some YAML file overrides a default behavior) but they would be even more essential with plugins in plain Python: the code would have to be even more strictly audited than if at least the most common features were already possible using a DSL-like language.

If we manage to draft reusable HXL data processing specs in YAML that can be challenged with testing data (even if such tests would not need to become public), this could help to spot common errors. "Errors" like a customized rule letting private data pass could be ignored if the human is able to check that the authorization allows it.

Note that this type of test is not applicable to all types of data sharing. But in cases where more explicit restrictions do exist, it could be used.
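
A sketch of such a challenge test: after the recipe has been applied to the test input (by HXL-proxy or libhxl, not shown here), assert that no blocked hashtag survives in the output. The blocklist, the file path, and the assumption that libhxl columns expose .tag / .display_tag (and that local files need allow_local=True) are mine, not the project's:

```python
import hxl

BLOCKED_TAGS = {"#contact", "#beneficiary"}      # hypothetical sensitive tags
OUTPUT_FILE = "tests/output/recipe-result.csv"   # hypothetical test artifact


def assert_no_blocked_columns(path: str):
    # allow_local=True is typically needed for local files in libhxl-python
    output = hxl.data(path, allow_local=True)
    leaked = [c.display_tag for c in output.columns if c.tag in BLOCKED_TAGS]
    assert not leaked, f"recipe leaked blocked columns: {leaked}"


assert_no_blocked_columns(OUTPUT_FILE)
```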

2. Considering the idea that files which mention datasets don't, by default, require them to be in the same folder

Weeks ago one screenshot showed a way to express datasets inside the current folder. If urnresolver (Uniform Resource Names - URN Resolver #13), plus conventions on how to represent a URN under some base folder on local disk, becomes viable, the use of HDP-like instructions would never store the data themselves where the HDP files are.

This approach could solve the problem of storing files on separate disk partitions (or maybe S3-like storage) while the metadata files could be handled differently.
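
A sketch of one possible convention for mapping a URN to a path under a base folder that is kept apart from the HDP files; the one-directory-per-component layout is a guess, not the urnresolver spec:

```python
from pathlib import Path

BASE_DIR = Path("/var/hxlm/urn-data")          # could be another partition


def urn_to_local_path(urn: str, base_dir: Path = BASE_DIR) -> Path:
    if not urn.startswith("urn:data:"):
        raise ValueError("expected an urn:data: identifier")
    components = urn[len("urn:data:"):].split(":")
    return base_dir.joinpath(*components)


print(urn_to_local_path("urn:data:xz:example:dataset1"))
# -> /var/hxlm/urn-data/xz/example/dataset1
```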

3. Avoiding defining new keywords by... simply having translations for every language people care to use (or allowing someone trusted to provide a file that adds missing terms)

One very hard decision I discovered when planning, for example, the best hashtags to use when sharing datasets with @HXL-CPLP is that this choice is sometimes hard if we try to find something that is more universal.

With HDP, the drafted idea is that, in addition to the internal terms (which, if using Latin script, are... Latin), each keyword has both one canonical term to translate to for a known language and, when converting from such a language, some extra aliases it can understand. To simplify translations a lot, most HDP keywords are single, somewhat primitive words. In special cases, for macrolanguages (both Arabic and Chinese), this means I already know that some terms are impossible and the variants will eventually need to be implemented. But at least we already have, from the start, something that allows localization!

As complicated as it may sound to tolerate such a level of localization, considering the time needed "to fix" things, this seems easier to fix permanently than the alternatives. Also, the fact that the terms can be in people's native language simplifies the documentation a lot.
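
A sketch of normalizing localized keywords to canonical internal terms; the alias table is invented purely for illustration and is not the real HDP vocabulary:

```python
ALIASES_TO_CANONICAL = {
    # invented canonical (Latin-script) term : accepted localized spellings
    "datum": {"dataset", "conjunto de dados", "jeu de données"},
    "fontem": {"source", "fonte"},
}

LOOKUP = {alias: canonical
          for canonical, aliases in ALIASES_TO_CANONICAL.items()
          for alias in aliases}


def normalize_keys(document: dict) -> dict:
    """Rewrite top-level localized keys to their canonical internal terms."""
    return {LOOKUP.get(key, key): value for key, value in document.items()}


print(normalize_keys({"conjunto de dados": {"nome": "exemplo"}}))
# -> {'datum': {'nome': 'exemplo'}}
```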

End comments

The three points above may give an idea of why the ideal of full "Declarative Programming" (here meaning an abstraction over how things are really done) can actually be harder to implement, but may be less hard than the alternatives.

At a bare minimum, it provides some level of sandboxing compared to allowing full Python. Also, the extra requirements/restrictions may help to implement early what is viable to be used in production.

And, in the context of HXL, the idea of, for example, allowing localized terms to express commands is actually feasible for a programming interface (like Excel formulas, for example, which often are in the person's native language), but it is not feasible when deciding good reusable hashtags for datasets. I mean: the number of possible keywords in a programming interface is controlled.

That's it!