THCLab / oca-ecosystem

European Union Public License 1.2

Handling multiple encodings #1

Open swcurran opened 2 years ago

swcurran commented 2 years ago

We have come across scenarios where there are multiple layers of data types, encodings, and formats, and I'm wondering how best to express those in OCA. Here are some examples:

It seems like there is flexibility in how the overlays (data type, encoding, format) can be combined, but I'm not sure about the general principles for applying them. Are data type and encoding constrained, with everything else going in the format? If so, how should format be used?

blelump commented 2 years ago

@swcurran ,

  1. To be discussed later.
  2. In the capture base: binary; in the encoding overlay: base64; in the format overlay: application/json.
  3. When you first capture this information, you'd use two form fields: one that provides the hashlink itself and a second that says what type of hashlink it is, or what it is about. In that case you need two attributes in the OCA Capture Base to describe it. If, however, you always assume that the hashlink is an image, then this is out of OCA's scope.
  4. Is the date always given as a number, e.g. 20220727, or is that incidental? If the latter, you should first align it to the defined attribute type and formatting. We're now working on a tool that would deal with such cases, but it doesn't yet cover every kind of resource, i.e. whether the source is a CSV file or a PSQL DB.
  5. Basically the same as the 2nd point.
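
For the 2nd point, the store/retrieve round trip implied by that overlay combination might look like this minimal Python sketch (the payload is hypothetical, just to show the layering):

```python
import base64
import json

# Hypothetical attribute: capture base type "binary", encoding overlay
# "base64", format overlay "application/json".
payload = {"given_name": "Alice", "age": 30}

# Store: serialize to JSON (the format), then base64 encode (the encoding).
stored = base64.b64encode(json.dumps(payload).encode("utf-8")).decode("ascii")

# Retrieve: reverse the layers -- base64 decode, then parse as JSON.
restored = json.loads(base64.b64decode(stored))
```

The format overlay tells a consumer how to interpret the decoded bytes; the encoding overlay only tells it how to get those bytes back.
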
blelump commented 2 years ago

As for 1st point,

I see the issue. We want to be explicit about what the schema owner can define here, especially because we want to enforce that captured data is compliant with the schema. This is possible only if we control what is being captured; otherwise any validator we provide will not be able to ensure that captured data is what it claims to be. So what we were thinking is to support a set of defined/allowed mime types for the format overlay. For an image, for example, you could allow image/png, image/jpeg, ...
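
A validator along those lines might be as simple as a whitelist check; the names here are hypothetical, not an actual OCA API:

```python
# Hypothetical check for a format overlay that whitelists image mime types.
ALLOWED_FORMATS = {"image/png", "image/jpeg"}

def validate_format(declared_mime: str) -> bool:
    """Accept a captured value only if its declared mime type is allowed."""
    return declared_mime in ALLOWED_FORMATS
```
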

swcurran commented 2 years ago

Some comments related to the numbers I used in the first note.

1. Yes, I agree with your solution -- to a degree. But my point is that it may take several layers to get to the final data. I've noted two in the example -- base64 and image -- so at least those are needed, but in theory there could be more. We want the parties to know (a) the type of the content in its usable form, and (b) all the steps needed to get from there to storage and back.

  3. I disagree. The field is a single attribute; to store it, you have to take an image, base64 encode it, hash it, publish it to a URL, construct the hashlink (from the URL and the hash), and put that in the attribute. To extract it, you have to do the reverse -- resolve the URL to get the object, hash it, verify the hash, base64 decode it, and process/display the image.

  4. The field is always a date. We use that format because AnonCreds only allows predicates on integers, so by putting a date into that format, we can do a predicate. I'm hoping you are familiar with AnonCreds predicates :-).
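
The date-as-integer trick can be illustrated in Python; this is just the encoding idea behind such predicates, not the AnonCreds API itself:

```python
from datetime import date

def date_to_int(d: date) -> int:
    """Encode a date as the integer YYYYMMDD, e.g. 2022-07-27 -> 20220727.

    Integer ordering of YYYYMMDD values matches chronological ordering,
    which is what makes integer-only predicates usable on dates.
    """
    return d.year * 10000 + d.month * 100 + d.day

# A predicate like "born on or before 2004-01-01" becomes an integer comparison.
birth = date_to_int(date(2001, 5, 14))
cutoff = date_to_int(date(2004, 1, 1))
older_than_cutoff = birth <= cutoff
```
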

blelump commented 2 years ago

ad. 1: Then it is precisely: capture base: binary, encoding overlay: base64, format overlay: image/jpeg. You use base64 to store the BLOB, and in the format overlay you define what exactly this base64 represents.
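
A minimal sketch of that layering in Python, assuming a JPEG BLOB (the bytes here are a fabricated JPEG-like prefix, purely for illustration):

```python
import base64

# Hypothetical stored value: a JPEG BLOB, base64 encoded per the encoding
# overlay. The format overlay (image/jpeg) tells the consumer how to
# interpret the decoded bytes.
jpeg_like_blob = b"\xff\xd8\xff\xe0" + b"\x00" * 8  # fabricated JPEG-like prefix
stored = base64.b64encode(jpeg_like_blob).decode("ascii")

# Consume: undo the encoding layer, then sanity check the declared format
# against the JPEG magic bytes.
raw = base64.b64decode(stored)
is_jpeg = raw.startswith(b"\xff\xd8")
```
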

ad. 3: Let's start from the beginning. Assume you capture information, i.e. an image BLOB, as a file field in a form, and you whitelist any type of image. To achieve that via the form renderer, the capture would look like: capture base: binary, encoding overlay: base64, format overlay: image/jpeg (this is not quite "any type of image", but we'll get to that). From the capturing perspective this makes perfect sense.

Now we add another requirement: this image is stored somewhere and is resolvable via some hashlink. Notice that the hashlink is an implementation detail here. In essence it serves as a 3rd-party data source that you use to store the BLOBs and resolve them via hashlinks. You'd do the same with, for example, IPFS, or even AWS S3 if you used it as a data store. In other words, where you store the data is not an OCA concern, because OCA is data store agnostic. OCA is purely about context-preserving data capture, and it doesn't care where the data is actually stored.
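
The data-store-agnostic idea above can be sketched as a pluggable resolver; `InMemoryStore` and its method names are hypothetical, standing in for an IPFS or S3 backend behind the same interface:

```python
import hashlib

class InMemoryStore:
    """Hypothetical content-addressed store; the digest stands in for a hashlink."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def put(self, blob: bytes) -> str:
        """Store a BLOB and return its content address."""
        digest = hashlib.sha256(blob).hexdigest()
        self._blobs[digest] = blob
        return digest

    def resolve(self, digest: str) -> bytes:
        """Resolve a content address and verify the BLOB's integrity."""
        blob = self._blobs[digest]
        if hashlib.sha256(blob).hexdigest() != digest:
            raise ValueError("content does not match its hash")
        return blob

store = InMemoryStore()
link = store.put(b"some image bytes")
resolved = store.resolve(link)
```

Swapping in a different backend changes only the store, not the captured schema.
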

OCA could provide resolution from various data sources, and furthermore it could provide additional overlays to preserve this information, but that is a layer above OCA. By analogy with the onion architecture: each inner layer of the onion has a different responsibility, with the core domain at the centre. OCA is the domain description -- the core of the onion -- and the upper layers take care of persistence, capturing, and so on.

Let us think about new overlays for that.

swcurran commented 2 years ago

So, in understanding item 3, you are saying that OCA captures the data after you retrieve it, and it is up to the implementation to detect whether the item is embedded in the data structure or whether the data structure contains a link to the item. Have I got that right?