DHARPA-Project / kiara-website

Creative Commons Zero v1.0 Universal

What is the kiara store? #12

Open caro401 opened 7 months ago

caro401 commented 7 months ago

Ultimately, I think this discussion needs to end up in a big conceptual docs page which addresses things including but not limited to

There also needs to be a small summary and a link to this on the glossary page.

see also https://github.com/DHARPA-Project/kiara-website/pull/11#pullrequestreview-1734226183

I don't know the answer to any of these questions. Are there any other things I didn't know to ask about? Are there any existing docs about this that I didn't find?

makkus commented 7 months ago

Ok, one thing in advance: not all of this is implemented (or even thought through) yet, so I'll only speak to the things I have already implemented and am fairly sure will mostly stay that way. I might add some comments in other areas (and mark them clearly), but please don't mistake that for a plan or strategy; those are things that need to be discussed/finalized. They can still be included in the docs, but we need to make sure they don't become stale and outdated.

Also, I'll be answering fairly technically, so ideally someone would filter and translate the relevant parts into non-dev-speak.

why does kiara have a store, why can't it just work with your files on disk? relate this to data lineage and maybe caching?

The main reason is really that kiara needs to have full control over the byte blobs it deals with, and must make sure that not a single byte has changed. If a byte did change because some external entity did something to the file, all the metadata kiara has about it would be invalid, and kiara would have no way of knowing. In theory, kiara could hash files every time it touches them to make sure that didn't happen, but that would be very inefficient. It also wouldn't help with the fact that all downstream operations that have been done with a particular input file would be invalid or out of date, and we would have no way of 'proving' that a given input produces a given output (because we lost access to the original data).
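To make the hashing point concrete, here is a minimal sketch of what "hash the file every time you touch it" would involve. It uses only the Python standard library and is not kiara code; the function names `file_digest` and `is_unchanged` are made up for this illustration.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 65536) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    hasher = hashlib.sha256()
    with path.open("rb") as f:
        # The whole file has to be read on every check, which is the
        # inefficiency mentioned above.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def is_unchanged(path: Path, recorded_digest: str) -> bool:
    """Check whether the file still matches a digest recorded earlier."""
    return file_digest(path) == recorded_digest
```

Note that a failed check only tells you the file changed. It does not bring back the original bytes, so results computed from the earlier version can no longer be reproduced.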

One thing that is important to note is that if you store a value, kiara will also store every input and intermediate value that was used to create that value. This is important because otherwise we'd lose the integrity of the value's lineage (basically a broken cold-chain for data). In addition, kiara also records metadata about the operations that were used, and the environment the operations ran in (but that is less important for this question I guess).
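The "store a value, store its whole lineage" behaviour can be pictured with a toy in-memory model. This is only a sketch under assumptions: `ToyValue` and `ToyLineageStore` are hypothetical names invented here, and kiara's real data model and store are more involved.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToyValue:
    """A value plus references to the values it was computed from."""

    data: bytes
    parents: List["ToyValue"] = field(default_factory=list)
    value_id: str = field(default_factory=lambda: str(uuid.uuid4()))


class ToyLineageStore:
    """Toy store that never keeps a value without its ancestors."""

    def __init__(self) -> None:
        self.values: Dict[str, ToyValue] = {}

    def store(self, value: ToyValue) -> None:
        if value.value_id in self.values:
            return  # already stored, together with its lineage
        self.values[value.value_id] = value
        # Recurse into inputs and intermediate values so the chain
        # from original data to result is never broken.
        for parent in value.parents:
            self.store(parent)
```

Storing only the final result of an import -> table -> result chain in this model leaves all three values in the store.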

how does onboarding/import relate to the store?

All the operations with 'import' in their names basically copy data that is external to kiara into a temporary location (often memory) and subsequently, if the API user chooses to 'store' a value, into the internal kiara data store, so the problem I described above can't happen.

what does this mean for your computer's storage requirements etc?

Every import incurs some storage cost (duplication of the bytes that get imported). In some cases that is not relevant because we download a file from a remote location so we would have to pay that cost anyway. But for local files, it means doubling the amount of storage a dataset takes up (unless the user manually deletes the original file on their filesystem or doesn't use 'store').

what do you need to think about when deciding whether you should store something? When should you store a thing, when shouldn't you?

This answer very much depends on what you are trying to do, and whether you use the Python API as an end user or as a client developer. I'd imagine values would only be stored if a user wants to keep a specific result they are happy with, or one that is relevant in some other way. Imports of external files are another candidate, but that would depend a bit on how that part of the app is designed. kiara stores any parent values of a value that is requested to be stored via the API, which would also include the value at 'import' time, so we might not need/require an explicit store command for that.

What can you store, what metadata can/should go with the stored item (aliases?), an example of how to do that via the python API

At the moment, aliases are the only user-specified (i.e. not automatically collected) metadata supported when storing a value. We probably want to have more options here in the future (comments, notes, authors, ...). Even aliases are not really 'fixed' yet, since it's an area where I was waiting for frontend developers to share their opinions/ideas. Currently, an alias is a string value (no special chars except '.', '_', '-') that makes sense to a user so they can find the aliased value later on. Otherwise, users would have to deal with uuids, which are impossible for humans to keep track of.

Aliases can be overridden by the user, to point to a new/updated value. Currently, aliases are not versioned, but there is some placeholder code to make that possible in the future if there is a requirement. Also, multiple aliases can point to the same value.
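The alias rules described above (restricted characters, overridable, unversioned, several aliases per value) can be sketched as a plain mapping from alias string to value id. This is an illustration, not kiara's implementation: `ToyAliasRegistry` is a made-up name, and reading "no special chars" as "ASCII letters and digits plus '.', '_', '-'" is an assumption.

```python
import re
import uuid
from typing import Dict, Set

# Assumption: letters and digits are ASCII; '.', '_' and '-' are the
# only other characters allowed.
ALIAS_PATTERN = re.compile(r"[A-Za-z0-9._-]+")


class ToyAliasRegistry:
    """Maps human-friendly alias strings to value ids."""

    def __init__(self) -> None:
        self._aliases: Dict[str, uuid.UUID] = {}

    def set_alias(self, alias: str, value_id: uuid.UUID) -> None:
        if not ALIAS_PATTERN.fullmatch(alias):
            raise ValueError(f"invalid alias: {alias!r}")
        # No versioning: an existing alias is simply re-pointed.
        self._aliases[alias] = value_id

    def resolve(self, alias: str) -> uuid.UUID:
        return self._aliases[alias]

    def aliases_for(self, value_id: uuid.UUID) -> Set[str]:
        # Several aliases may point to the same value.
        return {a for a, v in self._aliases.items() if v == value_id}
```

In this model, re-pointing an alias loses the old association, which is exactly what versioned aliases would change.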

The reason aliases are not finalized yet is because I think this is one of the central UX 'themes' in kiara (how to pick/reference/manage datasets), and I can see several different options that would have to be implemented in non-compatible ways (flat, hierarchical, namespaced, ...). I'm hoping that having to implement a real-world gui will spark some ideas or point to the best way of doing references to datasets.

Via Python, the easiest way to store a value with one or multiple aliases is via the store_value API endpoint:

api.store_value(file_result, "alias_from_python")

how does the store relate to contexts? there's a separate one per context?

Currently, each context has its own data store. There is some code to prepare for multiple data stores per context, but that's not implemented in any usable way at the moment. In the future, it might be important to be able to access multiple data stores (either within a single context, or across multiple contexts -- in the latter case those would probably be read-only), but that would have to be designed and implemented once we come across a use-case.

what is the kiara store on a technical level (it's a database? what does it do with caching, does it ever clean itself up etc)? Where does it live on your disk, what can you do if things go wrong with it?

This is sub-API level, so anything here should go into a different section of the docs. At the moment a kiara data store is a Python interface/base class that can be extended to store data in specific ways. Technically, we have archives (read-only) and stores (read-write). The only implementation that is used at the moment is one that uses a folder in the user's home directory to store the actual serialized bytes of each value. The location is OS-dependent (use kiara context info config print to see where it is for the current context), but neither users nor developers should ever have to manipulate it directly. If things go wrong, currently the only thing you can do is delete the context it belongs to (that can also be done via the CLI: kiara context delete -- can't remember if that is exposed via the API), but that is obviously a very brute-force way of solving problems. In the past, new versions of kiara often came with a new format for how data was stored, so stores were incompatible between versions, but that is hopefully a thing of the past by now.
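The archive/store split can be pictured as a read-only base interface with a read-write extension, plus one folder-backed implementation. The sketch below shows the general pattern only: `SketchArchive`, `SketchStore`, `FolderStore` and their methods are hypothetical names, not kiara's actual classes or signatures.

```python
from abc import ABC, abstractmethod
from pathlib import Path


class SketchArchive(ABC):
    """Read-only access to serialized values."""

    @abstractmethod
    def retrieve_bytes(self, value_id: str) -> bytes:
        ...


class SketchStore(SketchArchive):
    """Read-write: everything an archive can do, plus storing."""

    @abstractmethod
    def store_bytes(self, value_id: str, data: bytes) -> None:
        ...


class FolderStore(SketchStore):
    """Keeps each value's serialized bytes as one file in a folder."""

    def __init__(self, base_dir: Path) -> None:
        self.base_dir = base_dir
        self.base_dir.mkdir(parents=True, exist_ok=True)

    def store_bytes(self, value_id: str, data: bytes) -> None:
        (self.base_dir / value_id).write_bytes(data)

    def retrieve_bytes(self, value_id: str) -> bytes:
        return (self.base_dir / value_id).read_bytes()
```

With this shape, code that only reads can accept any `SketchArchive`, and other backends could be added by subclassing without touching callers.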

Currently, there is no way of deleting single values/aliases from a kiara data store. That is on my TODO list, but it is non-trivial and I'd prefer to figure out data export first before I tackle this.

One important technical detail is serialization, i.e. how data is turned into bytes before it is stored. kiara implements this in a type-dependent way, which means that every data type has to implement its own serialization (or inherit it from a parent).

It's fairly important to implement that in an efficient way, so the kiara store can de-duplicate data that is the same. This is a much larger topic to talk about, and probably needs its own section in our docs (how to create your own data-type). That's something I need to write myself, but I'd prefer to wait until we have a basic structure of docs because I expect I'll need to link to a lot of other stuff, and also it's not something I expect anyone but myself to be doing in the near future.
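One common way to get de-duplication from serialization is content addressing: serialize deterministically, hash the bytes, and use the hash as the storage key. The sketch below shows that general technique; it is not a description of how kiara does it. `serialize` and `ToyBlobStore` are made-up names, and JSON stands in for a real type-specific serializer.

```python
import hashlib
import json
from typing import Any, Dict


def serialize(value: Any) -> bytes:
    """Deterministic serialization: equal values give equal bytes."""
    # sort_keys and fixed separators remove incidental differences
    # such as dict key order and whitespace.
    text = json.dumps(value, sort_keys=True, separators=(",", ":"))
    return text.encode("utf-8")


class ToyBlobStore:
    """Stores each distinct byte blob once, keyed by its hash."""

    def __init__(self) -> None:
        self.blobs: Dict[str, bytes] = {}

    def put(self, value: Any) -> str:
        data = serialize(value)
        key = hashlib.sha256(data).hexdigest()
        # Identical bytes map to the same key, so nothing is duplicated.
        self.blobs.setdefault(key, data)
        return key
```

The catch is that de-duplication only works as well as the serializer: if two equal values can serialize to different bytes, they are stored twice.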

CBurge95 commented 7 months ago

Just a lingo clarification on my front: are aliases the equivalent of variables in Python? Or something different? (Trying to translate into my 'knowns' to also try and explain beyond this)

makkus commented 7 months ago

Yeah, roughly equivalent I'd say.