DHARPA-Project / kiara-website


Kiara stores, and the export feature #28

makkus opened this issue 4 months ago

makkus commented 4 months ago

Placeholder issue for documentation about the upcoming export feature, incl. a writeup about kiara stores.

makkus commented 4 months ago

To put the export feature into context, first a quick introduction to kiara stores.

kiara's central context management happens via several registries, each responsible for a different aspect of kiara's internal workings. For our purposes, the relevant ones are the data registry and the alias registry, described below.

Common to all of them is that, internally, they manage pluggable so-called archives & stores. I'll mostly be calling those 'archives' from here on out, but might mean 'store' on occasion. The (only) difference between the two is that archives are read-only, while stores can be written to. Also, you can create a store, write to it, and from then on treat it as a read-only archive if you want to make sure it doesn't change anymore, for example for, well ...archiving... purposes. A store can be used as an archive, but not the other way round.

Each registry has a default store that is used to store any instances of the items that registry handles, unless otherwise specified. Each archive has an ID (uuid), and is registered with a human-readable alias in the currently active context, which can be used to specify it explicitly if necessary. The default store's alias is default, but in most cases that does not need to be specified, as it's the ... default.

Data registry

The data registry is responsible for managing the actual data (bytes) of all input and output values. It assigns them (value-)IDs, validates their types, manages metadata, serializes and deserializes them, etc. Those last bits (serializing/deserializing) are the most important, because they allow kiara to persist values across user sessions (if users use the 'store' method on values).
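For illustration, here's a minimal Python sketch of storing a value via the API (the store_value endpoint is the one mentioned further below; the exact parameter names are from memory, so double-check the endpoint docs):

from kiara.api import KiaraAPI

kiara = KiaraAPI.instance()

# look up a value that already lives in the current context
# ('journals_tables' is just an example alias, pick one from 'kiara data list')
value = kiara.get_value("journals_tables")

# persist it under an (additional) alias in the default data store
# (the 'alias' parameter name is an assumption, check the store_value endpoint docs)
store_result = kiara.store_value(value, alias="journals_tables_backup")
print(store_result)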

Storing must happen into a store, and as mentioned above, that is usually the 'default' data store. Currently there exist two different implementations of the DataArchive class that defines the interface such an archive/store has:

- a filesystem-based one
- a sqlite-based one

The main difference is how the stored data is written to disk: the filesystem archive lays it out across a folder in the filesystem, the sqlite one keeps it inside a single sqlite database file. Both have advantages and disadvantages: the filesystem one is better for random access to the stored bytes, while the sqlite one is easier to share, since it's just a single file.

Up until now, only the filesystem archive type existed, and thus was the default type that was created automatically for a new kiara context. I will probably change that in the future, but for now I'll keep that default.

If you want to change the default type that gets created for new contexts, you can change the default_store_type in the kiara config file from filesystem to sqlite (if it does not exist, you have to add that key). The path to the config file can be found via kiara context explain -c (will work from kiara >=0.5.10).
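For example, a minimal excerpt of what that setting would look like (assuming the file that kiara context explain -c points to is the usual YAML config):

# kiara config file (path via 'kiara context explain -c')
default_store_type: sqlite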

Alias registry

Most things that I described for the data registry are similar for the alias registry. We have the same two archive types, filesystem & sqlite, with filesystem being the default (for now).

The alias registry is basically a key-value store that lets users assign human-readable, (to them) meaningful aliases to value IDs (of type uuid). It's similar to how filenames map to inodes on a filesystem, just much simpler.
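To make the key-value nature a bit more concrete, a small sketch using the API (list_alias_names is the endpoint mentioned further below; get_value and the value_id attribute are as I remember them, so double-check against the docs):

from kiara.api import KiaraAPI

kiara = KiaraAPI.instance()

# every alias is just a human-readable key pointing at a value id (uuid)
for alias in kiara.list_alias_names():
    value = kiara.get_value(alias)
    print(alias, "->", value.value_id)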

MariellaCC commented 4 months ago

I don't know if the following questions belong here (please let me know if not and I will reproduce this comment where it belongs):

Does the former data store (https://dharpa.org/kiara.documentation/latest/usage/getting_started/#checking-the-data-store) still behave the same way, i.e.: are the recommended ways to access data from the CLI still: kiara data list, kiara data explain data_item_name and kiara data explain --properties data_item_name ?

I am trying to find out the date and time at which a data item was added to the "store" (or archive), how should I do it?

makkus commented 4 months ago

Yeah, as long as there is a (sub-)command in the cli that works, it's a supported way to do things. You can also use --help with every command; if the help text displayed there is not sufficient, open a ticket and I'll update it.

(Sidenote: the 'develop' plugin also adds some sub-commands; those are not really that thought through, but the usual suspects context, data, data-type, module, operation, pipeline & run are.)

At the moment kiara does not record any time/date related to values. That is a good example of stuff you'd add to your list of requirements (independent of whether it already exists or not).

makkus commented 4 months ago

Ok, following is some documentation about the initial version of the export feature. This was a big change with lots of code, and there are bits and pieces I'm not sure I got right yet in terms of design/interface, so consider this an initial version for testing and feedback. I don't expect the actual archives that get exported now to still work in later versions; their internal structure is very likely to change a bit, and it's not feasible to keep them compatible for now. This will of course change before we hit production.

Overall, I also expect there to be quite a few little things and issues I missed, as the surface of the feature is surprisingly large. I intend to squash those with the help of your bug-reports and still-to-be-written tests.

I'll release a release-candidate version of kiara soonish, so people can try it all out, but wanted to make sure there's some docs first. I'll let you know when that is the case.

kiara archives

As usual, naming things was really hard, because a lot of concepts that would be good names were already taken and mean something within kiara already. I think (hope) it will be safe and minimally confusing to use 'archive' in general, even though internally its meaning is slightly different. But those inner workings should not really be exposed to intermediate or even advanced API users (knocking on wood).

So, in our context here, a 'kiara archive' means a file (usually with the extension .kiarchive) which is a sqlite database and contains data as well as (optionally) alias information. You can export to an archive, which basically means you copy data and/or aliases from the default store(s) in a kiara context into an archive file. And you can 'import' an archive, meaning you copy the data/aliases contained in an archive into the default store(s) in your current kiara context.

(as a reminder, the current context is just a sort of workspace containing the stuff you are currently working on. Whether we want users to be able to change between/manage those is a question we haven't answered yet, and that will hopefully come from some frontend requirements)

archive export

Assuming you have data in your current kiara context (as well as maybe some aliases), and you want to export those into an archive file, you can use the cli as follows:

kiara archive export backup.kiarchive

This will create a file backup.kiarchive with all of your data and aliases. If you don't want to store aliases in the archive, you can use the --no-aliases flag.

If you want to append the data to an existing archive, you need to use the --append flag, otherwise kiara will tell you off.
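For example, appending the current context's data to an existing archive without copying any aliases (just combining the two flags from above):

kiara archive export backup.kiarchive --append --no-aliases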

Via the API you can use the export_archive endpoint. Check the code docs for this (sorry, can't really link to the right line because that will change); if some information is missing or unclear, let me know.

An example would be:

result = kiara.export_archive("my_backup.kiarchive")
print(result.errors)

The result here will tell you details about how the storing went, and whether there were any errors. It is of type StoreValuesResult, which is basically a dict containing single StoreValueResult items as values ( https://github.com/DHARPA-Project/kiara/blob/develop/src/kiara/interfaces/python_api/value.py ), in case you want to see what information you can get from it. errors is the important one, though. As always, best to use your IDE to jump to the source code of the types in question and have a look.

archive import

This works very similarly to export, except you use an existing external archive to import its data into your current kiara context:

kiara archive import backup.kiarchive

The only option here is --no-aliases, which does the same as it does for export.
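For example, to import everything from an archive but skip its aliases:

kiara archive import backup.kiarchive --no-aliases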

API usage:

result = kiara.import_archive("my_backup.kiarchive")

Again, check the API endpoint docs for more info.

Archive info

To see some details about an archive, you can use:

kiara archive explain my_backup.kiarchive

Or:

result = kiara.retrieve_archive_info("my_backup.kiarchive")
dbg(result)

(dbg is a helper function I wrote that should be available without needing to import it)

The result type would be KiArchiveInfo ( https://github.com/DHARPA-Project/kiara/blob/develop/src/kiara/models/archives.py ).

makkus commented 4 months ago

Exporting one or several values

In order to export one or several values into a new (or existing) archive, you can now do this via the cli:

kiara data export journals_tables <optional_other_value_ids/aliases>

This command has a few additional options, so check out its help for more info:

kiara data export --help

Via the API, you would use the export_values endpoint:

values = ["journals_nodes_table", "journals_tables"]
result = kiara.export_values("my_archive.kiarchive", values=values, alias_map=True)
dbg(result)

Instead of aliases (strings) you can also use Value instances or results of computations, like:

values = {"journals": journals_val_variable_name, "nodes": other_value_var}
result = kiara.export_values("my_archive.kiarchive", values=values, alias_map=True)

The alias_map argument here is a bool and tells kiara to also store the aliases as they are into the new archive (in case of a dict, it uses the keys). As this has quite a few potential different ways of being used, as always, check the endpoint docs for more in-depth information.

This will automatically also store all values that are somehow related to the ones you specify (inputs in their lineage tree, properties, etc.).

makkus commented 4 months ago

Importing only a subset of values from an existing archive

In order to import one or several values from an external archive into the default kiara store, you can use the cli like:

kiara data import <archive_file> <alias_or_value_id_in_archive> [optionally more alias or value ids]

So, for example, importing the alias y from the archive export_test.kiarchive would look like:

kiara data import export_test.kiarchive y

Via Python, the same thing could be done via the import_values endpoint (again, look up that endpoints docs for more details):

result = kiara.import_values("export_test.kiarchive", "y", alias_map=True)
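After that, the imported value should be available in your current context under its alias, so (continuing the snippet above, with the get_value usage being an assumption) something like this should work:

imported = kiara.get_value("y")
dbg(imported)
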
makkus commented 4 months ago

Archive registration

This is a quick description of how an implementation detail of the import/export feature works, but keep in mind that this could change in the future; it's just a high-level overview so the reader can get an idea of how this is done internally.

In order to copy values to/from an archive, kiara internally 'mounts'/'registers' the archive as both a new data and alias store. While doing that, it gives it a temporary, internal name (an alias, really, but I don't want to create confusion with the other meaning of 'alias' here). This name is used in the store/data_archive/alias_store parameters of the store_value & store_values endpoints, as well as the source_registered_name & target_registered_name parameters of some of the other new endpoints listed above. By default, if not explicitly specified, kiara auto-generates that name from the archive file name.

In the case of a new alias archive being registered into the kiara context, you can now try the list_alias_names endpoint; the result should contain all aliases from the external archive, prefixed with <archive_name>#, so the full alias within the kiara context would be <archive_name>#<actual_alias>.

Currently, all the new API endpoints expect the external archive to not be registered yet (except for store_value and store_values, but those existed before this feature was introduced), and will register the archive themselves. This will be extended later so already registered archives can also be used, but for this I need to think through some edge-cases, and how to handle them.

To manually register a new archive, you can use the register_archive API endpoint. Check that method's documentation to get a better understanding of its parameters.
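A rough sketch of how that could look (passing the archive path as the only argument is an assumption, so check the endpoint docs):

# register the external archive; kiara derives the registered name from the file name
kiara.register_archive("export_test.kiarchive")

# aliases from the archive now show up prefixed as '<archive_name>#<actual_alias>'
for alias in kiara.list_alias_names():
    print(alias)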

As I said, this is an implementation detail, so it's not 100% necessary to understand exactly how this works. But if you think this is something you need to understand, ping me and I'll expand on what I wrote here.