DHARPA-Project / kiara-website

Creative Commons Zero v1.0 Universal

kiara features #29

Open makkus opened 4 months ago

makkus commented 4 months ago

Here is a condensed version of the main kiara features. You should already have come across most of them if you have used kiara in the past and went through the existing tutorials I wrote (the information in those is complementary to this -- I might not have included stuff here that is contained in there). The information here can be found in more detail (mainly) via the KiaraAPI source code, or discovered via the cli (using the --help option of the kiara command and sub-commands).

As we've discussed before, the more important features are the ones that come from frontend requirements via the use-cases, those should be the 'official' ones. The backend won't be our product, the frontend(s) will be.

So there is a good chance I have missed some obvious, necessary features (since I haven't gotten real frontend requirements/specs yet), and in some cases I did deliberately not implement anything (like notes/comments) because the design of necessary API endpoints would depend entirely on how the frontend implements a specific feature. To that end, it's also important to note (again) that this API is the result of me guessing the endpoints a frontend might want to use, and designing them in a way so that changes can be made relatively easy. Which also means that I sort of assumed we can adjust the endpoints once frontend development gets on its way, and 'real' requirements crop up.

Obviously, that won't be possible every time, but given the constraints in the project it's something I tried to account for. So if you come across requirements that are not implemented, or slightly different to existing endpoints, let me know and we can discuss and adjust.

makkus commented 4 months ago

Introspection

kiara lets frontends introspect basically every aspect of its internal workings; if something is not exposed, there is a good chance it can be added.

Objects that can be inspected include the current environment the backend is running in (Python env, OS, etc.), which plugins are installed and their details (documentation, which modules do they contain, data-types, operations, pipelines, archives, ..), the current config and runtime config. It also lets the frontend inspect which contexts are available (because the user created them some time before) and their details (like archives it contains), which context is currently used (and its details), and it will let users create and change to new contexts. A context is basically just a workspace with its own data values and aliases.

In most cases, kiara has two different methods to retrieve information about each of the internal objects, one quick one that contains only basic data, and one expensive one that contains basically everything kiara knows about the object.

For details on exactly which internal objects can be inspected, read the KiaraAPI source; it probably does not make sense to duplicate the information here.

makkus commented 4 months ago

Modules, Operations, Pipelines

Those are arguably the most important concepts within kiara, and introspection is available for all of them. Custom operation instances can be created via the API, a list of all pre-registered operations can be retrieved, and the details of each operation can be inspected. The same goes for pipelines, and modules. There are also methods to retrieve a filtered list of operations, pipelines, and modules, which is useful for specific use-cases like only presenting operations to the user that work with a specific file-type.

In addition, the API lets the user register pipeline structures, which in turn can then be used like any other operation. Pipeline config can be a local (to the backend) file, or a dictionary containing the pipeline config data directly.

kiara also has so-called operation types, which are basically categories operations can be sorted into. One operation can be of multiple types. One such category would be 'pipeline', for example, indicating to you that this operation contains multiple sub-operations. Another use is for operations that should have a known, same interface, regardless of their input data type, which could be useful in specific frontend use-cases maybe.
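
To make the filtering and operation-type ideas above a bit more concrete, here is a minimal, self-contained sketch. It does not use the kiara API at all: `OperationInfo`, `REGISTRY`, `filter_operations` and all operation ids are hypothetical names made up for illustration, and the real KiaraAPI endpoints may look quite different.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OperationInfo:
    """Hypothetical, minimal stand-in for the details kept per operation."""

    operation_id: str
    input_types: frozenset
    operation_types: frozenset  # one operation can belong to several types


# Hypothetical registry; the ids and types are invented for this example.
REGISTRY = [
    OperationInfo("example.import_table", frozenset({"file"}), frozenset({"import"})),
    OperationInfo("example.filter_table", frozenset({"table"}), frozenset({"filter"})),
    OperationInfo(
        "example.table_to_network",
        frozenset({"table"}),
        frozenset({"pipeline", "create_from"}),
    ),
]


def filter_operations(registry, input_type=None, operation_type=None):
    """Return the ids of operations that match every criterion that is given."""
    matches = []
    for op in registry:
        if input_type is not None and input_type not in op.input_types:
            continue
        if operation_type is not None and operation_type not in op.operation_types:
            continue
        matches.append(op.operation_id)
    return matches


# Only offer the user operations that accept a table as input.
table_ops = filter_operations(REGISTRY, input_type="table")
# Only list operations that are (also) of type 'pipeline'.
pipeline_ops = filter_operations(REGISTRY, operation_type="pipeline")
```

A frontend that hard-codes the operations it uses would not need anything like this; it only matters if the UI is supposed to discover what is available at runtime.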

makkus commented 4 months ago

Data / Values

kiara has its own data types, which as everything else can be inspected. kiara data types wrap around Python data types, and provide functionality that is used by the introspection features, as well as for purposes like validation, serialization, and deserialization. Also, in some cases to parse strings into the actual data, as strings are often the only way for users to directly input data (like in the case of the CLI, or yaml/json config files).

kiara lets frontends 'register' data, which involves creating said wrapper around it, after validating against the data type it is supposed to represent, calculating its hash, and assigning it a globally unique value id.
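
The registration steps described above (validate, hash, assign an id, wrap) can be sketched roughly as follows. This is a toy model under stated assumptions, not kiara's implementation: `RegisteredValue`, `VALIDATORS` and `register` are hypothetical names, and hashing `repr(data)` is a stand-in for the type-specific serialization a real backend would use.

```python
import hashlib
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class RegisteredValue:
    """Hypothetical wrapper; a real kiara value carries far more metadata."""

    value_id: uuid.UUID
    data_type: str
    data_hash: str
    data: object


# Hypothetical validators, one per data type name.
VALIDATORS = {
    "string": lambda d: isinstance(d, str),
    "integer": lambda d: isinstance(d, int) and not isinstance(d, bool),
}


def register(data, data_type):
    """Validate the data, hash it, and wrap it with a new unique value id."""
    validator = VALIDATORS.get(data_type)
    if validator is None:
        raise ValueError(f"unknown data type: {data_type}")
    if not validator(data):
        raise TypeError(f"{data!r} is not a valid {data_type}")
    # Assumption: repr() is good enough as a serialized form for this toy example.
    serialized = repr(data).encode("utf-8")
    data_hash = hashlib.sha256(serialized).hexdigest()
    return RegisteredValue(uuid.uuid4(), data_type, data_hash, data)


first = register("hello", "string")
second = register("hello", "string")
```

In this sketch, registering the same data twice yields two different value ids but the same hash, which is the kind of property a store could use to avoid keeping identical bytes twice.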

A kiara context keeps track of data using so-called 'data-archives' / 'data-stores', which are classes that implement the necessary management features. Currently kiara has two types of stores: sqlite and filesystem. Up until now 'filesystem' was the default, but that may/will change in the future (transparently to the user). Stores keep track of the values they contain, and are responsible for serializing and deserializing them. In most cases the details here are not important for end-users/front-end devs, as the API abstracts away the details of the stores. That might change in the future.

kiara also has a feature that lets users assign meaningful (to them) aliases to values, similar to how filesystems work for Operating Systems (filenames -> inodes). This might or might not be useful for a specific frontend, it's certainly possible to create one without using this feature.

The metadata that is contained in a kiara value includes its 'pedigree' (direct ancestor values & module type used to create it), 'lineage' (all ancestor values & module types used to create it), details about the environments it was created in (not fully implemented), its data type, pre-computed data-type specific properties, the hash for its serialized form, and a few other things I probably forgot.

From the next version onwards kiara also supports exporting one or several values into a file that can be shared with other people, and used by them to import those values (incl. their metadata) into their own contexts.

The API supports listing the ids of all values in a context, retrieving details about all of them, and retrieving a filtered list of them (e.g. only values that are of a specific data type). Deleting values is not supported atm, as it's surprisingly complex to implement and not a high priority for the use-cases I came across.

makkus commented 4 months ago

Job management

kiara contains a component that manages when/how the jobs the user wants to run are run. In most cases a frontend would use the 'queue_job' endpoint, which returns a reference (uuid) to the job, and which can be used to retrieve the status and eventually result of the job. The job manager also has introspection features, like listing all jobs, and retrieving details about a specific job. There also exists a 'run_job' endpoint, which blocks (so probably not suitable for a UI frontend), and returns the result of the job directly. In addition, there are also 'queue_manifest' and 'run_manifest' endpoints, which are lower-level and not recommended for use by frontends.
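
The difference between the blocking and the non-blocking pattern can be sketched with the standard library alone. The method names `queue_job` and `run_job` are borrowed from the description above, but everything else (`ToyJobManager`, `get_status`, `get_result`, `list_jobs`, `shutdown`, the status strings) is hypothetical and says nothing about how kiara's job manager is implemented or what its signatures look like.

```python
import time
import uuid
from concurrent.futures import ThreadPoolExecutor


class ToyJobManager:
    """Hypothetical stand-in for a job manager; not kiara's implementation."""

    def __init__(self):
        # A single worker: jobs run one after another, in a background thread.
        self._executor = ThreadPoolExecutor(max_workers=1)
        self._jobs = {}

    def queue_job(self, func, **inputs):
        """Submit a job and return its id immediately, without blocking."""
        job_id = uuid.uuid4()
        self._jobs[job_id] = self._executor.submit(func, **inputs)
        return job_id

    def run_job(self, func, **inputs):
        """Submit a job and block until its result is available."""
        return self.get_result(self.queue_job(func, **inputs))

    def get_status(self, job_id):
        """Return 'queued', 'running', 'finished' or 'failed' for a job."""
        future = self._jobs[job_id]
        if future.running():
            return "running"
        if future.done():
            return "failed" if future.exception() else "finished"
        return "queued"

    def get_result(self, job_id):
        """Block until the job is done, then return its result."""
        return self._jobs[job_id].result()

    def list_jobs(self):
        """Return the ids of all jobs submitted so far."""
        return list(self._jobs)

    def shutdown(self):
        """Wait for outstanding jobs and release the worker thread."""
        self._executor.shutdown(wait=True)


def slow_add(a, b):
    """Example job that takes a moment to finish."""
    time.sleep(0.05)
    return a + b


manager = ToyJobManager()
job_id = manager.queue_job(slow_add, a=1, b=2)  # returns right away
early_status = manager.get_status(job_id)       # a UI could poll this
result = manager.get_result(job_id)             # blocks until finished
final_status = manager.get_status(job_id)
blocking_result = manager.run_job(slow_add, a=2, b=2)
manager.shutdown()
```

A UI would typically call the non-blocking variant and poll the status (or be notified) so its event loop stays responsive, while a script or notebook can get away with the blocking one.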

makkus commented 4 months ago

Configuration

The basic aspects of a kiara backend can be configured; this is split up into base configuration (KiaraConfig) and runtime_config (KiaraRuntimeConfig). Which configuration options are available can be looked up in the source code, or retrieved via introspection (get familiar with the cli and its --help option to find what you are looking for, any further questions I'm happy to help with as always).

In most cases, configuration should not be necessary as the defaults should be sufficient, but if you have special requirements, then check out the configuration model classes.

makkus commented 4 months ago

Let me know if you think I forgot anything, that's entirely possible.

makkus commented 4 months ago

Also, happy to expand on anything that is unclear.

MariellaCC commented 4 months ago

The backend won't be our product, the frontend(s) will be.

just a quick comment about that: I think this shifted since the early stages of the project, since 2 years ago we defined a user persona of "modules creators" (via community plugins)

(there is a discussion about that here: https://github.com/DHARPA-Project/kiara-website/issues/5)

makkus commented 4 months ago

What I meant is that with just the backend, kiara is fairly use-less. We want people to eventually do research with it. Just writing a module without anyone ever using it obviously doesn't make any sense, so module creators can't be our main persona.

For doing research with kiara, we need to have at least one frontend, even if it's just a very thin one like using the backend via jupyter (and the requirements that this would put on the backend).

makkus commented 4 months ago

And this means that whatever features are important, they need to be defined/arrived at via frontend requirements. Otherwise there is no justification for any of them to exist.

MariellaCC commented 4 months ago

For doing research with kiara, we need to have at least one frontend, even if it's just a very thin one like using the backend via jupyter (and the requirements that this would put on the backend).

Sure this is not contradictory at all, but I think it is important to not forget about these module creators (who know data analysis/statistics python ecosystem and not software engineering python ecosystem, which are 2 very distinct areas of expertise even if both fields are technical/computing related and that Python is a common word)

These users are important since they would bridge the gap for modules that are non existent and needed, as the dharpa team won't have the possibility to anticipate all needed modules.

caro401 commented 4 months ago

How do these somewhat abstract things map to what a user (of any kind) can actually do with kiara? Given a (future, imagined, all-powerful) UI or python expertise, what kind of tasks can I do, or what things can I achieve using kiara?

My best guess from the above is as follows, but as you can tell from the amount of question marks, I really don't have a clear picture of what's currently possible

Does this list cover everything kiara can currently actually do?

And does this cover all the end-user needs that have been identified in the various user research/surveys that have been done?

makkus commented 4 months ago

Ok, again, the main point I'm trying to get across is that I never got 'real' requirements from someone responsible for the frontend experience. I had to 'translate' the end-user needs we collected into a backend design, without knowing about what sits between the end-user and the backend API. I hope it's clear that this is not how things ideally work, but that was out of my control, and this is why the list of features I gave you is fairly generic/abstract. For some of the requirements (notes for example) I found it impossible to come up with an implementation without knowing more about how a frontend intends to use it. Does that make sense?

I can answer all of the questions above, but before that I'd like to make sure we're on the same page about our basic premise:

If we can all agree on that premise in some way, that would be great. Then I'll go through all of the questions above and comment on them, assuming behind all of them is an actual use-case and reason why those things should be possible.

And I'd suggest that we come up with descriptions / wireframes / or whatever details we have about our two (? not sure about the topic-modeling one) mini-apps. We know what data will go into them, and what users want to do with them (users in this case being Caitlin and Lorella/Mariella), and we use that to come up with a list of specs/requirements the backend needs to satisfy?

As I said, it's fairly likely that some of the stuff kiara can do is not necessary at all, and would not be used at all by such a mini-app. Like, aliases, we might not need at all. Sharing workflows/pipelines, I can't really see how that would be important for a mini-app, but if you can show me a usage flow for that, I'm more than happy to implement it or change an existing implementation.

I guess the short of it is that I really need your help with all that, and that I don't have all the answers, and also that we should probably ask ourselves whether we should talk about all the possible features we could have or that exist, or limit our discussion to only what is needed for our specific next goal, the mini-apps. And those should have their own requirements inbuilt, totally independent from what exists atm, right?

makkus commented 4 months ago

Sure this is not contradictory at all, but I think it is important to not forget about these module creators (who know data analysis/statistics python ecosystem and not software engineering python ecosystem, which are 2 very distinct areas of expertise even if both fields are technical/computing related and that Python is a common word)

I feel like I spent a huge amount of time making this easy, writing tutorials on how to create modules targeted at that audience, creating the plugin template to make it easy to get started, investigating ways to make it easier to create a Python env (pixi), etc. To be honest, I'm running out of ideas (and time) to spend a lot more on this, and I was hoping that the docs sub-project could take off some of the load and others could jump in and clean up what already exists. If there is something you want specifically me to do, I'm also happy to do it, but as I said, my own ideas are sort of running out, and I could really need some help there...

These users are important since they would bridge the gap for modules that are non existent and needed, as the dharpa team won't have the possibility to anticipate all needed modules.

Again, I feel like I've always had that in mind, and tried to make that as easy as possible. There is a discussion to be had about the quality of the modules we ship 'officially', but that would be independent of this.

MariellaCC commented 4 months ago

I feel like I spent a huge amount of time making this easy, writing tutorials on how to create modules targeted at that audience, creating the plugin template to make it easy to get started, investigating ways to make it easier to create a Python env (pixi), etc.

there was not a feature request at all in my previous message, this was just to acknowledge the existence of such back-end users, nothing more

MariellaCC commented 4 months ago

And I'd suggest that we come up with descriptions / wireframes / or whatever details we have about our two (? not sure about the topic-modeling one) mini-apps.

I do not understand what you mean by "not sure about the topic-modeling" one ?

MariellaCC commented 4 months ago

Concerning the topic modeling one, I think you saw the initial jupyter notebook, the wireframes, the list of inputs/outputs, the modules roadmap, and you participated in several of the functional previous prototype versions. Could you please explain what would be needed at this stage?

makkus commented 4 months ago

I do not understand what you mean by "not sure about the topic-modeling" one ?

Just that I'm not sure if we're planning to have a topic-modeling mini-app.

makkus commented 4 months ago

Concerning the topic modeling one, I think you saw the initial jupyter notebook, the wireframes, the list of inputs/outputs, the modules roadmap, and you participated in several of the functional previous prototype versions. Could you please explain what would be needed at this stage?

A frontend dev who takes all that and designs/architects the frontend? Decides on the technical details, how it's implemented etc. As I said before, I can't do that.

MariellaCC commented 4 months ago

Just that I'm not sure if we're planning to have a topic-modeling mini-app.

Ah ok, this is something to confirm with @caro401 indeed. If not, I could come up with a streamlit one and/or be available to help if anything is required. At the moment I am preparing the modules, and using this as an opportunity for doc material.

CBurge95 commented 4 months ago

Just weighing in here - we absolutely will still be having a topic modelling mini-app, and all the modules that are written for that plugin can be used for the app as well as all the existing use-cases (i.e. the CLI and Jupyter), so prepping the modules and doing the documentation is hugely valuable work, thank you Mariella!!

The goal is ultimately to have a topic-modelling mini-app and a network analysis / Tropy one, which (in essence) can use the same UI framework, and these will be the plugins/modules that are currently under development. These will also be useable in Jupyter notebooks, just with obviously a little more flexibility in terms of potentially combining plugins / introducing new ones given that it will be outside of a UI framework.

In terms of @caro401 's initial features question, aka at its most basic, what can kiara do for a researcher:

  1. Data can be uploaded into kiara's datastore / be converted into a kiara data type so that kiara can understand what it is and trace it.
  2. Ignoring specifically what the jobs are, it can trace 'inputs' (aka decisions made and parameters given) to given modules, and assigning the 'outputs' (aka the 'new' data) a traceable unique identifier.
  3. Following this process it can be exported as a) a new file type (for example graphml with the network one) and b) the lineage listed.

In all of this kiara acts as a 'wrapper' to the process, tracing and recording the metadata. Markus I understand this will probably be 'higher level' rather than technically accurate but as an overview of kiara from a very basic user point of view (removed from any consideration of API or frontend), this is correct yes? Essentially this is what Caro needs for the mini-app (I believe) so just confirmation or correction on this would be great.

In terms then of building the mini-app(s) what is needed is really:

  1. The python API - this doesn't need any adjustments, what is already there works great going forward and anything in terms of 'requirements' or adjustments will come once we start building the UI and may have certain questions but for the moment, the API that was built or implemented (whichever the correct term is) for allowing kiara to be used in Jupyter is the exact same that is needed for the mini-apps. What is primarily needed in terms of this is just documentation - not detailed in terms of how the API was built (at least, not as a high priority, though this will obviously be needed eventually) but in terms of what is it called to run a job, or the 'get info' on a module. If these are the same as in the notebooks then great, if they are different please link us to where this information can be found and/ or clarify this here (these are only examples, I can't think of anything else off the top of my head but Caro may have more specific ones)

  2. Plugin modules - for Mariella and I to write, assuming that the initial ones (onboarding / tabular) are stable as has been said in the last couple meetings, which can then form the basis for the mini-apps. Wireframes / order narrative etc. already exist in the notebooks / elsewhere in development notes and have for a while.

  3. The mini-apps UI itself - Caro working on this w/ input, using the kiara backend (Markus) and the relevant plugin modules (Mariella / me as in point 2)

  4. (low priority) - after testing / general basic stability we might want to talk to some of the design developers and 'pretty it up' so to speak but this comes much later down the line.

Some of this probably goes off this initial issue raised and also doesn't cover everything (like notes, for example, but we can put a pin in that until the end of February at least) and some of it requires a little further discussion, but for the main part this should set out some idea of what is needed versus what already exists and does or doesn't need further work on.

makkus commented 4 months ago

Markus I understand this will probably be 'higher level' rather than technically accurate but as an overview of kiara from a very basic user point of view (removed from any consideration of API or frontend), this is correct yes?

Yes, exactly, for the purpose of the mini-app that would be a good set of initial requirements/features that kiara can fulfill. No need to complicate it further IMHO.

The python API - this doesn't need any adjustments, what is already there works great going forward and anything in terms of 'requirements' or adjustments will come once we start building the UI and may have certain questions but for the moment, the API that was built or implemented (whichever the correct term is) for allowing kiara to be used in Jupyter is the exact same that is needed for the mini-apps

The API was specifically not built with Jupyter in mind, and that it works is purely coincidental (and hopefully also a bit because its design is good enough). Personally, I'd probably have written another layer for that specific purpose, but since everyone seems happy with its current state I won't bother. So, just for the record: I don't think the requirements for using kiara via Python/Jupyter are remotely the same as the ones needed to develop a UI. But again, since everyone seems happy I'm not going to argue that point any further.

If the API needs to be changed in reaction to new/unforeseen requirements coming from the mini-apps, this might be a problem because how to use it might change, which would have a knock-on effect on the Jupyter usage since that would have to be adjusted as well: docs would get outdated and need updating, and older Jupyter notebooks might not work anymore. I'll obviously put those breaking changes in the changelog, but not everyone reads those, so the experience can still be frustrating for end-users.

assuming that the initial ones (onboarding / tabular) are stable as has been said in the last couple meetings

Not the onboarding ones, I have barely started working on them.

makkus commented 4 months ago

Ah well, might as well go through the list since I can't be bothered to write any more code today. I want to make clear that for a lot of how things should work I don't have an answer myself, it's something I always assumed I'd have some help for figuring things out. Anyway, here goes:

I can import data into kiara, and it (validates? and then) stores it (persistently? where?) with some metadata? with an optional meaningful name?) (but I can't delete it?)

Yes, you can import data. Basically by specifying inputs that refer to 'outside things' to an operation (like a file path/url). Simple data types like bools, integers, strings are imported directly, everything else is basically a byte-array. Since the important bits are the actual bytes (pun not intended), in most cases we are talking about files (either local ones, or remotely downloaded ones). How kiara stores those is abstracted away, and it depends on whether the user decides to 'store' a value or not, if not, depending on the file type the bytes might be stored in a temp file, or in memory. If the user decides to store the value, kiara stores it inside a kiara store (which can either be a folder structure, or a sqlite database -- as I said before this will be documented more in the issue I started about the data export). In addition, kiara also stores the metadata you should have seen by now in some way or other (if not, kiara data explain <value> should give you an idea, also use the --help flag on that to see what more specialized infos exist).

Aliases are just references, meaningful (to the user -- they are choosing them after all), human readable strings that point to a value id. Value ids are globally unique. Aliases can be overwritten, so the same alias (string) might point to a different value id depending on when you look. Those aliases are also stored inside a different store -- an alias store. As I said, it's not clear to me whether a frontend would need aliases or not, it's certainly possible to create a UI without those, but it really depends on how you plan to guide the user, how you partition the UI, how/whether you want to provide data management, previews, etc. It's there if you need it, but you can also just ignore it.
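
As a rough illustration of the alias behaviour just described (a name that points at a value id and can later be pointed somewhere else), here is a toy alias store. `ToyAliasStore` and its methods are hypothetical and only model the concept; they are not the alias-related endpoints of the kiara API.

```python
import uuid


class ToyAliasStore:
    """Hypothetical alias store: human-readable names pointing at value ids."""

    def __init__(self):
        self._aliases = {}

    def set_alias(self, alias, value_id):
        """Point an alias at a value id, replacing any earlier target."""
        self._aliases[alias] = value_id

    def resolve(self, alias):
        """Return the value id the alias points to right now."""
        return self._aliases[alias]

    def aliases_for(self, value_id):
        """Return all aliases that currently point to the given value id."""
        return sorted(a for a, v in self._aliases.items() if v == value_id)


store = ToyAliasStore()
raw_id, cleaned_id = uuid.uuid4(), uuid.uuid4()

store.set_alias("my_table", raw_id)
before = store.resolve("my_table")

# Overwriting the alias does not touch either value, only the reference.
store.set_alias("my_table", cleaned_id)
after = store.resolve("my_table")
```

The value ids stay valid the whole time; only what the name `my_table` refers to changes, which is why an alias on its own is not a stable reference.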

Deleting aliases would be possible, it's not implemented, but I can do that quickly if there is a requirement. Just didn't need it so far. Deleting values that have been 'stored' into a store is different, it's a fairly complex problem that I didn't have time to tackle yet, and again I'd probably like to know more about how frontends deal with data and present it to users before I finally get to it. It's difficult because I'm not sure what to do with other values that are before or behind the value that should be deleted in the lineage graph. Again, so far I didn't have a usage situation where this was necessary, and with the upcoming export feature it might become even less important. But again, not sure, it would really depend.

I can use operations to make changes to this data, and the changed data is stored (non-destructively to original data?, with optional meaningful names?) and the changes are tracked (in lineage which I can view?)

No, as I've tried to stress before, data within kiara can never be changed. You can only ever create new values, using old ones (as inputs) plus an operation (a 'configured module'). kiara tries to be smart about how it actually stores the bytes, to avoid storing the same byte-sequences over and over again. This works sometimes better, and sometimes worse, but that's nothing a non-backend dev would need to worry about (yet, anyway, if our storage needs get out of hand we might have to look into it). Again, aliases are tangential, and not necessary for any of this to work, they are just human readable references that point to a value id (at that point in time -- since the aliases could point to something else in the future). As I said above, if values are stored, kiara also stores the metadata of that value, which includes something I call 'pedigree' (basically the direct parent op & input values). This is then used to construct the lineage internally, basically gluing together all the pedigrees until a value doesn't have an op that produced it, but was imported directly.
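
The 'gluing together pedigrees' idea can be sketched as a small recursive walk. Everything here is a hypothetical model for illustration: the `Pedigree` class, the `PEDIGREES` mapping, the value ids and the operation names are made up, and kiara's internal data structures are certainly richer than this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pedigree:
    """Hypothetical record: the operation and direct inputs that produced a value."""

    operation: str
    inputs: tuple  # ids of the direct parent values


# value id -> pedigree; None marks a value that was imported directly.
PEDIGREES = {
    "raw_file": None,
    "table": Pedigree("example.create_table", ("raw_file",)),
    "stopwords": None,
    "filtered": Pedigree("example.filter_table", ("table", "stopwords")),
}


def lineage(value_id, pedigrees):
    """Glue pedigrees together recursively until only imported values remain."""
    pedigree = pedigrees[value_id]
    if pedigree is None:
        return {"value": value_id, "imported": True}
    return {
        "value": value_id,
        "operation": pedigree.operation,
        "inputs": [lineage(parent, pedigrees) for parent in pedigree.inputs],
    }


def ancestors(value_id, pedigrees):
    """Return the ids of all ancestor values, nearest first, without duplicates."""
    seen = []
    pending = [value_id]
    while pending:
        current = pending.pop(0)
        pedigree = pedigrees[current]
        if pedigree is None:
            continue  # imported value: nothing further up the graph
        for parent in pedigree.inputs:
            if parent not in seen:
                seen.append(parent)
                pending.append(parent)
    return seen


filtered_lineage = lineage("filtered", PEDIGREES)
filtered_ancestors = ancestors("filtered", PEDIGREES)
```

Note that nothing in this model ever modifies an existing entry: a new value just gets a new id and a pedigree pointing back at the values it was created from.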

I can export the data (and lineage? metadata?) when I've made changes to my data (coming soon?)

Again, refer to the issue I started a few days ago, once I have a release I'll update this. No point duplicating the info, and it's not 100% fleshed out anyway. But since you can't change data in kiara this might not be relevant anyway?

I can write my own plugins to provide operations and data types not included in core kiara, if I know python

Yes.

I can share workflows/pipelines/bits of data? with colleagues/myself on a different computer?

Well, a pipeline is just a yaml/json file, so yes, you can share that yaml file. Not sure why we would need that at this point, given our focus on jupyter and the mini-apps, but it's easy to do. The tutorial I've written about pipelines should give a good base for more detailed questions if necessary.
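
To illustrate why sharing is easy when a pipeline is plain data, here is a small round trip through JSON. The field names (`pipeline_name`, `steps`, `step_id`, `module_type`, `input_links`) and the module types are assumptions made for this example, so check the pipeline tutorial for the actual schema; `dangling_links` is a hypothetical helper, not part of kiara.

```python
import json

# Illustrative pipeline description; field names and module types are assumptions.
pipeline = {
    "pipeline_name": "example_pipeline",
    "steps": [
        {"step_id": "import_file", "module_type": "example.import_file"},
        {
            "step_id": "create_table",
            "module_type": "example.create_table",
            # Connect this step's 'file' input to the output of the first step.
            "input_links": {"file": "import_file.file"},
        },
    ],
}


def dangling_links(config):
    """Return input links that refer to a step id that does not exist."""
    step_ids = {step["step_id"] for step in config["steps"]}
    dangling = []
    for step in config["steps"]:
        for target in step.get("input_links", {}).values():
            source_step = target.split(".", 1)[0]
            if source_step not in step_ids:
                dangling.append(target)
    return dangling


shared_text = json.dumps(pipeline, indent=2)  # what you would send to a colleague
received = json.loads(shared_text)            # what they load on their machine
problems = dangling_links(received)
```

Because the description survives serialization unchanged, the file can be sent around by any means; the receiving side still needs the plugins that provide the referenced modules.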

Not sure how you define workflow, we've used that word on different occasions, and I've been guilty of using it to mean different things over time as my own picture of it evolved and I played around with a few ideas briefly. But there is no 'official' workflow feature atm, as far as I am aware. Again, this would probably be shaped by requirements coming from a frontend design decision. Sometimes I refer to a 'workflow (or kiara-) session', by which I mean the time between a kiara context being spun up and the process it lives in ending. But that's just for technical stuff.

I can create different contexts in kiara, which let me store data in different places? isolated from other data? (why would I want to do this?)

Yes, sorta. That's just one of those accidental features I needed for dev work (separating test data, etc.). I can flesh it out with little additional work if there is a requirement, but this is not something that came from user stories or something like that. Ignore it if you don't need it.

I can configure kiara if I have particular needs (to be different in what kinds of ways?)

Again, this is something I need for dev work, and can be fleshed out if we come across something we want users to be able to change. Unless you have any ideas what that would be, you can ignore it and just keep in the back of your head that there is the option to implement something in that area. There are no user-stories related to that as far as I'm aware.

I can ask kiara things about itself, its configuration and the things it knows about? (why is this useful?)

Yes, and again, whether it's useful or not totally depends on how you decide to implement your frontend. Querying configuration will probably never be useful for a mini-app, but why it's useful to list all operations that it can execute should be fairly obvious, I guess. But really really depends, can't stress that enough. For the streamlit prototype I tried to implement the frontend in a way that it doesn't take anything for granted (which modules/operations are available, which data types, how to render the data types, ...), and does things by investigating the internal kiara objects via the API. If you want to hard-code all the modules/operations you want to use like I think was talked about, it's probably not useful at all, and you wouldn't need most of the endpoints related to that.

This is a sort of low-level feature, that is never directly referenced by user-stories (apart from some doc-related ones maybe), but I consider it a technical means to implement the higher-level features that actually come from user-stories. I might be wrong here, but I'm cautiously confident that I'm not.

I can run operations/"jobs" asynchronously (and therefore multiple operations/"jobs" at once?), but it's complicated to do?

Dunno about complicated. I'd need some actual usage patterns from a real frontend that wants to do things concurrently, but I've prepared some abstractions so that it'll be possible. kiara itself is not thread-safe, but running jobs concurrently would be possible if needed. The more important pattern I've prepared for is non-blocking job submission (queue_job instead of run_job), which I think would be a fairly likely way a frontend would like to interact with a job execution backend. So stuff like display progress/current status etc. is possible, and doesn't block the rest of the event loop. Of course a frontend could just decide to run the kiara context in a thread, but I figured that's probably more complicated. Either way, there is the option.

I can (/will be able to?) make notes on what i did in kiara and why

This is something I'm not sure about at all, so you'll need to ask someone else. I'd need some actual specs from a frontend for what this would look like and how it would work, what the notes would be attached to, how editing notes would work, and quite a few other things. So, that would be for you to figure out, but as soon as I get a description (and/or detailed specs), I'm confident I can implement it.

Does this list cover everything kiara can currently actually do?

There are a lot of small features that came up while I developed the more overarching features Caitlin summarized (and which I think are the important ones for us atm). Most of those other smaller things are necessary for development, testing, or some experiments I did but small enough so I did not want to confuse/overload other people unless they become necessary because of a 'real-world' requirement. I didn't throw some of those half-baked features out because I figured there might be a chance they would become relevant in which case I'd continue the work (the API contains a few endpoints like that -- but they would be marked in the code docs).

I guess the more important question for me would be: is there anything else you need for your mini-app? If you have any requirement, just let me know, and either I'll write you some example code how to do it (if already possible), or implement the thing if I can, or tell you why it's not possible or why I have concerns...

MariellaCC commented 4 months ago

I can write my own plugins to provide operations and data types not included in core kiara, if I know python

Yes.

Just a quick comment following our discussions during today's dev meeting. I think that it may be important to differentiate software engineers from module writers here, as both categories can be considered as people who know Python. However, their Python usage falls into two distinct areas of expertise within the Python ecosystem. It is not necessarily about being less/more technical, but it is about areas of the Python ecosystem, which might be more geared towards data analysis/statistics/data science/research for module writers and software engineering/architecture and so on for software engineers. This does not mean that some users don't have both areas of expertise but this may be less frequent given the amount of knowledge required for each of these fields.