DHARPA-Project / kiara-website

Creative Commons Zero v1.0 Universal

Internal kiara ids, hashes, and job caching #36

Open makkus opened 4 months ago

makkus commented 4 months ago

This is a bit more low-level documentation, but probably useful for someone developing a non-trivial frontend, and other features like notes etc. This is just an overview of some of the strategies and issues I've been dealing with. There are some unanswered questions in all this, so don't treat this as 'that's how we do things now and in the future'. It's really just some background info that explains how some things work, why some things are difficult, and what questions are still open. It's also fairly technical, so feel free to ask questions so I can clarify what is unclear. Or just ignore all of this if you can't see any value in knowing about it.

Internally, kiara uses globally unique ids (of type uuid4 -- I'm still wondering whether uuid7 might be a better fit, so I'm interested to hear opinions) to identify values, as well as jobs.

Values

Those are the main entities around which most of kiara revolves. kiara needs a way to refer to a specific piece of data across time. Variables are not good enough, because they a) can get overwritten, and b) do not persist across sessions. Every value gets an id assigned when it is created. Since I'm using uuids, we can be reasonably sure that no id clash will ever happen, and that the id will be globally unique across all invocations of kiara in the history of mankind. This is an important assurance.
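As a minimal sketch (not kiara's actual `Value` class, which carries much more metadata), a value with a uuid4 identity could look like this:

```python
import uuid
from dataclasses import dataclass, field

# Hypothetical, simplified sketch of a value carrying a globally unique id.
@dataclass
class Value:
    data: object
    # uuid4 ids are random, so clashes are practically impossible,
    # even across separate kiara invocations on different machines
    id: uuid.UUID = field(default_factory=uuid.uuid4)

a = Value(data=5)
b = Value(data=5)  # same data, but a brand-new identity
assert a.id != b.id
```

Note how two values wrapping identical data still get distinct identities; that distinction is exactly what the data hashes below do not provide.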

Data hashes

Each value can also be identified by its data hash. This is a (reasonably collision-resistant) hash computed over the binary-serialized form of the data, and should in all practical cases also be unique for any given dataset. We can't use the hash alone as a value id, because one of the most important things the Value type does is wrap the actual data and store metadata alongside it. Two pieces of data with the same hash can have differing metadata.

Imagine an add module with two integer inputs. If it gets run with '1' and '4' the result is '5'; if it gets run with '2' and '3' the result is also '5'. Both '5's have the same hash, but different pedigrees (the module used to create the value + its inputs). If we only used the hash of a dataset as its id, we would not be able to handle that situation.
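To make that concrete, here is a toy version of the add example, using sha256 over JSON-serialized data as a stand-in for kiara's real hashing scheme:

```python
import hashlib
import json
from dataclasses import dataclass

# Illustrative stand-in for kiara's hashing (not the actual scheme):
# hash the serialized data, keep the pedigree as separate metadata.
def data_hash(data) -> str:
    serialized = json.dumps(data).encode("utf-8")
    return hashlib.sha256(serialized).hexdigest()

@dataclass
class PedigreedValue:
    data: object
    pedigree: dict  # the module + inputs that produced this value

r1 = PedigreedValue(5, {"module": "add", "inputs": {"a": 1, "b": 4}})
r2 = PedigreedValue(5, {"module": "add", "inputs": {"a": 2, "b": 3}})

assert data_hash(r1.data) == data_hash(r2.data)  # identical data hash
assert r1.pedigree != r2.pedigree                # different provenance
```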

Nonetheless, hashes are useful. For example, if we want to store a result that has the same hash as a previously stored result, we don't need to store it again, and can save on disk space (what actually happens in kiara is more complicated, but this is a good enough way to think about it).
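A toy content-addressed store illustrates the de-duplication idea; the class and its layout are purely illustrative, not kiara's storage design:

```python
import hashlib
import json

# Toy content-addressed store: each distinct blob of data is stored once
# per hash, while any number of value ids can point at the same bytes.
class DataStore:
    def __init__(self):
        self._blobs = {}   # data-hash -> serialized bytes
        self._values = {}  # value-id  -> data-hash

    def store(self, value_id: str, data) -> None:
        blob = json.dumps(data).encode("utf-8")
        h = hashlib.sha256(blob).hexdigest()
        # identical data is only written once, saving disk space
        self._blobs.setdefault(h, blob)
        self._values[value_id] = h

store = DataStore()
store.store("value-1", 5)  # result of add(1, 4)
store.store("value-2", 5)  # result of add(2, 3)
assert len(store._blobs) == 1   # one stored blob ...
assert len(store._values) == 2  # ... referenced by two value ids
```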

Job records

Every time kiara runs an operation, it gives the job an id (also uuid4), executes the operation at its leisure, stores the result (either in a temp file, or in memory), and records metadata about that operation. Some of it is not technically necessary, like the time the job was run, how long it took, or some (optional) log output. The one important bit kiara needs is the job id. When you run an operation you get back the job id. You can do other stuff in the meantime, but at some stage you'll wonder how that job went. Then you can use the job id to ask kiara whether it has finished in the meantime, whether it was successful, and to give you the results. The API endpoint run_job hides all that from you; under the hood the queue_job endpoint is used, and that's the one that returns the job id.
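The relationship between run_job and queue_job can be modelled with a minimal, synchronous sketch. The method names mirror the endpoints mentioned above, but the bodies are my own simplification, not kiara's actual code:

```python
import uuid

# Minimal, purely illustrative model of the queue_job / run_job split.
class JobManager:
    def __init__(self):
        self._results = {}  # job-id -> result

    def queue_job(self, operation, **inputs) -> uuid.UUID:
        job_id = uuid.uuid4()
        # a real implementation would execute asynchronously; here we run
        # synchronously and just record the result under the job id
        self._results[job_id] = operation(**inputs)
        return job_id

    def job_result(self, job_id: uuid.UUID):
        return self._results[job_id]

    def run_job(self, operation, **inputs):
        # convenience wrapper: queue the job, then fetch its result
        return self.job_result(self.queue_job(operation, **inputs))

mgr = JobManager()
job_id = mgr.queue_job(lambda a, b: a + b, a=1, b=4)
assert mgr.job_result(job_id) == 5
assert mgr.run_job(lambda a, b: a + b, a=2, b=3) == 5
```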

Unless you store a result value, kiara will not remember job records from previous sessions (even if it knew it had run the job before, it would no longer have access to the result values, nor possibly even to the inputs of the job; after all, those are not stored). But if you do store a value, kiara will make sure to store all job records for everything in its lineage. Again, this is not technically necessary, because the value pedigrees/lineage are stored in the data store as well, and the pedigree part of the value and job metadata is the same, but job records also store some additional information that can potentially be useful later (along with what I already mentioned above):

- details about the runtime environment the job ran in (e.g. the installed Python packages)
- a hash computed from the module (incl. its configuration) and the ids of the input values
- a hash computed from the module (incl. its configuration) and the data hashes of the input values

Ignoring the first one, the other two are useful, because we can use them to look up whether a job has already run, and potentially not run it again but just re-use the result from that previous run. Just before we start a job, we can compute either of those two hashes and compare it against a database; if we find a match, we can decide to skip execution and just return the value ids of values that are already stored in the kiara data store. Whether to use one hash or the other is a decision with some impact, which I will outline further down below.
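As a hedged sketch of that lookup, job matching could work roughly like this; the hash recipe and cache layout here are my assumptions, not kiara's internals:

```python
import hashlib
import json

# Deterministic job hash from a module name plus its inputs (here:
# input value ids, but input data hashes would work the same way).
def job_hash(module: str, inputs: dict) -> str:
    payload = json.dumps({"module": module, "inputs": inputs}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

job_cache = {}  # job-hash -> result value ids

def run_or_reuse(module, inputs, execute):
    key = job_hash(module, inputs)
    if key in job_cache:
        return job_cache[key]  # skip execution, reuse stored result ids
    result_value_ids = execute()
    job_cache[key] = result_value_ids
    return result_value_ids

first = run_or_reuse("add", {"a": "id-1", "b": "id-2"}, lambda: ["id-3"])
second = run_or_reuse("add", {"a": "id-1", "b": "id-2"}, lambda: ["id-99"])
assert first == second == ["id-3"]  # second call matched the cache
```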

There is a bit of additional complication to all this. For one, what happens if an input to one of your jobs is a path to an external file (a local path, or a url)? One problem is that if you run that same job twice, kiara will always issue a new value id for the string input that contains the path/url. That rules out using the job hash that contains the input-value-ids to compare against previously run jobs. If we registered/stored that string in the kiara data store in a separate step, and then used the reference (value-id) to that value as input to our job, then we could run that job as often as we liked and would always get the same hash.

The other, more difficult problem to solve is this, though: we can never be sure that the content of the file we are using as input hasn't changed in the meantime. That means even if we stored the path input in a store beforehand, we could end up in a situation where kiara does not run the same job again and returns an old/outdated result, because the user changed the content of the input in the meantime. One solution would be to register/store not the path to the file in the kiara store, but the actual file (incl. its byte content). That way, if we use the value id of the registered file, we can always be sure it hasn't changed, since nobody but kiara has access to it.

Circling back to using a string as input: if we did that, and ran the same job multiple times, and the content of the file hasn't changed, we would get different value ids for all the input fields across those jobs, but the value hashes of all inputs would be the same. Similarly, we would also get different value ids for the job's result values, but again, the hashes of the result values would be the same. Which means that the hash described in the 3rd item in the list above (the one containing the 'input-data-hashes') would also be the same (nothing differs here between runs), and could be used to save us from running the job in the first place. We do need to calculate the hashes of all input file(s) first (which is why we'd otherwise prefer the hash that uses input-value-ids), which is a bit of work, but overall this could work for the purpose of saving on compute time.
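A tiny illustration of this point (the file path is made up, and the registration function is hypothetical):

```python
import hashlib
import uuid

# Two runs registering the same path string: the value ids differ every
# time, but the data hash stays stable, so a job hash built from input
# data hashes would match across runs while one built from value ids
# would not. Purely illustrative.
def register(data: str):
    value_id = uuid.uuid4()  # a fresh id is issued on every run
    data_hash = hashlib.sha256(data.encode("utf-8")).hexdigest()
    return value_id, data_hash

id_1, hash_1 = register("/data/corpus.csv")
id_2, hash_2 = register("/data/corpus.csv")
assert id_1 != id_2      # a value-id-based job hash would NOT match
assert hash_1 == hash_2  # a data-hash-based job hash WOULD match
```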

Assume we do this, and have a job that matches based on input hashes. This opens up a question: for the result values, should we re-use the value ids we have on record for the result and return those, or create new value ids, update the time the value was created, and copy most of the other metadata from the job record? The former would make job-matching easier in the future, because we'd come across more cases where the hash derived from the input-value-ids matches; the latter would contain the right timestamp, and would allow attaching different metadata to both values (since we have two different identifiers we can attach stuff to).

We have to deal with the fact that the value ids differ across runs, though, and I could imagine that having a potential impact on some frontend features. What if you want to attach some piece of metadata to a result value, but the next time you run the same job, that value id is different? What do you attach the metadata to, so you can find it again?

So, one of the problems we face is that there is no 'strategy to rule them all': should we use value ids for the job hashes, or input data hashes, or just ignore all of that, take the benefit of de-duplicating the stored data, and not cache job runs at all? When should we use which strategy? Also, if we don't always do the same thing, this is probably something users need to be made aware of on a case-by-case basis, since otherwise they'd get arbitrary behaviour (from their point of view).

Another thing I haven't even started to strategize about: what should happen when we update plugin packages? The job record (as well as the value metadata) contains information about the environment it ran in (e.g. all Python packages), so we can see whether that changed. In this case, should we never re-run cached jobs, or always? The module contained in the plugin could have changed its process method, or an updated dependency of the plugin could change some behavior, but in a lot of cases that probably won't have happened. Theoretically we could hash each kiara module Python class and store that info, but that would not really help, because the changed behavior could originate in a utility method somewhere else. Or we could assign versions to each kiara module, which a module developer assigns manually, to indicate to kiara that if it encounters a higher version number, it should re-compute. This last option would probably be the best way of doing it, but it would require module developers to be aware of this issue.
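The manually-assigned-version idea could be sketched like this; the `_version` attribute and the cache key recipe are hypothetical, not an existing kiara mechanism:

```python
import hashlib
import json

# Include a developer-maintained module version in the job cache key,
# so bumping the version invalidates all previously cached job records.
class AddModule:
    _version = 2  # the developer bumps this when behavior changes

def job_cache_key(module_cls, inputs: dict) -> str:
    payload = json.dumps(
        {
            "module": module_cls.__name__,
            "version": module_cls._version,
            "inputs": inputs,
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

key_v2 = job_cache_key(AddModule, {"a": 1, "b": 4})
AddModule._version = 3  # behavior changed -> old cached jobs no longer match
key_v3 = job_cache_key(AddModule, {"a": 1, "b": 4})
assert key_v2 != key_v3
```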

For now, I have disabled job matching completely, because that provides the least surprising/most consistent user experience. But I'm still experimenting with the different options, and with ways to make this potentially configurable via the frontend, on a job-by-job basis. That means that with every job you run you get different value ids for the results, always, even if you run the same job twice in a row without doing anything in between. So, that's the one thing to be aware of, for now.

Another open question is whether to add job records to the exported kiara archives. There are pros and cons here as well, but at the moment I'm leaning towards including them.