DHARPA-Project / kiara-website

Creative Commons Zero v1.0 Universal

What are the future plans in terms of data types? #17

Open MariellaCC opened 7 months ago

MariellaCC commented 7 months ago

Would it be possible to elaborate on the future plans regarding data types in Kiara @makkus ?

From what I understood in the glossary, Kiara currently supports computational data types, and from past discussions there may be plans to support other data types as well (e.g. "network data" as a specific data type, or maybe a "corpus" data type).

Will there be a possibility to also incorporate statistical data types, maybe in the form of an annotation possibility on "columns" (this is just an idea)? Statistical data types are essential to choosing relevant statistical operations for data analysis. And also, for the choice of data visualisations that support the analysis. So I was wondering if this is something that would be potentially foreseen.

makkus commented 7 months ago

Not sure what you mean by 'computational data type' vs 'other data types'. Maybe let me explain a bit about kiara data types and the network data example. I intend to document all this properly, but I'd prefer to do this after we have a first structure of our new docs ready.

kiara manages its own data types (basically a layer above Python types), and those are the sort of 'glue' between modules (the output of one module would go as input into another), so they need to be well specified, serializable, cachable, hashable, etc.: characteristics we can't assume from a random Python type. Basically, a kiara data type is an information container which describes/guarantees the 'shape' (or information structure, not necessarily 'data structure') of the data it contains (not sure how I can explain it better), and kiara data types have a few well-defined features that can be queried by kiara internally. That's why, for example, a kiara table can be used ('exported', 'manifested') as an Arrow Table, a Pandas DataFrame, or a polars DataFrame. It sort of encapsulates the 'tabular' feature; in this case that means a kiara table is a list of columns, where each column has a name, a data type, a length (number of rows, which must be the same for every column), etc. A csv file, for example, could not be represented as a kiara table because it lacks explicit types for each column. But it can be exported from one (since we don't add new information, only lose some, meaning no user input is needed).
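To make the 'list of typed columns' idea a bit more concrete, here is a minimal sketch in plain Python. It is not kiara's implementation: `Column`, `TypedTable` and `to_csv` are hypothetical names made up for illustration. It only shows the guarantees described above (every column has a name and an explicit type, all columns have the same length) and why a csv export loses information.

```python
import csv
import io
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: str
    dtype: type      # explicit type of every value in the column
    values: tuple

    def __post_init__(self):
        # Guarantee: every value matches the declared column type.
        for v in self.values:
            if not isinstance(v, self.dtype):
                raise TypeError(f"column {self.name!r}: {v!r} is not {self.dtype.__name__}")


@dataclass(frozen=True)
class TypedTable:
    columns: tuple

    def __post_init__(self):
        # Guarantee: all columns have the same number of rows.
        lengths = {len(c.values) for c in self.columns}
        if len(lengths) > 1:
            raise ValueError(f"columns differ in length: {sorted(lengths)}")

    @property
    def num_rows(self):
        return len(self.columns[0].values) if self.columns else 0

    def to_csv(self):
        # The export only drops information (the column types), so no user input is needed.
        buf = io.StringIO()
        writer = csv.writer(buf)
        writer.writerow([c.name for c in self.columns])
        writer.writerows(zip(*(c.values for c in self.columns)))
        return buf.getvalue()


table = TypedTable((
    Column("label", str, ("a", "b")),
    Column("count", int, (3, 5)),
))
csv_text = table.to_csv()

# Reading the csv back yields strings only: the 'int' type of 'count' is gone.
rows_back = list(csv.reader(io.StringIO(csv_text)))
```

Going the other way (csv to table) would need someone to say which type each column has, which is exactly the user input the export direction avoids.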

Will there be a possibility to also incorporate statistical data types, maybe in the form of an annotation possibility on "columns" (this is just an idea)? Statistical data types are essential to choosing relevant statistical operations for data analysis. And also, for the choice of data visualisations that support the analysis. So I was wondering if this is something that would be potentially foreseen.

Which data types we would implement would depend on which data types are required to make a set of related modules play well together. It's part of designing the ecosystem of data types and modules within a kiara plugin, like for example the 'network_data' data type went through 3 or so iterations to the form it has now. It went something like this (not complete):

At first, we had a 'graph structure' (I think the data type was called 'network_graph' instead of 'network_data'). Then we realized that graphs always come with a specific graph type, but often we did not want to be dependent on that. It turned out that a list of edges and a list of nodes is a better 'information container' for our case, because we can interpret it as any of the possible graph types later on. There are still a lot of complications with this, but most of them come down to documenting well what each interpretation means. Also, in most cases we need the user to tell us how to interpret the data (which graph type do you want?), and in terms of usability it's better to create the generic information container first (because it requires less user input, and in some cases we can do it without any), and then have each module that uses 'network_data' as input either ask the user which graph type they want in this instance or, where it's clear to the module which type it needs, skip the question entirely, which is even better because we don't have to render an input field.
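As a rough illustration of the 'generic container first, interpretation later' idea, here is a small sketch. The `NetworkData` class and the two `as_*` functions are hypothetical and not taken from the network_data plugin; the point is only that the same nodes and edges can be read as different graph types by whichever module consumes them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NetworkData:
    """Generic container: just nodes and edges, no graph type baked in."""
    nodes: tuple   # node ids
    edges: tuple   # (source, target) pairs


def as_directed(data):
    """Interpret every edge as source -> target."""
    adjacency = {n: set() for n in data.nodes}
    for source, target in data.edges:
        adjacency[source].add(target)
    return adjacency


def as_undirected(data):
    """Interpret every edge as a two-way connection."""
    adjacency = {n: set() for n in data.nodes}
    for source, target in data.edges:
        adjacency[source].add(target)
        adjacency[target].add(source)
    return adjacency


data = NetworkData(nodes=("a", "b", "c"), edges=(("a", "b"), ("b", "c")))
directed = as_directed(data)      # e.g. a module that knows it needs a directed graph
undirected = as_undirected(data)  # e.g. a module that asked the user and got 'undirected'
```

Creating `data` needs no decision about the graph type; that decision moves to the point where a module actually needs it.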

Initially, I stored the network data in a sqlite database, which made it easy to query some simple relationships, and export it into networkx graphs. It was very convenient to use. Then we started to think about the 'extract.components' modules, which would take a network graph as input, and return a list of node ids (a kiara 'list' data type) that represented the largest component. Then we figured out that we might also be interested in the rest of the nodes, so we returned a 2nd list of node ids. Then we implemented a module that created a new network_data instance from the list of node ids and the original network graph, and it was nice that this would work for both outputs. But still it became clear that we'd almost always do that, because a list of node ids is just not very useful by itself.

So we changed the module to output two network_data outputs, which made the 'recreate from node ids' module unnecessary, and pipelines/lineages had one step less (always a good thing).
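A sketch of what 'two network_data outputs instead of two lists of node ids' could look like. This is an assumption-laden illustration, not the code of the 'extract.components' modules: the function name, the dict layout, and the choice to treat edges as undirected when finding components are all made up for the example.

```python
from collections import deque


def split_largest_component(nodes, edges):
    """Return (largest, rest), each a dict with its own 'nodes' and 'edges'."""
    neighbours = {n: set() for n in nodes}
    for source, target in edges:
        neighbours[source].add(target)
        neighbours[target].add(source)

    # Breadth-first search to collect connected components (edges treated as undirected).
    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        component, queue = {start}, deque([start])
        seen.add(start)
        while queue:
            for nxt in neighbours[queue.popleft()]:
                if nxt not in seen:
                    seen.add(nxt)
                    component.add(nxt)
                    queue.append(nxt)
        components.append(component)

    largest = max(components, key=len) if components else set()

    def subset(keep):
        # Build a complete network value, not just a list of node ids.
        return {
            "nodes": [n for n in nodes if n in keep],
            "edges": [(s, t) for s, t in edges if s in keep and t in keep],
        }

    return subset(largest), subset(set(nodes) - largest)


largest, rest = split_largest_component(
    nodes=["a", "b", "c", "d", "e"],
    edges=[("a", "b"), ("b", "c"), ("d", "e")],
)
```

Both outputs can go straight into any module that takes network data as input, so no separate 'recreate from node ids' step is needed.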

We also had other modules that did stuff to network_data, and in a lot of cases they would just add one column to either the node or edge list. And it became clear that this would mean we'd store a lot of duplicate data in the kiara data store, because sqlite databases can't be de-duplicated very well.

So I decided to drop sqlite and use Arrow tables (which we use in a lot of other kiara modules). Arrow tables can be de-duplicated very well: if we, for example, add a column to an (arrow/kiara) table, internally the only thing that gets added is the new column, and the result table can be assembled on the fly from the old table's columns and the new one. In the beginning it was not clear that our network_data plugin would have a lot of modules that fit that pattern, which is why I went with sqlite. Once that became clear, it made sense to refactor everything, but we needed the context of the usual patterns of how data flows within that domain to figure that out. In addition, it now becomes very easy to use either the edges or the nodes table with any of the modules that take a 'table' as input (query.sql for example); all that is needed is a helper module that 'slices' off the required table from the network_data value.
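The de-duplication principle can be sketched with a tiny content-addressed column store. This is only an illustration of the idea, assuming columns are stored individually and keyed by a hash of their content; it does not show how Arrow or the kiara data store actually do it, and `ColumnStore`, `save_table` and `load_table` are hypothetical names.

```python
import hashlib
import json


class ColumnStore:
    """Stores each distinct column once, keyed by a hash of its content."""

    def __init__(self):
        self.blobs = {}

    def put(self, values):
        payload = json.dumps(values).encode("utf-8")
        key = hashlib.sha256(payload).hexdigest()
        self.blobs.setdefault(key, payload)   # identical content is stored only once
        return key

    def get(self, key):
        return json.loads(self.blobs[key])


def save_table(store, table):
    """A stored table is only a mapping of column name -> content hash."""
    return {name: store.put(values) for name, values in table.items()}


def load_table(store, manifest):
    """Assemble the table on the fly from the stored columns."""
    return {name: store.get(key) for name, key in manifest.items()}


store = ColumnStore()
edges = {"source": ["a", "b"], "target": ["b", "c"]}
manifest_v1 = save_table(store, edges)

# A module adds one column; the two existing columns are not stored again.
edges_with_weight = {**edges, "weight": [1.0, 2.5]}
manifest_v2 = save_table(store, edges_with_weight)
```

After both saves the store holds three columns, not five, and both versions of the table can still be loaded in full. A single sqlite file per value does not decompose like this, which is the duplication problem described above.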

I expect we'll have similar learning curves in any of the other plugins we'll create, and as I've said before, designing the modules so they play nice with each other, and esp. the data types involved, is often the hardest thing. Implementing the module 'process' method later is child's play by comparison.

Sorry for the long story, I hope it does make sense somewhat, happy to elaborate further. It's basically an architectural problem, not a technical one.

So, short answer: yes, we can add any custom data types we want. But we probably want to be really smart about it, and have a good idea what they should guarantee in terms of features, information structure, etc.

Just a guess, but I'd wager that some sort of statistical data types will be a good idea. I'm not sure what you mean by 'annotation'; I guess the question will be whether those will just be tables with a bit more metadata (in which case it might make sense to make tables themselves a bit more flexible to support that use-case), or sufficiently different to warrant their own data type (incl. their own data-type name within kiara, which would be displayed to users when we ask for input, for example). We'll figure that out once we have a set of workflows where those sorts of data types would make sense.

MariellaCC commented 7 months ago

Thanks @makkus for this helpful info.

'computational data type'

I meant integer, string and so on.

kiara manages its own data types (basically a layer above Python types), and those are the sort of 'glue' between modules (output of one module would go as input into another), so they need to be well specified, serializable, cachable, hashable, etc.

This is very clear, I understand.

I hope it does make sense somewhat

yes, thank you

I think that we should be aware that there may be a risk of confusion for some users who are used to doing data/statistical analysis. In that context, a data type often describes the individual "columns" of a table (also called variables or attributes, or features in machine learning), as opposed to the rows of a table (also called observations or records, or instances in ML). The data type is meant as both the computational data type of the column (string, integer and so on) and its statistical data type (numeric or categorical, discrete or continuous, dichotomous, and its place on the nominal/ordinal/interval/ratio scale). The explicit description of these data types (computational and statistical) is often the first step of any standard data analysis, as it determines some of the actions that need to be done afterwards, among them the kind of statistical analysis that can be performed (e.g. Chi-square; here is an example of a Chi-square test in a DH publication: https://journalofdigitalhistory.org/en/article/JJszM3GwAYDs?idx=0&layer=hermeneutics&lh=679&pidx=0&pl=narrative&y=118 ) and the choice of relevant data visualisations for a given data set, since data visualisation is often considered an essential part of data analysis.

So what I am trying to say is: 1) we may need to disambiguate the notion of data types for some of our users as they may be used to the data types as meant in a data analysis context 2) I think that if we provide a way to facilitate data description for users who start their analysis by a data description, exactly as you said by maybe offering a possibility to users to add metadata somehow, there may be some value in offering such a possibility.
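To illustrate what such a data description could look like, here is a small sketch of per-column metadata that records both the computational and the statistical data type. Everything in it is hypothetical (the `Scale` enum, the `annotations` layout, the column names); it is not an existing or planned kiara feature, only one possible shape for the 'annotation on columns' idea.

```python
from enum import Enum


class Scale(Enum):
    NOMINAL = "nominal"
    ORDINAL = "ordinal"
    INTERVAL = "interval"
    RATIO = "ratio"


# Hypothetical per-column annotations: computational type plus statistical scale.
annotations = {
    "profession": {"dtype": "string", "scale": Scale.NOMINAL},
    "region": {"dtype": "string", "scale": Scale.NOMINAL},
    "year": {"dtype": "integer", "scale": Scale.INTERVAL},
}


def is_categorical(column):
    return annotations[column]["scale"] in (Scale.NOMINAL, Scale.ORDINAL)


def chi_square_applicable(column_a, column_b):
    """A chi-square test of independence expects two categorical variables."""
    return is_categorical(column_a) and is_categorical(column_b)
```

With annotations like these, a module or a user interface could check `chi_square_applicable("profession", "region")` before offering the test, and use the same metadata to suggest suitable visualisations.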

makkus commented 7 months ago

I meant integer, string and so on.

To kiara, those are treated the same as more complex types, so in that regard there is no difference at all.

So what I am trying to say is: 1) we may need to disambiguate the notion of data types for some of our users as they may be used to the data types as meant in a data analysis context 2) I think that if we provide a way to facilitate data description for users who start their analysis by a data description, exactly as you said by maybe offering a possibility to users to add metadata somehow, there may be some value in offering such a possibility.

Yes, good point. Not sure how to best do that tbh, I usually try to say 'kiara data type' to avoid confusion, but that is not really fool-proof.

MariellaCC commented 7 months ago

If I understood correctly @makkus, the Kiara table data type can/will potentially "accept" as a valid table, elements such as an Arrow table or a Polars data frame or a Pandas data frame, for example. For the Kiara "array" data type, would a numpy array for example be considered valid? Is there somewhere specific in the https://github.com/DHARPA-Project/kiara_plugin.core_types/tree/develop repo to find out about what elements are considered valid?

Also, I was wondering, at the moment, if one would want to pass a gensim "dictionary" for example (https://radimrehurek.com/gensim/corpora/dictionary.html) or a model (https://radimrehurek.com/gensim/models/ldamodel.html) from one Kiara module to another, would there be a way to do it?

makkus commented 7 months ago

The kiara table data type represents all data that is structured in a tabular, typed format; thinking about it in terms of Python data types is probably not a good idea. The implementation details should not matter when thinking about kiara data types, only the 'characteristics' or 'qualities' of the data itself. If a Python data type has those characteristics, or can be transformed to them without user input and loss of information, it falls under the umbrella of a kiara data type, basically.

On a technical level, that transformation has to happen somewhere, and needs to be implemented of course, but that's about it: an implementation detail. When thinking about data types, I think it is a good idea to try to get our users to only think about the structure of their data. For module creators that's obviously a bit different, but even those would ideally not focus too much on the Python data type.

Anyway, that being said, here is the implementation of the kiara 'table' data type, and esp. the point where a Python object is converted into it: https://github.com/DHARPA-Project/kiara_plugin.tabular/blob/cf556c41fb590d91aed5b7af7c112069bf895941/src/kiara_plugin/tabular/data_types/table.py#L42 (KiaraTable being the actual Python class that wraps the data type functionality that can be used by a module creator, and that you would receive if you access the .data attribute on a Value object).

This function basically just forwards to here: https://github.com/DHARPA-Project/kiara_plugin.tabular/blob/cf556c41fb590d91aed5b7af7c112069bf895941/src/kiara_plugin/tabular/models/table.py#L23

Similarly, here is the respective code for the 'array' data type:

The main work those implementations have to do is handle how the actual 'raw' data/information is serialized as well as de-serialized, and provide common utility methods that are likely required to use data of a particular type within a module (for example 'export' as a polars dataframe, or run a sql query on it). All in the most practical and efficient way, taking into account memory overhead, caching, etc.

I haven't tried it yet, but in theory numpy arrays should already be supported as transformation sources for kiara arrays. This code:

            try:
                array_obj = pa.array(data)
            except Exception:
                pass

would be responsible for that, so feel free to try if it works and if not we'll fix it.

As for the gensim example, you'd have to 'define' the data you want such a data type to contain. In theory, a 'normal' kiara dict could probably contain the gensim 'dictionary' (from a first look). Having your own custom data type would make it easier to trust the data within though, because you can parse and validate the incoming data when creating the value, instead of having to do that every time you (re-)create it from a kiara dict within a module. I don't have a good answer as to when to create a kiara data type and when not. As with modules, we don't want proliferation of too many types, because that gets confusing for users, and it makes it harder for different modules to work together (they need the same type in the input/output fields that get connected).
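The 'validate once when creating the value' argument can be sketched as follows. `TokenDictionary` is a hypothetical type made up for this example: it is neither gensim's API nor an existing kiara data type, and the checks it performs are assumptions about what such a type might want to guarantee.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenDictionary:
    """Hypothetical custom type: a mapping of token -> integer id."""
    token2id: dict

    @classmethod
    def from_mapping(cls, mapping):
        # Parse and validate once, when the value is created.
        for token, token_id in mapping.items():
            if not isinstance(token, str) or not isinstance(token_id, int):
                raise TypeError(f"invalid entry: {token!r} -> {token_id!r}")
        ids = list(mapping.values())
        if len(set(ids)) != len(ids):
            raise ValueError("token ids must be unique")
        return cls(dict(mapping))

    def id_of(self, token):
        # Modules receiving this type can rely on the checks above.
        return self.token2id[token]


vocab = TokenDictionary.from_mapping({"archive": 0, "letter": 1})
```

With a plain dict, every module that receives the data would have to repeat those checks (or trust the data blindly); with a dedicated type, the checks run once and the type name itself tells the receiving module what it can rely on.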

For the ldamodel, we'll have to look into how that is implemented on a lower level, basically 'how the bytes are arranged', and what the access patterns are. And then decide whether we could fit it into one of our existing data types (maybe a table or sqlite database?), or whether we should create a new data type.

In both cases, as a first step, it would probably be a good idea to define exactly the qualities of each data type. Meaning: without knowing anything about the implementation, if you have data of each type, what exact information does it need to store, and what exact information do you need to access when you get such data as input in a module? So maybe you could start with that? As I said, don't think about the existing Python types; think higher level and describe the qualities with words rather than code. As a second step, figure out how other libraries do it and whether they have a 'similar' Python type to hold that type of information, as it's likely we'd want our wrapping data type to be able to transform back and forth, and the implementation of those could also guide us somewhat in figuring out the qualities in the first place.

All of this is non-trivial, and initially more of a software architecture task than a development one, but if you start with the two steps I described above we can go from there. Also, to give some context, a lot of what you figure out there would go into the (user-facing) documentation of a data type. As I said, we don't want users to think about Python classes, we want them to think about the structure of the information/data they deal with.

How often we'll need to create a data type within a plugin for a specific research domain, I don't really have enough experience myself, so can't really say with any level of confidence. That's one of the main reasons I was being so persistent about building up a collection of different, real-life workflows, because that's the best way I can see to 'extract' what data types we'll need. Before starting to implement the modules themselves.

makkus commented 7 months ago

Ah, forgot to mention: also make a list of all the modules (existing or planned) that take the type you deal with as input and/or output.

MariellaCC commented 7 months ago

Thanks for your answers, I will try with that in mind. Concerning topic modeling, a recap of the modules list is available via the 3 issues here https://github.com/DHARPA-Project/kiara_plugin.topic_modelling/issues. The list is being refined as I go.

MariellaCC commented 7 months ago

There might also be a question about pre-trained models and how to store them if a user would want to do that. But this use-case needs to be confirmed, so at the moment I did not emphasize it, but this is something that I plan on clarifying as well with the team.

makkus commented 7 months ago

Ok, so, one thing that would be good to know would be how 'similar' those models are, and whether we can fit them into the same data type, or whether we need 2 separate ones. Having only one would of course be preferable, but whether that is possible should become clear fairly soon, by thinking about the inputs of the modules the models would be used in. Does it make sense to have either one of those as input, or does only one make sense? Those kinds of things...

MariellaCC commented 7 months ago

Sorry, I am not sure I understand the "2 separate ones", what would be the two data types in such a case for example? or what would potentially make it necessary for two types?

makkus commented 7 months ago

'pre-trained models' and the other one you mentioned earlier. Or are they the same data? Sorry, I haven't really looked too much into any of that.

MariellaCC commented 7 months ago

Right, no, they are two different things; there will be more precise information soon on these specific use cases via the https://github.com/DHARPA-Project/kiara_plugin.topic_modelling repo. In the meantime, I will proceed with the general information you provided here about data types, and we can refine it afterwards if that's ok.