Column names/tags - Githubissues

kahaaga commented 1 year ago

Describe the feature you'd like to have

Perhaps this should be two separate issues, but here it goes:

Column names

It would be really, really nice to have labelled columns for a dataset, so that we could index by name. I imagine something like what is done in https://github.com/SciML/LabelledArrays.jl could work.

When constructing a dataset without giving names, names could just be assigned automatically as x1, x2, ..., xN. When horizontally concatenating datasets D1 and D2, names could be assigned as D1x1, D1x2, ..., D1xN, D2x1, D2x2, ..., D2xn. In general, column names should be inferred from the variables from which the datasets are constructed.

Indexing columns could then be done as x.colname.
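
To make the naming scheme above a bit more concrete, here is a rough sketch in plain Julia. The helper names `default_names` and `hcat_names` are made up for illustration and are not part of any package. The prefixes are passed explicitly here, since inferring them from variable names would probably need a macro.

```julia
# Hypothetical helpers, only meant to illustrate the proposed naming scheme.

# Default names x1, x2, ..., xN for a dataset with n columns.
default_names(n::Integer) = ntuple(i -> "x$i", n)

# Names after horizontally concatenating two datasets: each column name is
# prefixed with the label of the dataset it came from.
function hcat_names(prefix1::AbstractString, names1, prefix2::AbstractString, names2)
    return (map(n -> prefix1 * n, names1)..., map(n -> prefix2 * n, names2)...)
end

names_D1 = default_names(3)   # names for a 3-column dataset
names_D2 = default_names(2)   # names for a 2-column dataset
combined = hcat_names("D1", names_D1, "D2", names_D2)
```

With these definitions, `combined` holds five names, starting with "D1x1" and ending with "D2x2".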

Is this possible to do without too much work? I understand if you want to keep the Dataset construct as lightweight as possible.

Arbitrary tags

If the above is possible/wanted, could it also be possible for the user to assign arbitrary tags to columns? This came up when I was implementing a mixed mutual information estimator in CausalityTools. For discrete variables (i.e. integers), this estimator does a certain thing, and for continuous variables (i.e. floats) it does something else.

Because I need a mix of discrete and continuous variables in the same Dataset, I need to convert all values to floats first. But then there is no obvious way to programmatically distinguish the columns that contain discrete data from the columns that contain continuous data. The only way to do so is to keep track manually of which columns represent which types of data. Therefore, the user has to specify which columns are discrete and which are continuous as an estimator parameter. It would be much nicer if this could just be inferred directly from the Dataset instance, by investigating what tags a certain column has.

For example, add_tag(x::Dataset, colnumber::Int, tag) could simply do push!(x.tags[colnumber], tag). Then, when needing to check whether a column has a certain tag, one just checks tagtolookfor ∈ x.tags[colnumber].
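
A minimal, self-contained sketch of that idea follows. It uses a stand-in type `TaggedColumns` rather than the real Dataset, so the type, the `has_tag` and `columns_with_tag` helpers, and the choice of `Symbol` tags are all assumptions for illustration. The function is spelled `add_tag!` to follow the Julia convention for mutating functions.

```julia
# Stand-in for a Dataset with a `tags` field; NOT the actual Dataset type.
struct TaggedColumns
    tags::Vector{Set{Symbol}}   # one set of tags per column
end

# Start with an empty tag set for each of the `ncols` columns.
TaggedColumns(ncols::Integer) = TaggedColumns([Set{Symbol}() for _ in 1:ncols])

# Attach a tag to a column (mutates the tag set of that column).
function add_tag!(x::TaggedColumns, colnumber::Int, tag::Symbol)
    push!(x.tags[colnumber], tag)
    return x
end

# Check whether a column carries a given tag.
has_tag(x::TaggedColumns, colnumber::Int, tag::Symbol) = tag ∈ x.tags[colnumber]

# Indices of all columns carrying a given tag.
columns_with_tag(x::TaggedColumns, tag::Symbol) = findall(t -> tag ∈ t, x.tags)

x = TaggedColumns(3)
add_tag!(x, 1, :discrete)
add_tag!(x, 2, :continuous)
add_tag!(x, 3, :discrete)
discrete_cols = columns_with_tag(x, :discrete)
```

An estimator could then call something like `columns_with_tag(x, :discrete)` instead of asking the user to pass the column indices by hand.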

If possible, sketch out an implementation strategy

For the column name feature, I don't know how this would be done at the moment, as I haven't delved too deep into the code of LabelledArrays yet.

For the tags, it would be as simple as adding a mutable field tags to Dataset.

Do any of these features sound like something that belongs in this package?

Datseris commented 1 year ago

Sure, but this is probably better off as a new type NamedDataset that has a field NTuple{D, String} for the names. The names are given all together along with the data. Mutability of the names is not important and isn't really worth sacrificing performance for. Making a new dataset with new names is free if one really needs to change the names after creation.

Yes for default names x1, ..., xD.

Datseris commented 1 year ago

everything in the AbstractDataset interface has compiler-deducible size, so the names must also behave the same way and can't be pushed/popped.
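
For illustration only, the suggestion in the comments above could look roughly like the sketch below. It is a standalone toy and not the StateSpaceSets.jl implementation: it stores points as a plain `Vector` of tuples instead of subtyping AbstractDataset, and the `column` helper and the `getproperty` overload are assumptions about how access by name might be spelled.

```julia
# Toy version of the proposed NamedDataset; NOT the package implementation.
struct NamedDataset{D, T}
    data::Vector{NTuple{D, T}}   # one D-dimensional point per entry
    names::NTuple{D, String}     # fixed at construction, so its length is known from the type
end

# Default names x1, ..., xD when none are given.
NamedDataset(data::Vector{NTuple{D, T}}) where {D, T} =
    NamedDataset(data, ntuple(i -> "x$i", Val(D)))

# Return the column with the given name as a vector.
function column(ds::NamedDataset, name::AbstractString)
    j = findfirst(==(name), getfield(ds, :names))
    j === nothing && throw(ArgumentError("no column named \"$name\""))
    return [p[j] for p in getfield(ds, :data)]
end

# Allow `ds.colname`, while keeping `ds.data` and `ds.names` working.
function Base.getproperty(ds::NamedDataset, s::Symbol)
    if s === :data || s === :names
        return getfield(ds, s)
    end
    return column(ds, String(s))
end

ds = NamedDataset([(1.0, 2.0), (3.0, 4.0)])
second = ds.x2                                        # column access with default names
named = NamedDataset([(1.0, 2.0)], ("temp", "pressure"))
```

Changing names then means constructing a new `NamedDataset` around the same `data` vector, which does not copy the points.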

kahaaga commented 1 year ago

"Mutability of the names is not important and isn't really worth sacrificing performance for."

In this particular use case there is no need for mutability. I simply need a way of saying "this column/variable has label X" upon creation.

Datseris commented 1 year ago

Given that the API of the package is written against the abstract type, making this is very easy and hence a good first issue. The only thing necessary is to extend column access so that it also works by name.

JuliaDynamics / StateSpaceSets.jl

Column names/tags #6
