`metadata` method - GitHub issues

Tokazama commented 4 years ago

It is often the case that one wants to attach metadata of some sort to an array/graph/etc. How do people feel about adding something basic like metadata(x) = nothing that can then be extended by other packages?
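A minimal sketch of what such a fallback could look like. Everything here lives in one module for simplicity, and `LabeledVector` is a hypothetical type standing in for something a downstream package might define; it is not taken from any existing package.

```julia
# Generic fallback, as proposed: objects carry no metadata unless a package opts in.
metadata(x) = nothing

# Hypothetical downstream type (not from any real package) that extends the function.
struct LabeledVector{T} <: AbstractVector{T}
    data::Vector{T}
    meta::Dict{Symbol,Any}
end

# Minimal AbstractVector interface so the wrapper behaves like its data.
Base.size(v::LabeledVector) = size(v.data)
Base.getindex(v::LabeledVector, i::Int) = v.data[i]

# The package-specific method overrides the `nothing` fallback.
metadata(v::LabeledVector) = v.meta

plain = [1, 2, 3]
labeled = LabeledVector([1, 2, 3], Dict{Symbol,Any}(:unit => "kg"))
```

With these definitions `metadata(plain)` falls through to the generic method, while `metadata(labeled)` returns the attached dictionary.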

bkamins commented 4 years ago

@pdeffebach had some good design ideas about it in DataFrames.jl in the past. Now, finally after 0.21.0 release, we are planning to add this functionality to DataFrames.jl.

As this is raised on a higher level let me give the API I envision for DataFrames.jl for now:

metadata(::DataFrame) returns a Union{Nothing, Dict{Symbol,Any}} that - if filled - gives a DataFrame-level metadata (this can be arbitrary metadata). The restriction would be that symbols starting with DF_ in their name would be reserved for internal use of DataFrames.jl (as a convention)
metadata(obj::OTHER_TYPES_DEFINED_BY_DATAFRAMES) = metadata(parent(obj))
metadata(::DataFrame, ::ColumnIndex) returns a String (by default nothing) - which would indicate just a verbose name of the column, with default being just column name
metadata(obj::OTHER_TYPES_DEFINED_BY_DATAFRAMES, ::ColumnIndex) similar to the above if the column is present in the other type.
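The table-level part of this proposal could be sketched roughly as follows. `ToyFrame`, `ToyView` and `isreserved` are hypothetical names used only for illustration; this is not DataFrames.jl code.

```julia
# Toy stand-ins for a data frame and a view onto it (illustrative only).
struct ToyFrame
    columns::Vector{Vector{Float64}}
    meta::Union{Nothing,Dict{Symbol,Any}}
end
struct ToyView
    parent::ToyFrame
    rows::UnitRange{Int}
end
Base.parent(v::ToyView) = v.parent

# Table-level metadata: `nothing` when unset, otherwise a Dict{Symbol,Any}.
metadata(f::ToyFrame) = f.meta
# Derived objects forward to their parent, as in the proposal.
metadata(v::ToyView) = metadata(parent(v))

# Keys starting with DF_ are reserved by convention in the proposal.
isreserved(key::Symbol) = startswith(String(key), "DF_")

frame = ToyFrame([[1.0, 2.0, 3.0]], Dict{Symbol,Any}(:source => "survey"))
view12 = ToyView(frame, 1:2)
```

Because the view forwards to its parent, `metadata(view12)` and `metadata(frame)` are the same dictionary object.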

If we agree to this design then I can implement it. The key challenge is rules of propagation of metadata, but this is not DataAPI.jl related thing so I leave this discussion for later.

CC @pdeffebach, @nalimilan

nalimilan commented 4 years ago

See https://github.com/JuliaData/DataFrames.jl/pull/1458 for the last attempt at implementing this in DataFrames. Two points:

I think we need something more general than just having a custom label/verbose name for columns. For example it could be useful to store units, information about measurement, etc. Label can just be a standard field among others.
We also need an API to set metadata and to retrieve the list of fields that have been set.

In general a choice has to be made between having 1) a single function in the API which would return a metadata dict which would have to implement specific methods (getindex, setindex! and keys notably); or 2) several functions in the API that would allow doing these operations directly. See the table in my first comment at https://github.com/JuliaData/DataFrames.jl/pull/1458. I think returning an object is simpler since it allows reusing the standard dict API.
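The two options can be contrasted in a small sketch. `ToyTable`, `getmeta`, `setmeta!` and `metakeys` are hypothetical names chosen for illustration, not an existing API.

```julia
# Toy table type, purely illustrative (not the DataFrames.jl implementation).
struct ToyTable
    columns::Dict{Symbol,Vector}
    meta::Dict{Symbol,Any}
end

# Option 1: one accessor returning a dict-like object; callers reuse
# getindex, setindex! and keys from the standard dictionary API.
metadata(t::ToyTable) = t.meta

# Option 2: dedicated functions for each operation (hypothetical names).
getmeta(t::ToyTable, key::Symbol) = t.meta[key]
setmeta!(t::ToyTable, key::Symbol, value) = (t.meta[key] = value; t)
metakeys(t::ToyTable) = collect(keys(t.meta))

t = ToyTable(Dict{Symbol,Vector}(:x => [1, 2]), Dict{Symbol,Any}())

metadata(t)[:source] = "survey 2020"   # option 1: plain dict syntax
setmeta!(t, :units, "USD")             # option 2: dedicated setter
```

Option 1 needs a single exported name, while option 2 triples the API surface for the same operations.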

Tokazama commented 4 years ago

Metadata could technically be stored at any level of something like a table. For example, each column could be a MetadataArray (i.e. from MetadataArrays.jl) and the table itself could have metadata. I worry that if we started trying to design this around column based indexing it would needlessly complicate and potentially limit its wider usability. Even the definition of what "metadata" is to different people is likely to vary so I'm not sure we should even guarantee it returns a certain type.

bkamins commented 4 years ago

Initially I wanted to write that metadata(::DataFrame, ::ColumnIndex) could also return a Dict{Symbol, Any} - for me it would be OK. In this case there should also be some namespace of reserved key names for internal use.

So personally I would prefer the "single function that returns a metadata dict" approach and later the user can just work on the Dict.

Ah - and now I see we could support metadata(::DataFrame, ::ColumnIndex) that would return a NamedTuple of dictionaries associated with columns.

I agree with @Tokazama that different people will want different things from metadata therefore I believe the API we provide should be maximally simple and flexible. Therefore I would prefer to think that metadata is just a Dict and there is one global dict for a data frame as a whole and then each column can have column specific dicts. Then the rest - how to work with it - would be delegated to a decision of the user.

Tokazama commented 4 years ago

Allowing the user to decide what to do with whatever metadata returns also provides the freedom to further specialize on this later. For example you could always do something like colmeta(df, col) = metadata(df)[col] and then you wouldn't have to worry about reserving key names.

Would a simple PR to DataAPI.jl on this be a good next step right now?

bkamins commented 4 years ago

One will probably need to reserve key names anyway. In particular I do not think that metadata(df)[col] to return a metadata for column col is a good API (if we allowed this then there would be no way to specify global metadata for a table as a whole).
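The collision described here can be shown with a toy example. `MiniTable` is a hypothetical stand-in for a table type; `colmeta` is the helper suggested earlier in the thread.

```julia
# Toy stand-in for a table whose metadata is a single dict (illustrative only).
struct MiniTable
    names::Vector{Symbol}
    meta::Dict{Symbol,Any}
end
metadata(t::MiniTable) = t.meta

# The helper suggested in the thread: column metadata lives under the column's name.
colmeta(t, col) = metadata(t)[col]

t = MiniTable([:income, :source], Dict{Symbol,Any}())
metadata(t)[:income] = "Personal income"      # intended as column metadata
metadata(t)[:source] = "Household survey"     # intended as table-level metadata

# Because a column is also called :source, the two meanings collide.
ambiguous = colmeta(t, :source)
```

Here `colmeta(t, :source)` returns the table-level entry, because nothing separates the namespace of column names from that of global keys.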

I think this is such a major thing that we should wait for other JuliaData members to comment before moving forward.

nalimilan commented 4 years ago

Maybe we can say that metadata(tbl) and metadata(tbl, col) have to return objects implementing the AbstractDict API, giving respectively the table-wise and column-wise metadata? That should be flexible enough for all implementations.

(In practice for DataFrames we would probably store column metadata internally as vectors with one entry per column as https://github.com/JuliaData/DataFrames.jl/pull/1458 does but that can be exposed to users via a lazy AbstractDict object per column. What could be useful to provide in addition is a way to access these vectors for convenience/efficiency.)
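One way to picture the lazy per-column object is the sketch below. It assumes a "dict of vectors" storage; `ColumnMetaStore`, `ColumnMetaView` and `columnmeta` are hypothetical names, and unset entries are represented by `nothing`, which is a choice of this sketch only.

```julia
# Hypothetical storage: one vector per metadata field, one entry per column.
struct ColumnMetaStore
    names::Vector{Symbol}
    fields::Dict{Symbol,Vector{Any}}   # field name => per-column values
end

# Lazy per-column view; it holds no data of its own.
struct ColumnMetaView <: AbstractDict{Symbol,Any}
    store::ColumnMetaStore
    col::Int
end

function columnmeta(store::ColumnMetaStore, name::Symbol)
    idx = findfirst(==(name), store.names)
    idx === nothing && throw(ArgumentError("no column named $name"))
    return ColumnMetaView(store, idx)
end

# Minimal read/write dictionary interface, forwarding to the shared vectors.
Base.length(v::ColumnMetaView) = length(v.store.fields)
Base.get(v::ColumnMetaView, key, default) =
    haskey(v.store.fields, key) ? v.store.fields[key][v.col] : default
function Base.iterate(v::ColumnMetaView, state...)
    next = iterate(v.store.fields, state...)
    next === nothing && return nothing
    (key, vals), newstate = next
    return (key => vals[v.col], newstate)
end
function Base.setindex!(v::ColumnMetaView, value, key::Symbol)
    # A new field gets one slot per column, filled with `nothing` until set.
    vals = get!(() -> Any[nothing for _ in v.store.names], v.store.fields, key)
    vals[v.col] = value
    return v
end

store = ColumnMetaStore([:income, :age], Dict{Symbol,Vector{Any}}())
m = columnmeta(store, :income)
m[:label] = "Personal income"
```

Users see an ordinary dictionary per column, while the underlying vectors stay directly accessible through `store.fields` for bulk operations.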

bkamins commented 4 years ago

have to return objects implementing the AbstractDict API

Agreed

In practice for DataFrames we would probably store column metadata internally as vectors with one entry per column

We can discuss what is best in the PR for DataFrames.jl when it is done (essentially we have two options: dict of vectors or vector of dicts).

bkamins commented 4 years ago

As I have been thinking about this issue and #1458 I came to the conclusion that we should go back to the fundamentals. And the core issue is:

It is often the case that one wants to attach metadata of some sort to an array/graph/etc.

What I mean is that while we all seem to agree that adding metadata to tables is needed, I would first discuss what kind of metadata we really think people would store in practice. This is a relevant question, as I think we should not create functionality that would later be very rarely used. Conversely - if we know exactly what we actually want to use, we can design an API that supports the required use cases cleanly.

My two concerns are:

persistence; most storage formats do not allow saving and loading this metadata, which means that, at least in my understanding, the use cases where people will use metadata will be situations of non-persistent metadata (i.e. something you attach to your table temporarily for programming convenience)
performance; we do not want to kill the performance of basic operations on tables, because the "table processing engine" would constantly check if metadata needs updating, or if the cost of updating the metadata would be large, or if the memory footprint of allowing to store metadata would be non-negligible (ideally if there is no metadata then the performance should not be affected)

So now let me go down to the starting question - what metadata we see that would be actually used (this is not a comprehensive list - please comment what you think would be really used - not just potentially used):

metadata for handling how data frame is shown (things like overriding show defaults, maybe custom decimal delimiter, maybe - if in the future we add integration with PrettyTables.jl, some settings of all the options that package provides)
custom column labels (possibly shown when printing a data frame) - this is tempting, but has a problem with "persistence" - i.e. most formats will not store this information, so my question is - how in practice we see that people will use this functionality?
custom row labels - the same situation
setting flags that some columns should be treated in a special way, e.g. for geospatial or time series analysis of data frames - this is tempting, but myself I am not 100% convinced it is a good idea, as it will be hard to ensure the metadata is consistent with a parent table (e.g. GeoPandas uses this strategy, and it has a problem with ensuring this integrity)

Tokazama commented 4 years ago

In addition to what you've mentioned here are some types of metadata that I think would be useful for me personally to be able to store:

Source and collection information: I often have many tables that have some acquisition metadata pertinent to them that is not row or column specific but describes important aspects of all the data in one table.
Column tracking: When performing semi-automated feature creation I like to keep track of certain operations/parameters/weights that resulted in the formation of a column of measures.
Attaching metadata to a column that changes how it dispatches later on

it will be hard to ensure the metadata is consistent with a parent table (e.g. GeoPandas uses this strategy, and it has a problem with ensuring this integrity)

I think it depends on how much you care to take ownership of handling all metadata. I would prefer handling metadata be given a minimal interface. It could potentially have methods for things like joins so that something like join_metadata would also join dictionaries but could be taken advantage of for custom metadata types.

I also think that I/O on metadata should be entirely dependent on the package supporting I/O. There aren't many file types equipped to flexibly handle metadata and it seems like the best thing for DataAPI.jl is to just make it simple to extract metadata.

pdeffebach commented 4 years ago

I agree with all that is said here. As the author of one of the previous attempts I think that meta-data is important and people coming from R and Python often don't fully appreciate how useful metadata is for Stata users and how it has hurt the adoption of R in applied economics, especially household surveys.

custom column labels (possibly shown when printing a data frame) - this is tempting, but has a problem with "persistence" - i.e. most formats will not store this information, so my question is - how in practice we see that people will use this functionality?

My use of metadata in Stata was twofold

Pretty printing of columns and keeping track of data. For example, the table below would not have been possible to make programmatically without extensive use of column labels. I couldn't imagine trying to write this in R because column metadata in R doesn't persist after joins.

[Image: Screenshot from 2020-05-24 11-05-32]

Keeping track of the data-cleaning process. In the above table, the variable "Standardized income index" is composed of the 3 variables below it. The note for that variable will tell us as much, and was automatically generated. If you were to type

note list standardized_index

you would see a note that said something along the lines of

A standardized index of 3 variables: net_earnings, consumption, durable_assets

Stata also has metadata about a table, which is often used to denote a source or author. I never used that feature.

With regards to IO, I don't see a huge problem with saving a data frame to two CSVs and providing a convenience method for adding metadata to a DataFrame when the metadata is stored as a Table. Maybe it's a bit heavy handed but it's robust.

Tokazama commented 4 years ago

I think these are all great use cases that I've wanted at some point. As someone who deals with lots of different types of metadata I'd really like to emphasize that less is more as this is implemented. It's easy to get stuck in the weeds on every little implementation detail because you have the combination of situations that arise from row specific, column specific, and general table metadata and all the different types of metadata.

This is loosely the kind of structure I'm considering using...

struct Table{T<:AbstractVector,M<:Union{Nothing,AbstractDict{Symbol,Any}}}
    data::Vector{T}
    index::Dict{Symbol,Int}
    meta::M
end

metadata(x::Table) = getfield(x, :meta)

Users don't have to ever worry about metadata unless they decide they want it and developers can create whatever type of fancy metadata that changes dispatch as long as it is a subtype of AbstractDict{Symbol,Any}.

I would think column specific metadata would be easiest if implemented as a column vector with metadata so I could just do metadata(table.column_name). Otherwise concatenating/merging/joining columns becomes the responsibility of DataAPI.jl. Seems more simple for this to be implemented at the level of something like vcat(::MetadataVector, ::MetadataVector).

nalimilan commented 4 years ago

I agree with most of what has been said. Just one point:

I would think column specific metadata would be easiest if implemented as a column vector with metadata so I could just do metadata(table.column_name). Otherwise concatenating/merging/joining columns becomes the responsibility of DataAPI.jl. Seems more simple for this to be implemented at the level of something like vcat(::MetadataVector, ::MetadataVector).

I'm afraid this wouldn't be workable, as it would require users to deal with another new kind of vector just to store metadata. That would force recompiling all functions for that type, and it wouldn't be easy to deal with e.g. CategoricalArray. It would also make things appear more complex for users who load files with metadata (e.g. Stata files with column labels), while one of the strengths of DataFrames is that they just wrap standard arrays.

We can say the table type is responsible for preserving metadata across concatenations/joins. DataAPI itself doesn't have to know anything about that.

pdeffebach commented 4 years ago

With regards to spatial data, which is a natural use case of this, is there anyone in the Julia Data community who has a really detailed knowledge of R's sf package?

It's the best thing ever, being able to use all of dplyr while also maintaining spatial metadata and using spatial joins etc. is incredible.

Perhaps someone who has worked on that project could provide some insights.

bkamins commented 4 years ago

The project is going to be done during JSoC this year. And one of the reasons I am pressing to decide on metadata now is to have a clear guidance how this extra package should integrate with DataFrames.jl.

quinnj commented 4 years ago

I'm a little slow/late to the discussion here, but have thought a bit about this. I agree with the idea that this is a way that Julia/DataFrames can really stand apart/improve on the situation from R/pandas; having useful metadata integrated w/ a DataFrame could be really powerful when used in the right contexts.

That said, I worry about some of the suggestions around metadata use because they start to become so fundamental or logic-driven. IMO, if some kind of data starts to become so critical we're changing how things are computed/etc. then it probably deserves a more structured solution than just a metadata entry in a DataFrame.

IMO, metadata should be primarily "descriptive" about the object; give context, explain values and cardinality thereof; tweaking printing/showing seems fine to me. I just worry about packages starting to abuse metadata when they should really be creating a new AbstractArray type or something (I mean, you could imagine someone trying to implement CategoricalArrays by just using metadata).

My other thought is that while I agree that DataFrames can do a tight integration w/ metadata, I do think we should allow/encourage metadata to be attached/used generically on objects, including columns. There are going to be a lot of cases across the ecosystem where you're not dealing w/ a DataFrame, and it will be useful to support metadata in a variety of ways on columns, rows, etc. But yes, DataFrames can choose how it approaches its use/integration w/ metadata, either at the table-level or column level.

bkamins commented 4 years ago

I have discussed:

I just worry about packages starting to abuse metadata when they should really be creating a new AbstractArray type or something

with @visr in the context of geospatial data (temporal data is the same I think) and we came to the same conclusion. The logic in packages using tables should primarily be based either on type or a trait of a column (trait is probably preferable as currently Julia does not allow for multiple inheritance), but not metadata attached to it.
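A rough sketch of the trait-based alternative, using the common "Holy traits" pattern. All names here (`ColumnKind`, `GeoColumn`, `columnkind`, `Point2D`, `describe_column`) are hypothetical and chosen only to illustrate the idea.

```julia
# Trait types (hypothetical names) describing how a column should be treated.
abstract type ColumnKind end
struct PlainColumn <: ColumnKind end
struct GeoColumn <: ColumnKind end

# Default trait: any vector is a plain column.
columnkind(::Type{<:AbstractVector}) = PlainColumn()

# A hypothetical geometry element type opts in by overriding the trait.
struct Point2D
    x::Float64
    y::Float64
end
columnkind(::Type{<:AbstractVector{Point2D}}) = GeoColumn()

# Logic dispatches on the trait, never on a metadata entry.
describe_column(v::AbstractVector) = describe_column(columnkind(typeof(v)), v)
describe_column(::PlainColumn, v) = "plain column of length $(length(v))"
describe_column(::GeoColumn, v) = "geometry column with $(length(v)) points"
```

Since the trait is computed from the column's type, it cannot drift out of sync with the data the way a separately stored metadata flag can.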

So given this - are there any more comments on how the reference API should look?

quinnj commented 4 years ago

So I'm not sure what exactly the proposed API is? Is it just that metadata(x) returns Union{Nothing, AbstractDict}? Here are a couple thoughts/ideas:

I'm not sure we should require AbstractDict specifically vs. "an object that supports AbstractDict methods" (or as I like to call it, AbstractDictLike); namely it'd probably be nice to allow NamedTuple to be returned from metadata, which isn't an AbstractDict, but does support the interface; it'd be good to be very clear about what exactly is required of the object returned
If we're thinking of requiring Union{Nothing, AbstractDictLike}, I wonder if we should just require returning an "AbstractDictLike" and we can return an empty one by default; we could then provide convenience get/put methods. Alternatively, we could not require a specific object type to be returned and just have the interface be metadata(x) and metadata!(x, meta). I kind of like the idea of requiring AbstractDictLike and returning a NamedTuple() by default
In terms of implementation, I've been looking a lot at how @doc is implemented in Base and I think it could make a lot of sense to do something similar for metadata; that is, instead of modifying DataFrame to have a metadata field, there'd be a global (or per module) metadata IdDict that could store metadata per object. That would allow attaching metadata to all kinds of objects w/o needing wrappers. I think it also helps reinforce the idea that it's metadata, or somewhat detached from the object and not to be too relied upon for program logic. Along this vein, it could make sense to make an entire Metadata.jl package that copied the Base.Docs implementation and could be used by packages everywhere. It would be pretty lightweight, but could provide a lot of flexibility and a clean, standard API that other packages can integrate with. If we want to go that route, we probably don't need a definition in DataAPI.jl
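A minimal sketch of a side-table store in this spirit. It assumes a single global `IdDict` keyed by object identity, which is a simplification; this is not how Base.Docs is actually implemented, and `METADATA_STORE` is a hypothetical name.

```julia
# Global side table keyed by object identity (assumption: a plain IdDict;
# a real package would likely need weak references to avoid leaking memory).
const METADATA_STORE = IdDict{Any,Any}()

# Attach metadata to an existing object without wrapping or modifying it.
metadata!(x, meta) = (METADATA_STORE[x] = meta; x)

# Objects without an entry get an empty NamedTuple, as floated in the thread.
metadata(x) = get(METADATA_STORE, x, NamedTuple())

v = [1.0, 2.0, 3.0]
metadata!(v, (label = "Personal income", unit = "USD"))
```

The vector `v` stays an ordinary `Vector{Float64}`; only the side table knows about its metadata, and an equal but distinct vector has none.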

bkamins commented 4 years ago

Along this vein, it could make sense to make an entire Metadata.jl package that copied the Base.

Actually I would prefer this idea as it would be much more composable. The consequence for DataFrames.jl users would be:

metadata will not propagate when the object is copied.
still it will propagate when it is just passed, not copied.

So: df.col will keep the metadata of col and similarly df.col = x will make col have the metadata of x.

nalimilan commented 4 years ago

Ah yes that's interesting. Indeed it's quite convenient in R to be able to attach metadata to any object, and yet in Julia we don't want to have to wrap any object in a special type just to add metadata.

Though losing the metadata on copy would be annoying. That could easily be fixed in DataFrames by ensuring we copy/readd the metadata when copying the columns (this would be needed for important cases like getindex but also select without transformations). But it's not easy to fix when the user calls copy on an arbitrary object: doing this may be too costly for small objects, and for large ones it would require support from all packages (including Base...). Maybe it's not the end of the world though if one has to do copywithmetadata(x) when needed?
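The copy behaviour under an identity-keyed store, and an explicit helper along the lines of copywithmetadata, might look like this. The store `META` and the function `attach!` are hypothetical names for this sketch.

```julia
# Identity-keyed side table (hypothetical, for illustration).
const META = IdDict{Any,Any}()
attach!(x, meta) = (META[x] = meta; x)
metadata(x) = get(META, x, nothing)

# Hypothetical helper named in the thread: copy the object, then re-attach.
function copywithmetadata(x)
    y = copy(x)
    meta = metadata(x)
    meta === nothing || attach!(y, copy(meta))
    return y
end

col = attach!([10, 20, 30], Dict{Symbol,Any}(:label => "Income"))
same = col                      # passing the same object keeps its metadata
plaincopy = copy(col)           # a new object: nothing is attached to it
fullcopy = copywithmetadata(col)
```

`plaincopy` has no metadata because `copy` creates a new identity, whereas `fullcopy` carries its own independent copy of the dictionary.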

(Otherwise, returning an empty NamedTuple by default (instead of nothing) sounds fine. We really need traits in Base!)

bkamins commented 4 years ago

Though losing the metadata on copy would be annoying.

Personally I would feel safer if we worked this way. I would prefer to have a function that copies metadata explicitly that can be called if someone needs it.

Tokazama commented 4 years ago

There are a lot of packages that use the term "metadata" (e.g, ImageMetadata.jl, MetadataArrays.jl, MetaGraph.jl, FieldMetadata.jl, FieldProperties.jl, etc.). I don't think an interface like Base.Docs is flexible enough to fit many of the potential uses of metadata.

quinnj commented 4 years ago

@Tokazama can you explain a little more why you think the Base.Docs approach wouldn't be flexible enough? In terms of approach, it's more of an implementation detail: the user interface would still be metadata(x), it would just retrieve the metadata from a per-module store instead of retrieving it from the object itself.

Tokazama commented 4 years ago

It wouldn't carry any type information so if someone did use something like a NamedTuple it wouldn't really help any.

quinnj commented 4 years ago

Sorry, I'm still not following the concern. Why/where would type information be important? The discussion has revolved around metadata(x) returning any kind of object that implements the AbstractDict interface, so in practice, you would use metadata like:

meta = metadata(x)

# see metadata keys
keys(meta)

# iterate over metadata key-value pairs
for (k, v) in meta

end

# check if a specific metadata key is present
haskey(meta, :specific_key)

So depending on whether metadata(x) returned a Dict, or NamedTuple, you would have different implementations of these methods, but the interface is still the same.
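One detail worth checking when writing code against both containers: `keys`, `haskey` and indexing behave the same for a `Dict` and a `NamedTuple`, but iterating a `NamedTuple` directly yields only its values, so generic code is safer going through `pairs`. A small sketch, where `metadata_pairs` is a hypothetical helper:

```julia
# Hypothetical helper: collect metadata as key => value pairs for either container.
function metadata_pairs(meta)
    out = Pair{Symbol,Any}[]
    # `pairs` yields key => value for both Dict and NamedTuple; iterating a
    # NamedTuple directly yields only its values.
    for (k, v) in pairs(meta)
        push!(out, k => v)
    end
    return out
end

as_dict = Dict{Symbol,Any}(:label => "Income")
as_tuple = (label = "Income",)
```

Both `metadata_pairs(as_dict)` and `metadata_pairs(as_tuple)` produce the same list of pairs, so the rest of the code does not need to know which container it received.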

We should probably require that the object returned be AbstractDictLike{Symbol, Any}, i.e. require that metadata keys be Symbol; does that sound reasonable or too restrictive?

bkamins commented 4 years ago

Actually I prefer metadata to be flexible and type unstable.

Apart from convenience it is a clear signal for the developers not to use metadata to encode program logic - Julia provides other means to do this efficiently.

Metadata, as I think about it now (but my opinions evolve based on the comments we get here as the design here is not an easy decision) should be for lightweight things like descriptive strings or maybe some hints how output should be formatted (as working with IOContext is hard for most users and in some cases it is not flexible enough as IOContext is not always usable - e.g. you cannot replace stdout with a custom IOContext AFAIK).

Tokazama commented 4 years ago

Actually I prefer metadata to be flexible and type unstable.

I'm not against this being the case for specific implementations like what might be done in DataFrames but I don't think it should be the only option.

pdeffebach commented 4 years ago

Apart from convenience it is a clear signal for the developers not to use metadata to encode program logic - Julia provides other means to do this efficiently.

I agree. You don't want too many interfaces relying on specifically named metadata fields to create unnecessarily complicated features.

So: df.col will keep the metadata of col and similarly df.col = x will make col have the metadata of x.

I don't fully understand this. IMO metadata should be attached to a data frame and df.col should always return a vector without anything else attached to it.

I agree with @nalimilan, it would be annoying to have metadata disappear with copying, consider something as simple as

df.income = clean_vec(df.income)

clean_vec takes in a Vector{Float64} and for whatever reason has a concrete type signature. So no meta-data is added.
If I have a label for df.income, say "Personal Income", I don't want this to disappear. Keeping track of these operations would get tiresome.

This sort of global Dict that contains metadata is basically how R's metadata system works, and I've always found it very useless, partly because the metadata disappears.

quinnj commented 4 years ago

@pdeffebach, I think there's a lot more flexibility in the Base.Docs system than just thinking of it as a "global Dict". With Julia's rich type system, macros, etc. I think we could easily accommodate scenarios where you want to attach metadata to a DataFrame column, and not the Vector itself, but to a named column of the DataFrame, which would "stick" beyond transformations. And as has been mentioned, there are a number of scenarios where you don't want metadata to stick around too much, if you're creating new objects and such.

As I've played around with ideas/implementations, I just don't see a realistic way to make a system that is general enough to be widely used that relies on either wrapper objects or requiring metadata fields. It just doesn't scale. The doc system, however, is extremely rich and accomplishes its goal/job very well, IMO; attaching extra information to types, variables, fields, etc. Part of my experience/opinion here is coming from thinking through the entire data ecosystem, not just DataFrames. While I think DF is one of the primary targets for a metadata system, I also want to ensure that other table types, formats, and objects can also take advantage of a metadata system to enhance objects.

nalimilan commented 4 years ago

I agree with @nalimilan, it would be annoying to have metadata disappear with copying, consider something as simple as

df.income = clean_vec(df.income)

1. `clean_vec` takes in a `Vector{Float64}` and for whatever reason has a concrete type signature. So no meta-data is added.

2. If I have a label for `df.income`, say `"Personal Income"`, I don't want this to disappear. Keeping track of these operations would get tiresome.

@pdeffebach What kind of operations would be performed within clean_vec? Apart from copy, which you could replace with (say) copywithmetadata, I'm not sure many operations should/could preserve it. In general I don't see how we could design a system in which both 1) metadata is preserved on copy and 2) metadata can be added to any object (e.g. Vector). What we can do, though, is to have DataFrames operations copy metadata automatically where it makes sense -- but that doesn't include custom functions since we have no way of knowing if you are just cleaning the income value or creating a completely new thing.

Or maybe in your example you meant that assigning a new vector to an existing column via df.income = v should preserve the metadata of the column? That makes some sense but could be problematic if you really want to replace the column (and you may even not know some previous column existed with that name).

pdeffebach commented 4 years ago

@pdeffebach What kind of operations would be performed within clean_vec? Apart from copy, which you could replace with (say) copywithmetadata, I'm not sure many operations should/could preserve it.

Yes, even replace performs a copy. From a user's perspective, this kind of behavior would require a lot of defensive programming and reasoning just to get the persistence in metadata that imo is the most intuitive.

we have no way of knowing if you are just cleaning the income value or creating a completely new thing.

In my view, metadata isn't a property of Vectors, it's a property of named columns in a data frame. Exactly what object df.income points to in memory is an implementation detail, the point is that because I assigned it to the column :income, I want metadata(df, :income) to return "Personal Income".

Ultimately, if the user wanted different metadata for df.income = clean_vec(df.income), they would have assigned it to a new column.

I understand the appeal for a solution that the whole data ecosystem can use, but I hope it allows for the intuitive use of metadata that Stata has, which persists across reassignment, copies, joins, etc.
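The behaviour asked for here, metadata tied to the column name and not to the vector stored under it, can be sketched with a toy table. `LabeledTable`, `setcolumn!` and `getcolumn` are hypothetical names; this is not DataFrames.jl behaviour.

```julia
# Toy table (illustrative only): metadata is keyed by column name, not by vector.
struct LabeledTable
    columns::Dict{Symbol,AbstractVector}
    colmeta::Dict{Symbol,Dict{Symbol,Any}}
end
LabeledTable() = LabeledTable(Dict{Symbol,AbstractVector}(), Dict{Symbol,Dict{Symbol,Any}}())

# Replacing a column's data leaves the name's metadata untouched.
setcolumn!(t::LabeledTable, name::Symbol, v::AbstractVector) = (t.columns[name] = v; t)
getcolumn(t::LabeledTable, name::Symbol) = t.columns[name]   # plain vector, no metadata

# Column metadata is created lazily on first access.
metadata(t::LabeledTable, name::Symbol) = get!(Dict{Symbol,Any}, t.colmeta, name)

clean_vec(v::Vector{Float64}) = round.(v)   # stands in for the user's function

t = LabeledTable()
setcolumn!(t, :income, [1234.56, 789.01])
metadata(t, :income)[:label] = "Personal Income"
setcolumn!(t, :income, clean_vec(getcolumn(t, :income)))
```

The label survives the reassignment even though `clean_vec` returned a brand-new vector, because the table never looks at the vector's identity.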

nalimilan commented 4 years ago

OK. Whatever the chosen implementation, DataFrames could certainly preserve the metadata of already existing columns in setindex!/setproperty! if we wanted. It's more a matter of deciding whether it's a good idea, but better discuss this elsewhere.

quinnj commented 4 years ago

And perhaps I wasn't super clear in my previous comment, but I was thinking along the lines of being able to do:

@meta "Personal Income" df.income

which would do a nesting metadata attachment of "Personal Income" to the income column of the df object. i.e. the metadata would be attached to the df.income array itself, but the df object would also have a link to its "children" metadata attachments.

Obviously there are some details there to work out, but I think it'd be cool to allow this kind of "metadata" rollup from children to parents.

bkamins commented 4 years ago

I think we are getting close to the design on two levels (general here and DataFrames.jl specific).

@pdeffebach - it would be great, as @nalimilan suggested, if you could comment in the DataFrames.jl (in a new issue or in the old PR related to metadata) what functionality you would like to have. I get a feeling that you have a good set of rules in mind, but we need to make them very precise (like what exactly should happen after what operations). What I mean here is that I want to avoid going into implementation of anything before there is a consensus how things should work.

For example - if I understood you correctly (but please comment on this in DataFrames.jl not here to keep this issue general) - if you have a dataframe df with a column :col then you want:

the result of df.col not to have any metadata attached
but on the other hand by doing df.col = some_new_value then the metadata should be kept
given the two rules above I was not clear for example what you wanted to happen in the following cases:
- df.col2 = df.col (I guess you do not want col2 to have any metadata)
- if you then do select!(df, :col => :col2, :col2 => :col) - then still :col should have metadata and :col2 should not have metadata

pdeffebach commented 4 years ago

Created a thread for DataFrames specific discussion here.

With f(df.col), the function f doesn't know that the vector it's being passed has the name :col. The same rule should apply with metadata I think.

nalimilan commented 4 years ago

I've also filed https://github.com/JuliaData/Tables.jl/issues/176 to discuss how Tables.jl could use the general mechanism defined in this issue to allow exchanging per-column metadata between Table types. That way we can concentrate on the most general interface here.

bkamins commented 4 years ago

Arrow.jl supports Dict{String, String} on a table level and on column level. Given this I would re-surface the discussion about how metadata system for tabular data in Julia should be defined.

There are different opinions on this, so let me give my take (but I am open to other opinions).

I personally would be OK with only supporting Dict{String, String} on a table level and delegate to AbstractVector subtypes to define metadata for column vectors if needed. What are the benefits of this:

we can store the metadata in Arrow.jl without losing the information.
we do not enforce any semantic meaning to the metadata, so you can do whatever you like with it

What are the cons of this approach:

only String=>String mappings would be supported, but the question is do we really need other kind of metadata in practice (especially given that when written by Arrow.jl it would be lost)
you have to manually manage the metadata (e.g. if you stored on a table level the mapping column_name => column_description then when e.g. renaming columns it will not get automatically updated); however, I am not sure it is that useful to do it automatically, but probably Stata users can chime in here (CC @pdeffebach)

So in summary - my proposal is to be very minimal, while noting that this metadata system might be extended in the future (i.e. that something more than Dict{String, String} on table level might be supported, so relying on the exact type of metadata is discouraged). However, I believe that this way we could quickly have "some metadata", and in the future extend it.
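The "manual management" drawback of a table-level Dict{String, String} can be illustrated with a toy type. `StringMetaTable` and `rename_column!` are hypothetical names; no real package API is implied.

```julia
# Toy table (illustrative only) with string-only, table-level metadata.
struct StringMetaTable
    names::Vector{Symbol}
    meta::Dict{String,String}
end
metadata(t::StringMetaTable) = t.meta

# Renaming touches only the column names; metadata is not rewritten.
function rename_column!(t::StringMetaTable, oldname::Symbol, newname::Symbol)
    i = findfirst(==(oldname), t.names)
    i === nothing && throw(ArgumentError("no column named $oldname"))
    t.names[i] = newname
    return t
end

t = StringMetaTable([:inc], Dict("inc" => "Personal income"))
rename_column!(t, :inc, :income)

# After the rename the description is still filed under the old name.
stale = haskey(metadata(t), "inc") && !haskey(metadata(t), "income")

# The user has to move the entry by hand to keep it in sync.
metadata(t)["income"] = pop!(metadata(t), "inc")
```

Nothing in the table type knows that the key "inc" referred to a column, which is exactly why the proposal attaches no semantic meaning to the metadata.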

I am not very attached to this idea, but what I would love to have is something we can agree on and is easy enough, so that we can provide the functionality in a reasonable time frame. However, if someone has a superior proposal that is consistent (and clear how metadata should be handled under transformations) I would love to hear what it would be (I know that we already had many discussions about it - I think that the way forward is just to put some "end to end" proposals on the table and discuss their pros and cons).

Tokazama commented 4 years ago

I made the Metadata.jl package for this. It provides syntax for binding metadata directly through a struct or in a global variable.

bkamins commented 4 years ago

So e.g. for DataFrame (it is immutable) you would use the attach_metadata function - right?

Tokazama commented 4 years ago

You can use either because each DataFrame has a unique objectid. I do have some support for dimension-specific metadata, but I haven't done anything specific to tables because I wanted it to be as generic and flexible as possible.

Here's a quick example demonstrating globally stored metadata for a DataFrame.

julia> df = DataFrame(x = 1:2, y = 3:4)
2×2 DataFrame
│ Row │ x     │ y     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 3     │
│ 2   │ 2     │ 4     │

julia> @attach_metadata(df, Dict(:types => [Int, Int]))
Dict{Symbol,Any} with 1 entry:
  :types => DataType[Int64, Int64]

julia> df
2×2 DataFrame
│ Row │ x     │ y     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 3     │
│ 2   │ 2     │ 4     │

julia> @metadata(df)
Dict{Symbol,Any} with 1 entry:
  :types => DataType[Int64, Int64]

julia> df2 = DataFrame(y = 1.0:2.0, z = 3.0:4.0)
2×2 DataFrame
│ Row │ y       │ z       │
│     │ Float64 │ Float64 │
├─────┼─────────┼─────────┤
│ 1   │ 1.0     │ 3.0     │
│ 2   │ 2.0     │ 4.0     │

julia> @attach_metadata(df2, Dict(:types => [Float64, Float64]))
Dict{Symbol,Any} with 1 entry:
  :types => DataType[Float64, Float64]

julia> @metadata(df)
Dict{Symbol,Any} with 1 entry:
  :types => DataType[Int64, Int64]

julia> @metadata(df2)
Dict{Symbol,Any} with 1 entry:
  :types => DataType[Float64, Float64]

nalimilan commented 4 years ago

Metadata.jl is interesting! Though for DataAPI/Tables.jl we don't necessarily have to choose a particular implementation: all we need to do is define an API that particular table types can implement, using Metadata.jl or other solutions.

What about starting with the following minimal API in Tables.jl:

Table-level metadata: metadata(tbl) has to return Union{Nothing, AbstractDict{String, String}}. We can make this more general (now or later) by allowing any dict-like object, but that doesn't change things radically. The default implementation in Tables.jl returns nothing.
Column-level metadata: metadata(tbl, col::Union{Integer, Symbol}) also has to return Union{Nothing, AbstractDict{String, String}}. Tables can implement this by storing column-level metadata either in the table or in vectors themselves (e.g. using Metadata.jl). The default implementation in Tables.jl returns nothing.
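
A rough sketch of how the two points above could look. MetaAPISketch stands in for the proposed (not yet existing) Tables.jl functions, and LabeledTable is a hypothetical table type that opts in; both are assumptions made for illustration.

```julia
# Hypothetical stand-in for the proposed Tables.jl functions.
module MetaAPISketch

# Fallbacks: by default a table, and any of its columns, has no metadata.
metadata(tbl) = nothing
metadata(tbl, col::Union{Integer,Symbol}) = nothing

end # module

# Hypothetical table type that stores both levels of metadata itself.
struct LabeledTable
    columns::NamedTuple
    meta::Dict{String,String}
    colmeta::Dict{Symbol,Dict{String,String}}
end

# Table-level metadata.
MetaAPISketch.metadata(t::LabeledTable) = t.meta

# Column-level metadata by name; nothing if the column has none.
MetaAPISketch.metadata(t::LabeledTable, col::Symbol) = get(t.colmeta, col, nothing)

# Column-level metadata by position: translate the index to a name first.
MetaAPISketch.metadata(t::LabeledTable, col::Integer) =
    MetaAPISketch.metadata(t, keys(t.columns)[col])

t = LabeledTable((a = [1, 2], b = [3, 4]),
                 Dict("source" => "survey"),
                 Dict(:a => Dict("label" => "Age")))
```

Here the table stores column metadata itself; a type could equally delegate to the column vectors, which is the alternative discussed next.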

An alternative proposal would be to completely skip the second point, and instead of metadata(tbl, col) attach column-level metadata to vectors themselves and require using metadata(Tables.getcolumn(tbl, col)). The constraint with this is that it would require that we agree on a common system for metadata that works for vectors (like Metadata.jl) and that tables cannot store metadata at the table level if they want. The advantage would be that people are guaranteed to be able to retrieve metadata when they only have vectors (without the table).

Once we agree on a minimal common API, we can discuss what should happen e.g. when concatenating, joining or transforming data frames at https://github.com/JuliaData/DataFrames.jl/issues/2276.

bkamins commented 4 years ago

Union{Nothing, AbstractDict{String, String}} makes sense and we could say that in the future Union{Nothing, AbstractDict} might be supported, so people should not rely on the parameters of AbstractDict in dispatch (or maybe we could already assume AbstractDict only and just Arrow.jl would convert what it gets to String when serializing).

metadata(tbl) - I think this is uncontroversial. And in general it does not have to be Tables.table but any object (just define metadata(::Any) = nothing). This would provide a nice fallback for the second method in case a vector defined metadata (so the table could fetch it).

metadata(tbl, col::Union{Integer, Symbol}) - here the question is whether AbstractString should also be allowed. I think it is non-problematic to have this method in general, since until we support it we could just return nothing.
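
To illustrate the metadata(::Any) = nothing fallback and a table fetching metadata from its vectors, here is a small sketch. MetaVector is a hypothetical vector type, and getproperty is used in place of Tables.getcolumn to keep the example free of dependencies.

```julia
# Hypothetical vector type that carries its own metadata.
struct MetaVector{T} <: AbstractVector{T}
    data::Vector{T}
    meta::Dict{String,String}
end

# Minimal AbstractVector interface.
Base.size(v::MetaVector) = size(v.data)
Base.getindex(v::MetaVector, i::Int) = v.data[i]

# Fallback: any object has no metadata unless its type says otherwise.
metadata(::Any) = nothing
metadata(v::MetaVector) = v.meta

# Column-level lookup delegates to whatever the column vector reports.
metadata(tbl, col::Symbol) = metadata(getproperty(tbl, col))

tbl = (a = MetaVector([1, 2], Dict("unit" => "cm")), b = [3, 4])
```

A plain Vector column simply falls through to the fallback and returns nothing.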

Tokazama commented 4 years ago

An alternative proposal would be to completely skip the second point, and instead of metadata(tbl, col) attach column-level metadata to vectors themselves and require using metadata(Tables.getcolumn(tbl, col)).

Unless metadata(::T, ::Symbol) is specifically defined for T, the default is to just grab all the metadata and, if it is an AbstractDict, use getindex and otherwise use getproperty (as defined here).
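
The default described above could be sketched roughly as follows. This is an illustration of the described behavior, not the actual Metadata.jl source; Annotated is a hypothetical type.

```julia
# Hypothetical type holding metadata of any kind.
struct Annotated{M}
    meta::M
end

# Return all metadata attached to the object.
metadata(x::Annotated) = x.meta

# Default keyed lookup: index into dicts, use property access otherwise.
function metadata(x, k::Symbol)
    m = metadata(x)
    return m isa AbstractDict ? m[k] : getproperty(m, k)
end

d = Annotated(Dict(:name => "from dict"))
n = Annotated((name = "from property",))

from_dict = metadata(d, :name)
from_prop = metadata(n, :name)
```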

This was the discourse announcement, in case that's any help.

Once we agree on a minimal common API, we can discuss what should happen e.g. when concatenating, joining or transforming data frames at JuliaData/DataFrames.jl#2276.

I have some internal traits for copying/sharing/dropping metadata that might be useful for this sort of thing when the time comes.

quinnj commented 4 years ago

Metadata.jl is indeed interesting and in the direction I had in mind w/ the Arrow.jl methods (which I really just threw together in order to support the arrow specification, leaving the "generalizing" to a later project). I don't love the use of macros because they're not really doing anything? I feel like just having Metadata.get(obj) and Metadata.set(x, meta) would be simpler/clearer? I can open an issue at Metadata.jl repo to discuss the details there more.

@nalimilan, what's the advantage of having a Tables.jl-level API for metadata as opposed to just pointing people to something like Metadata.jl?

bkamins commented 4 years ago

Having a Tables.jl-level API generates fewer dependencies, e.g. for Arrow.jl I think.

nalimilan commented 4 years ago

metadata(tbl, col::Union{Integer, Symbol}) - here the question is whether AbstractString should also be allowed. I think it is non-problematic to have this method in general, since until we support it we could just return nothing.

@bkamins That's a minor point I'd say. We should be consistent across Tables.jl, so better discuss getcolumn, etc. at the same time, and separately from this issue which is already complex enough.

An alternative proposal would be to completely skip the second point, and instead of metadata(tbl, col) attach column-level metadata to vectors themselves and require using metadata(Tables.getcolumn(tbl, col)).

Unless metadata(::T, ::Symbol) is specifically defined for T, the default is to just grab all the metadata and, if it is an AbstractDict, use getindex and otherwise use getproperty (as defined here).

@Tokazama Note that in my proposal Tables.metadata would be a different (and unexported) function from Metadata.metadata. Tables.metadata(tbl, col) would retrieve metadata for column col, while metadata(tbl)[key] would access table-level attribute key. Otherwise there could be conflicts e.g. if a column is called name and you want to store a table-level attribute called name.

@nalimilan, what's the advantage of having a Tables.jl-level API for metadata as opposed to just pointing people to something like Metadata.jl?

@quinnj The reason is that we need to define an API to access column-level metadata. I agree something like Metadata.jl is enough if we decide that column-level metadata should be attached to vector objects themselves rather than stored in the table (my second proposal). But I think @pdeffebach had arguments against it.

pdeffebach commented 4 years ago

Having metadata be persistent across joins and reassignment is a crucial feature.

If there was a Tables.jl-level API, we could make assurances about the persistence of metadata. @quinnj if you aren't too familiar with Stata, this is basically the model for the behavior I would like metadata to have. In Stata, it's all about persistence.

My argument against having metadata attached to vector objects is that

1. We don't want df.a to give you some special vector type that has metadata attached to it
2. We copy in a lot of places, i.e. after every select and transform. So the notion of metadata being attached to a particular object gets complicated.

I'm going to cc @matthieugomez here, since he is someone familiar with Stata who has probably also thought about this in Julia.

Tokazama commented 4 years ago

I don't love the use of macros because they're not really doing anything?

Similar to @doc, they point to the module where @attach_metadata was called. If you have a type that will always store metadata in the same module you could hard code that in and use metadata.
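
As a rough illustration of why a macro is involved, the sketch below keeps one identity-keyed store per calling module. The names REGISTRIES, registry, @attach_meta, and @meta are hypothetical and this is not how Metadata.jl is actually implemented; it only shows the general idea of a macro capturing the module it is called from.

```julia
# One identity-keyed store per module (hypothetical layout).
const REGISTRIES = Dict{Module,IdDict{Any,Any}}()

# Fetch the store for a module, creating it on first use.
registry(m::Module) = get!(() -> IdDict{Any,Any}(), REGISTRIES, m)

# __module__ is the module in which the macro is invoked, so the
# metadata lands in the caller's store without the caller naming it.
macro attach_meta(x, meta)
    return :(registry($__module__)[$(esc(x))] = $(esc(meta)))
end

macro meta(x)
    return :(get(registry($__module__), $(esc(x)), nothing))
end

v = [1, 2]
@attach_meta(v, Dict(:unit => "cm"))
m = @meta(v)             # looked up by object identity
other = @meta([1, 2])    # an equal but distinct array has no entry
```

A plain function could do the same if the module were passed explicitly or hard-coded.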

nalimilan commented 4 years ago

1. We don't want `df.a` to give you some special vector type that has metadata attached to it

@pdeffebach With @attach_metadata it wouldn't be a special type, just a plain Vector with metadata stored in a global dict.

2. We copy in a lot of places, i.e. after every `select` and `transform`. So the notion of metadata being attached to a particular object gets complicated.

We could copy metadata in select and transform. Overall I think the question of persistence should be addressed by particular implementations (e.g. DataFrames). Tables.jl doesn't care about that, it just has to allow you to pass metadata along with tables.

pdeffebach commented 4 years ago

Let's say I have

using DataFrames, Pipe  # Pipe.jl provides @pipe

df = DataFrame(a = [1, 2], b = [3, 4])
# hypothetical setter attaching a label to each column
metadata!(df, :a => "A", :b => "B")
@pipe df |>
    transform(_, [:a, :b] => ByRow(+) => :c) |>
    select(_, :b, :c)

With the metadata in a global Dict, how would this work? I'm confused about what the keys and the values are.

What happens when the columns are copied inside the transform? You could imagine this global dict getting very large if we have thousands of columns and a lot of transform calls in a pipe.
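
The concern can be made concrete with a small sketch, assuming the global store is an IdDict keyed by the column vectors themselves (GLOBAL_META is a hypothetical name, and this is an assumption about the storage scheme, not a description of any package):

```julia
# Hypothetical global store keyed by object identity.
const GLOBAL_META = IdDict{Any,Any}()

a = [1, 2]
GLOBAL_META[a] = "A"

# A copy, as made inside a transform-style operation, is a new object.
c = copy(a)
found = haskey(GLOBAL_META, c)    # the copy has no entry of its own

# Propagating the label means one more entry per copy, and the dict
# holds a reference to every key, so old columns are never freed.
GLOBAL_META[c] = GLOBAL_META[a]
n = length(GLOBAL_META)
```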

JuliaData / DataAPI.jl

`metadata` method #22