JuliaData / DataAPI.jl

A data-focused namespace for packages to share functions
Other
33 stars 13 forks source link

Add method for iterating metadata #51

Closed Tokazama closed 1 year ago

Tokazama commented 2 years ago

Any type that supports metadata will likely need a way to iterate over metadata for propagating metadata to new instances or merging metadata. The generic way of iterating through metadata right now is using accessing each key form the iterable returned by metadatakeys. Depending on how metadata is stored this may not be the most efficient way to accomplish this. If the method that is propagating metadata internally is managing a single collection of metadata then a unique implementation can be written for whatever method of iteration is best. However, when combining multiple sets of metadata it's not practical to have a unique method for every combination of metadata storage patterns.

A couple things to consider in supporting generic iteration over metadata:

bkamins commented 2 years ago

This (and your earlier comment in the PR) raise the following issue (which I initially wanted to avoid to ensure simplicity).

We could add a method metadatadict(obj; style::Union{Symbol, Nothing}=nothing) that would be required to return an AbstractDictionary supporting its interface - and ensuring that all functions from this interface are fast. If style is not passed then all key-value pairs are returned. If style is passed then key-value pairs are filtered to only expose those whose style matches the style kwarg.

The reason why I have not added this requirement initially was that I was afraid that package maintainers would find it difficult to correctly implement this custom type and metadatadict function (AbstractDict interface is more complex than it seems on a surface and it is not properly documented). But maybe we should go for this and have metadatadict return nothing by default so that packages can easily detect that some obj does not have this method implemented.

The alternative, as you propose, would be to add metadatapairs function (on which we would need to think in detail how to specify its API, but I think it is doable).

(the same for column-level metadata, but let us leave this discussion for later)

CC @nalimilan

Tokazama commented 2 years ago

The reason why I have not added this requirement initially was that I was afraid that package maintainers would find it difficult to correctly implement this custom type and metadatadict function (AbstractDict interface is more complex than it seems on a surface and it is not properly documented). But maybe we should go for this and have metadatadict return nothing by default so that packages can easily detect that some obj does not have this method implemented.

It's definitely tempting to have a way of directly interacting with an AbstractDict type so we don't have to ever implement something that could just fall back to a dictionary. It's also nice to have a fairly clear idea of how an object behaves when trying to write quick code.

On the other hand, I have similar apprehensions about the cost of an increasingly strict interface at this point. I recall some trial and error to get interfaces like AbstractDict to work when first learning Julia. Perhaps we could have an extra package that provides a handful of useful types that support the dictionary interface that also support features we require for the metadata interface.

bkamins commented 2 years ago

Let us wait for what @nalimilan thinks

nalimilan commented 2 years ago

Given that we have metadatakeys, having metadatapairs would be kind of logical.

metadatadict is a more tricky issue. If not all packages implement it, how useful would it be in practice? Would we need a fallback that allocates a new Dict? But in that case mutating it wouldn't affect the original object.

Perhaps we could have an extra package that provides a handful of useful types that support the dictionary interface that also support features we require for the metadata interface.

What would this provide compared with simply using Dict (which is what DataFrames.jl does for example)?

bkamins commented 2 years ago

Would we need a fallback that allocates a new Dict?

It could be a dynamic dict being a view, but I agree - as commented above - that it is tricky.

In summary - do we think adding metadatapairs makes sense now, or we add it at some later point when someone needs it?

nalimilan commented 2 years ago

As usual, I'd rather wait until somebody really needs it. ;-)

nathanrboyer commented 1 year ago

I couldn't follow all the discussion in #53, but it would make sense to me to just have metadata(df) dump all the pairs (rather than a new function). I am currently doing this which is a pain:

julia> Dict(key => metadata(df, key) for key in metadatakeys(df))
Dict{String, Float64} with 2 entries:
  "RMSE" => 0.10915
  "t₀"   => 3.974
bkamins commented 1 year ago

and the same for colmetadata(df, col) and colmetadata(df)?

CC @nalimilan