Closed Tokazama closed 2 years ago
Yes, GC of this dict is an issue I think.
Yes, using a global dict will certainly be slower and less memory-efficient than storing column-level metadata in the table (especially since in that case we can store metadata using a vector with one entry for each column, and use the data frame index to map names to positions, like at https://github.com/JuliaData/DataFrames.jl/pull/1458). But I wonder whether it really matters in practice: if you need to copy the column vector anyway, copying the metadata should be cheap in comparison.
Maybe we could also add a finalizer to column vectors when adding metadata, so that we can delete the entry from the global dict when the object it destroyed? @Tokazama Have you considered that?
Maybe we could also add a finalizer to column vectors when adding metadata, so that we can delete the entry from the global dict when the object it destroyed?
I don't think that's possible without type piracy. There is no unique type provided by "Metadata" that wraps an instance that is attached to global metadata when using @attach_metadata(x, meta)
. The most useful part of using global metadata is that it has absolutely no effect on dispatch so it can't possibly slow down your code, increase latency through codegen, etc.. This also means you cant propagate metadata by dispatching on a type that binds metadata to your table. If that's what you want, use attach_metadata
, so that metadata is directly bound to your table with a wrapper that stores metadata in its structure.
If you want to do what @quinnj is suggesting, something like this should work...
function Metadata.global_metadata(tbl::MyTableType, column_name, module_name)
return Metadata.global_metadata(getproperty(tbl, column_name), m)
end
...redirecting @metadata(tbl, column_name)
to the relevant column.
Similary, you can do this if you expect your columns to wrap metadata.
Metadata.metadata(tbl::MyTableType, column_name) = metadata(getproperty(tbl, column_name), m)
Handling persistence of metadata without a wrapper type (i.e. global metadata) would just require actively using @share_metadata/@copy_metadata
. You might want this to be optional (e.g., some_method(args...; share_metadata)
so that you don't hurt performance by searching for metadata in every instance in every method.
I don't think that's possible without type piracy. There is no unique type provided by "Metadata" that wraps an instance that is attached to global metadata when using
@attach_metadata(x, meta)
. The most useful part of using global metadata is that it has absolutely no effect on dispatch so it can't possibly slow down your code, increase latency through codegen, etc..
Adding a finalizer doesn't require having a special type AFAICT. You can just call finalizer(f, obj)
.
The concern about performance is that in DataFrames transform
copies all columns, so if we want to preserve metadata we would have to also add attach it to the new vectors, adding them to the global dict. If we don't remove metadata which has been attached to objects that have been destroyed by GC from the dict, it will grow indefinitely, which can be a problem.
Adding a finalizer doesn't require having a special type AFAICT. You can just call finalizer(f, obj).
Well, that's extremely good to know. I'm trying to add that now but it has the caveat that it only works with mutable structs. Any suggestions on getting this to work with something like DataFrame
?
I think this discussion has surpassed my technical knowledge but as the co-author of https://github.com/JuliaData/DataFrames.jl/pull/1458 (Milan made the design), I like it's implementation. It's super transparent, I could make PRs to it, and users can understand it with a conceptual model. It's just a bit scary for me to have metadata be implemented by a global dict that is invisible to the user, but that could just be my lack of technical knowledge.
I do not think in DataFrames.jl we have to use a default metadata mechanism - we can do whatever we like. That is why we are discussing it in DataAPI.jl as I would like first to agree on the API of getting metadata and, if possible for setting metadata (but this is less crucial as I believe different table types might provide custom mechanisms for setting the metadata).
I think it is important to keep the API and the implementation separate, as otherwise we might run into problems in the future that might be hard to envision currently. Metadata.jl is very nice but it should be an opt-in I think, i.e. if some table type likes Metadata.jl it can start depending on it; but it should not be enforced.
Yes, keeping API and implementation separate is usually a good thing. But a difficulty here is that if we don't add Tables.metadata(tbl, col)
to the API, then the only way to support column-level metadata is to attach them to column vectors themselves. And that's only possible if we either 1) use a global dict like Metadata.jl, or 2) wrap vectors in a custom type (which is a no-go IMO). Only Tables.metadata(tbl, col)
allows storing per-column metadata in the DataFrame itself (with the drawback that it cannot be retrieved if you only have the vector; not sure whether it's a problem).
I think having Tables.metadata(tbl, col)
is not a problem to have.
The default implementations could be:
Tables.metadata(tbl, col) = Tables.metadata(Tables.getcolumn(tbl, col))`
Tables.metadata(tbl) = nothing
which would also cover the case of a default table-level metadata.
Now Tables.metadata(tbl, col)
can be left as is, or if a table type has some way of keeping metadata for columns on a table level then simply Tables.metadata(tbl, col)
can have a special method added.
In particular:
Tables.metadata(tbl::AbstractDataFrame, col) = # some custom implementation
Tables.metadata(tbl::AbstractDataFrame) = # some custom implementation
can use a completely different code path.
The only problem to solve is if both vector and table define metadata for column which should take the precedence, but this should be solved at AbstractDataFrame
implementation level.
In my opinion, metadata only makes sense at the level of table. Arrays should not have metadata themselves. i.e.
x = df.x
x
should have no metadata attached to it, since any metadata can only be understood in the context of the table which it came from. Since x
now lives on its own, separated from the DataFrame, it's not worth having any metadata attached to it.
In my opinion, metadata only makes sense at the level of table.
I agree that it is also my use case. However, we should design a flexible system that would fit different use cases. I can imagine that people might want to attach metadata to anything in general (this is what Metadata.jl provides now).
Note that in order to have column-level metadata you would have to opt-in for this (normal Vector
s do not have metadata). So why disallowing this if someone wants to do it?
Recently we had a similar discussion related to AbstractMatrix
being or not being a table. We decided to go for a flexible design allowing custom matrix types to have a different table representation than the default one (and I am OK with this, although I have used DataFrames.jl for years and always converted matrices to tables in a way that preserved shape).
We decided to go for a flexible design allowing custom matrix types to have a different table representation than the default one
This is definitely the way to go. I'm currently using this for graphs, tables, and arrays. Performance and storage needs are different for each of these, but it's nice to be able to use a predictable interface for accomplishing this.
For example, you could do this in DataFrames.jl
struct DataFrameColumnMetadata{T<:AbstractDataFrame} <: AbstractDict{Symbol,Any}
tbl::T
end
function metadata(x::DataFrameColumnMetadata, k)
c = getcolumn(x.tbl, k)
if has_metadata(c)
# indicates the metadata was not found without throwing an error or interfering
# with metadata that my use `nothing` or `missing` as a meaningful value.
return Metadata.no_metadata
else
return metadata(c)
end
end
function metadata(tbl::AbstractDataFrame)
if has_metadata(tbl)
return metadata(tbl)
else
return DataFrameColumnMetadata(tbl)
end
end
But now every call to copy
, join
, select
, etc. needs to look up a global dictionary about metadata, right? It's hard to imagine this scaling well.
needs to look up a global dictionary about metadata
There's a lot of flexibility here so that this doesn't need to be decided here.
struct DataFrame <: AbstractDataFrame
columns::Vector{AbstractVector}
colindex::Index
end
Metadata.metadata(df::DataFrame) = Metadata.global_metadata(df, Main)
struct MetaDataFrame <: AbstractDataFrame
columns::Vector{AbstractVector}
colindex::Index
metadata::Dict{Symbol,Any}
end
Metadata.metadata(df::DataFrame) = getfield(df, :metadata)
@Tokazama I don't understand how your last proposal stored column-level metadata. That's the main decision to make when designing a general API I think. Attaching metadata to the data frame itself is quite easy (either using Metadata.jl or a custom field in the struct).
It wasn't intended to illustrate anymore than that you could store metadata in an instance or global metadata. In reality you would want to ensure that the keys in the metadata correspond to columns (e.g., k
in metadata(tbl, k)
corresponds to a column).
@Tokazama I just saw you recently removed support for global metadata in Metadata.jl (https://github.com/Tokazama/Metadata.jl/commit/e88941c85ba6d9e697b19a4b888a3351ddbe9852). AFAICT this means there's no way to attach metadata to arbitrary objects without wrapping them in a new type. Is that right? This would be unfortunate as it was one of the main features we discussed above.
I can add it back in. Im still ironing out some details before releasing the next version. The new ability to set variables in modules makes it easier to do this sort of thing without macros.
We can usually correspond metadata to the the data's self, values, indices/axes, or dimensions.
In terms of propagating metadata, I think we've mainly discussed copy, share, and drop as options.
The first two options only work if the metadata refers to the data's self or when following some method that copies the entirety of the data as is.
Indexing, dropping dimensions, permutation, and reduction all change how indices/axes metadata should be propagated (often also dimensional metadata).
This isn't even addressing how two datasets' metadata interact (e.g., cat
, merge
).
However, I think these provide enough context for most situations that some form of the following would be useful
is_selfmeta(m)
: is m
metadata that corresponds to the entirety of it's corresponding data?is_axesmeta(m)
: is m
metadata that corresponds to the axes of it's corresponding data?is_valmeta(m)
: is m
metadata that corresponds to the values of it's corresponding data?is_dimmeta(m)
: is m
metadata that corresponds to dimensions of it's corresponding data?should_copy_meta(m)
: should m
be copied on propagation?should_share_meta(m)
: should m
be shared on propagation?should_drop_meta(m)
: should m
be dropped?This can make the ever branching set of possibilities with metadata far more manageable.
function index_metadata(m, inds...)
if should_drop_meta(m)
return nothing
else
f = should_copy_meta(m) ? copy : identity
if is_axesmeta(m)
f(map(getindex, m, inds))
elseif is_dimmeta(m)
f(dropints(m, inds))
elseif is_valmeta(m)
f(m[inds...])
else
f(m)
end
end
end
There are certainly plenty of details that remain to make this into a robust generic interface, but I thought it might at least provide some helpful thoughts on how to proceed.
@Tokazama I just saw you recently removed support for global metadata in Metadata.jl (Tokazama/Metadata.jl@e88941c). AFAICT this means there's no way to attach metadata to arbitrary objects without wrapping them in a new type. Is that right? This would be unfortunate as it was one of the main features we discussed above.
I pulled out the globally stored metadata stuff into a new package and I'm registering it now https://github.com/JuliaRegistries/General/pull/63519.
It is often the case that one wants to attach metadata of some sort to an array/graph/etc. How do people feel about adding something basic like
metadata(x) = nothing
that can then be extended by other packages?