JuliaData / DataAPI.jl

A data-focused namespace for packages to share functions
Other
33 stars 13 forks source link

a few concerns about metadata methods #52

Closed ExpandingMan closed 2 years ago

ExpandingMan commented 2 years ago

I've started implementing metadata and colmetadata for Parquet2.jl. I have a few thoughts, sorry for not bringing this up when this was being discussed but there was a lot of conversation and I tuned out at some point.

I realize that opening this issue might seem like more of an annoyance than anything else since the ship has sailed and now we'd have to deal with breakage. However there might still be room to add a few methods such as, perhaps

metadata(x)
metadata(x, k, default)
metadata!(x, k, default)
metadata!(f, x, k)

Perhaps it's already fine for packages to include these but in that case perhaps it should be documented.

ExpandingMan commented 2 years ago

I know this is abusing the interface, but very little since Int is unnecessarily restrictive. Case in point about defining the methods

  Expression: Parquet2.colmetadata(ds[1], :ints, "mac") ≡ nothing
  MethodError: colmetadata(::Parquet2.RowGroup{Parquet2.FileManager{FilePathsBase.PosixPath}}, ::Symbol, ::String) is ambiguous. Candidates:
    colmetadata(::T, ::Symbol, ::AbstractString; style) where T in DataAPI at /home/expandingman/.julia/packages/DataAPI/pEtqn/src/DataAPI.jl:378
    colmetadata(rg::Parquet2.RowGroup, col::Union{Integer, Symbol}, k::AbstractString; style) in Parquet2 at /home/expandingman/.julia/dev/Parquet2/src/schema.jl:1308
  Possible fix, define
    colmetadata(::T, ::Symbol, ::AbstractString) where T<:Parquet2.RowGroup
ExpandingMan commented 2 years ago

This is implemented in Parquet2.jl. I wanted to point it out before I tag to give you guys a chance to object to my extra methods.

See here, here and here.

bkamins commented 2 years ago

currently no way as part of the API to fetch with a default such as in Base.get.

We considered this. As you have commented it is currently doable by double lookup. Therefore we decided that this functionality can be added later in a non-breaking way if users report that this is needed.

If we add it I would propose that this is handled by the default keyword argument (not positional argument), as it will be more explicit in the code. I agree with your implementation that if style is required then :default should be returned.

We can add it now if we want. @nalimilan - what do you think?

There isn't a clean way in the API of fetching all metadata

This is the same reason as above. It can be added later in a non-breaking way if it is needed. We can define metadata(obj; style::Bool) and colmetadata(obj, col; style::Bool) as functions which are required to return readable iterators of key => value pairs or key => (value, style) pairs (depending on the style kwarg). In particular it can be just returning the object from the parent, but does not have to be. Also colmetadata(obj; style::Bool) can be added that returns a readable iterator of col => colmetadata(obj, col; style=style) vaules.

We can add it now if we want. @nalimilan - what do you think?

This is surely not a typical case, but it seems worth pointing out that Tables.jl isn't enough to specify what colmetadata should do.

Tables.jl defines:

columns = Tables.columns(x)
column = Tables.getcolumn(columns, col)

the metadata returned by colmetadata should be metadata that corresponds to the returned column since this is what Tables.jl user sees. Now, indeed it is a hard choice what to do if internally package creates the object returned in Tables.getcolumn(columns, col) by merging different columns that can have different metadata (as this is what happens in Parquet if I understand correctly - right?). However, I think that the decision what to do in this case should be made on package level. User should see what metadata is reasonably assigned to Tables.getcolumn(columns, col). I think, as you commented, that doing metadata merging should be done in this case on package level (a natural way would be to return only the metdata that is defined in the same way in all chunks)

Defining ArgumentError fallbacks seems a bit dubious. These clearly should be MethodError if there is no reasonable fallback.

Maybe you are right. Defining the methods in DataAPI.jl indeed leads to method ambiguities if methods in a package implementing the interface are not implemented carefully enough (as you have noticed, I had the same problem in DataFrames.jl).

Fortunately, following semver rules changing exception type is not considered breaking. Therefore, if we decide to remove the default implementations in the next release this will not be a breaking change. However, I would say that this is the most urgent decision to be made, because while this is non breaking still it is better to change the exception type sooner than later if we want to go this way.

Again, @nalimilan - what do you think?

metadata!(x, k, default)
metadata!(f, x, k)

I am not clear what you propose here (these are methods for setting metadata as they have ! suffix) - was this a typo?

I know this is abusing the interface, but very little since Int is unnecessarily restrictive.

Int is required, because Tables.jl requires Int only and does not allow any Integer (e.g. it does not allow Bool). However, I guess, when we remove the default implementations, as you propose above, then these ambiguities will be resolved without having to add extra methods.


In summary:

But let us wait a bit for others to comment.

nalimilan commented 2 years ago

I'm fine with dropping fallback methods if they create ambiguities. (I think there's now a mechanism to print a custom message for certain MethodErrors, though that's probably overkill here and could be added later anyway.)

Other additions seem fine, we can add it as soon as somebody needs them (not sure whether @ExpandingMan is in that position or not).

metadata!(x, k, default)
metadata!(f, x, k)

I am not clear what you propose here (these are methods for setting metadata as they have ! suffix) - was this a typo?

IIUC these would be equivalent to get!. And indeed if we want to support the do syntax f has to be the first positional argument, not a keyword argument.

(I'm still kind of afraid that we are reinventing the AbstractDict API and that it would have been simpler to require implementations to expose an AbstractDict type to users instead, but well...)

bkamins commented 2 years ago

The problem with AbstractDict API is that it is not well documented and thus difficult to implement correctly. I would not add these get!-equivalent methods for now (I do not think they are very useful in metadata context). I am OK with convenient get-equivalent. Also I would not support the do syntax.

bkamins commented 2 years ago

@ExpandingMan - the idea was to limit the complexity of methods that packages are required to implement to make it easier to opt-in for their maintainers. The non-breaking proposals that you listed were considered, but we thought they can be added later.

ExpandingMan commented 2 years ago

Alright, so it sounds like we are mostly in agreement concerning the points I made in my OP. I suppose there's nothing to be done in the general case about my 3rd point about the relationship between tables and columns not always being so starightforward. Indeed, it makes things pretty ugly in Parquet2.jl but I don't see what the alternative would be.

@ExpandingMan - the idea was to limit the complexity of methods that packages are required to implement to make it easier to opt-in for their maintainers. The non-breaking proposals that you listed were considered, but we thought they can be added later.

I completely agree with this approach and it is arguably the standard approach to creating interfaces in Julia, but there weren't any implemented fallback methods whatsoever so I was a bit puzzled about what the intentions were. Perhaps they simply were not added yet.

If we add it I would propose that this is handled by the default keyword argument (not positional argument), as it will be more explicit in the code. I agree with your implementation that if style is required then :default should be returned.

That seems a bit tricky due to the fact that keyword arguments must already have a default (at least in this case). So, if you were to e.g. choose nothing, should nothing always be the default? Then how do you get back the error-throwing behavior?

I am more convinced than ever that the methods should be removed so that it gives MethodError rather than ArgumentError. Furthermore, as I understand it, the reason that there are e.g. Base.getindex methods that take fully-specified Int arguments is because Base also provides a bunch of generic methods so you want it to pick the Int methods when possible. In this case, we don't have a bunch of generic methods, so choosing Int for the methods in DataAPI is inappropriate. I suppose this is a moot point if they are removed.

bkamins commented 2 years ago

That seems a bit tricky due to the fact that keyword arguments must already have a default (at least in this case).

That is why I have written in #53 that I have thought how to do it as I had exactly the same first reaction 😄. The solution would be to define a special type struct DataAPI.NoDefaultMetadata end that would be used as a default and then would throw an error. I.e. the default would be default=DataAPI.NoDefaultMetadata().

@nalimilan - do you think positional or keyword argument should be preferred for default value. Pros of default as positional: it is consistent with dictionary API and a bit shorter to type Pros of default as keyword: it is more explicit in the code what the passed argument means (especially for readers that are not using metadata API often - which is likely the case).