Closed ExpandingMan closed 2 years ago
I know this is abusing the interface, but very little since Int
is unnecessarily restrictive. Case in point about defining the methods
Expression: Parquet2.colmetadata(ds[1], :ints, "mac") ≡ nothing
MethodError: colmetadata(::Parquet2.RowGroup{Parquet2.FileManager{FilePathsBase.PosixPath}}, ::Symbol, ::String) is ambiguous. Candidates:
colmetadata(::T, ::Symbol, ::AbstractString; style) where T in DataAPI at /home/expandingman/.julia/packages/DataAPI/pEtqn/src/DataAPI.jl:378
colmetadata(rg::Parquet2.RowGroup, col::Union{Integer, Symbol}, k::AbstractString; style) in Parquet2 at /home/expandingman/.julia/dev/Parquet2/src/schema.jl:1308
Possible fix, define
colmetadata(::T, ::Symbol, ::AbstractString) where T<:Parquet2.RowGroup
currently no way as part of the API to fetch with a default such as in
Base.get
.
We considered this. As you have commented it is currently doable by double lookup. Therefore we decided that this functionality can be added later in a non-breaking way if users report that this is needed.
If we add it I would propose that this is handled by the default
keyword argument (not positional argument), as it will be more explicit in the code. I agree with your implementation that if style is required then :default
should be returned.
We can add it now if we want. @nalimilan - what do you think?
There isn't a clean way in the API of fetching all metadata
This is the same reason as above. It can be added later in a non-breaking way if it is needed. We can define metadata(obj; style::Bool)
and colmetadata(obj, col; style::Bool)
as functions which are required to return readable iterators of key => value
pairs or key => (value, style)
pairs (depending on the style
kwarg). In particular it can be just returning the object from the parent, but does not have to be.
Also colmetadata(obj; style::Bool)
can be added that returns a readable iterator of col => colmetadata(obj, col; style=style)
vaules.
We can add it now if we want. @nalimilan - what do you think?
This is surely not a typical case, but it seems worth pointing out that Tables.jl isn't enough to specify what
colmetadata
should do.
Tables.jl defines:
columns = Tables.columns(x)
column = Tables.getcolumn(columns, col)
the metadata returned by colmetadata
should be metadata that corresponds to the returned column
since this is what Tables.jl user sees. Now, indeed it is a hard choice what to do if internally package creates the object returned in Tables.getcolumn(columns, col)
by merging different columns that can have different metadata (as this is what happens in Parquet if I understand correctly - right?). However, I think that the decision what to do in this case should be made on package level. User should see what metadata is reasonably assigned to Tables.getcolumn(columns, col)
. I think, as you commented, that doing metadata merging should be done in this case on package level (a natural way would be to return only the metdata that is defined in the same way in all chunks)
Defining
ArgumentError
fallbacks seems a bit dubious. These clearly should beMethodError
if there is no reasonable fallback.
Maybe you are right. Defining the methods in DataAPI.jl indeed leads to method ambiguities if methods in a package implementing the interface are not implemented carefully enough (as you have noticed, I had the same problem in DataFrames.jl).
Fortunately, following semver rules changing exception type is not considered breaking. Therefore, if we decide to remove the default implementations in the next release this will not be a breaking change. However, I would say that this is the most urgent decision to be made, because while this is non breaking still it is better to change the exception type sooner than later if we want to go this way.
Again, @nalimilan - what do you think?
metadata!(x, k, default) metadata!(f, x, k)
I am not clear what you propose here (these are methods for setting metadata as they have !
suffix) - was this a typo?
I know this is abusing the interface, but very little since
Int
is unnecessarily restrictive.
Int
is required, because Tables.jl requires Int
only and does not allow any Integer
(e.g. it does not allow Bool
).
However, I guess, when we remove the default implementations, as you propose above, then these ambiguities will be resolved without having to add extra methods.
In summary:
default
is OK, but I would add it as a keyword argument; the decision to be made is if we add it to DataAPI.jl now, or we add it latermetadata(obj; style)
, colmetadata(obj, col; style)
, and colmetadata(obj; style)
are OK and they should be required to return readable iterators (the point is that it should not return mutable internal data structures as returning them can corrupt the contents if user by mistake mutates them). We just need to decide if in this case we want to allow passing style
kwarg or not (we can add this option later); again - the question is if we want to add it to DataAPI.jl now or later.ArgumentError
to MethodError
by un-defining default methods is acceptable. It will make defining methods in packages easier by avoiding dispatch ambiguities.But let us wait a bit for others to comment.
I'm fine with dropping fallback methods if they create ambiguities. (I think there's now a mechanism to print a custom message for certain MethodError
s, though that's probably overkill here and could be added later anyway.)
Other additions seem fine, we can add it as soon as somebody needs them (not sure whether @ExpandingMan is in that position or not).
metadata!(x, k, default) metadata!(f, x, k)
I am not clear what you propose here (these are methods for setting metadata as they have
!
suffix) - was this a typo?
IIUC these would be equivalent to get!
. And indeed if we want to support the do
syntax f
has to be the first positional argument, not a keyword argument.
(I'm still kind of afraid that we are reinventing the AbstractDict
API and that it would have been simpler to require implementations to expose an AbstractDict
type to users instead, but well...)
The problem with AbstractDict
API is that it is not well documented and thus difficult to implement correctly. I would not add these get!
-equivalent methods for now (I do not think they are very useful in metadata context). I am OK with convenient get
-equivalent. Also I would not support the do
syntax.
@ExpandingMan - the idea was to limit the complexity of methods that packages are required to implement to make it easier to opt-in for their maintainers. The non-breaking proposals that you listed were considered, but we thought they can be added later.
Alright, so it sounds like we are mostly in agreement concerning the points I made in my OP. I suppose there's nothing to be done in the general case about my 3rd point about the relationship between tables and columns not always being so starightforward. Indeed, it makes things pretty ugly in Parquet2.jl but I don't see what the alternative would be.
@ExpandingMan - the idea was to limit the complexity of methods that packages are required to implement to make it easier to opt-in for their maintainers. The non-breaking proposals that you listed were considered, but we thought they can be added later.
I completely agree with this approach and it is arguably the standard approach to creating interfaces in Julia, but there weren't any implemented fallback methods whatsoever so I was a bit puzzled about what the intentions were. Perhaps they simply were not added yet.
If we add it I would propose that this is handled by the default keyword argument (not positional argument), as it will be more explicit in the code. I agree with your implementation that if style is required then :default should be returned.
That seems a bit tricky due to the fact that keyword arguments must already have a default (at least in this case). So, if you were to e.g. choose nothing
, should nothing
always be the default? Then how do you get back the error-throwing behavior?
I am more convinced than ever that the methods should be removed so that it gives MethodError
rather than ArgumentError
. Furthermore, as I understand it, the reason that there are e.g. Base.getindex
methods that take fully-specified Int
arguments is because Base
also provides a bunch of generic methods so you want it to pick the Int
methods when possible. In this case, we don't have a bunch of generic methods, so choosing Int
for the methods in DataAPI is inappropriate. I suppose this is a moot point if they are removed.
That seems a bit tricky due to the fact that keyword arguments must already have a default (at least in this case).
That is why I have written in #53 that I have thought how to do it as I had exactly the same first reaction 😄. The solution would be to define a special type struct DataAPI.NoDefaultMetadata end
that would be used as a default and then would throw an error. I.e. the default would be default=DataAPI.NoDefaultMetadata()
.
@nalimilan - do you think positional or keyword argument should be preferred for default value. Pros of default as positional: it is consistent with dictionary API and a bit shorter to type Pros of default as keyword: it is more explicit in the code what the passed argument means (especially for readers that are not using metadata API often - which is likely the case).
I've started implementing
metadata
andcolmetadata
for Parquet2.jl. I have a few thoughts, sorry for not bringing this up when this was being discussed but there was a lot of conversation and I tuned out at some point.Base.get
. In many situations this means there is no way of fetching data without at least 2 lookups.Dict(k=>metadata(x, k) for k \in metadatakeys(x))
which seems a bit awkward, especially considering that in many cases the object is probably just sitting there in the first place and shouldn't have to be reconstructed.colmetadata
is more complicated than this API suggests. For example, in Parquet2 aDataset
is a table that has columns which are concatenations of sub-columns which belong ti sub-tables (which are also Tables.jl tables) calledRowGroup
. It's therefore not possible to definecolmetadata
onDataset
because it would be ambiguous which column metadata should be used (or whether it would be appropriate to merge them). This is surely not a typical case, but it seems worth pointing out that Tables.jl isn't enough to specify whatcolmetadata
should do.ArgumentError
fallbacks seems a bit dubious. These clearly should beMethodError
if there is no reasonable fallback. The most obvious consequence of this is that error handling routines might catch a wrong error here. Nothing else immediately comes to mind, though I do vaguely remember somebody writing a blog post at some point describing why this pattern leads to trouble... I'd also be a little worried about it making method ambiguity cases worse.I realize that opening this issue might seem like more of an annoyance than anything else since the ship has sailed and now we'd have to deal with breakage. However there might still be room to add a few methods such as, perhaps
Perhaps it's already fine for packages to include these but in that case perhaps it should be documented.