Closed bkamins closed 2 years ago
Base: 96.55% // Head: 95.34% // Decreases project coverage by -1.20%
:warning:
Coverage data is based on head (
1dd4887
) compared to base (32ef840
). Patch coverage: 100.00% of modified lines in pull request are covered.
:umbrella: View full report at Codecov.
:loudspeaker: Do you have feedback about the report comment? Let us know in this issue.
I also changed the requirement from specific exception types to just throwing errors in the docstrings.
@ExpandingMan + @nalimilan
I would propose to try to keep the momentum in this discussion (of course respecting your other duties 😄). I think that it is pretty clear what we need to decide:
default
kwarg now or wait till users need it (I am OK with either approach - but I agree with what @nalimilan proposed that we can define it later if users need it after we learn how people use metadata; I already thought how to do it with kwarg)metadata
and colmetadata
methods that would return full information on metadata now or wait till users need it (I am OK with either approach; the comment is the same as above)metadatakeys
and colmetadatakeys
by default (with returning ()
) or we want to throw MethodError
if metadata is not defined (I would keep returning ()
)The issue of column consisting of several chunks should be resolved in Parquet2.jl I think (as I have commented in #52).
Thank you!
I also commented in #52 .
In response to your questions above (respectively):
default
methods in current Parquet2.jl master, but otherwise I don't feel any great urgency on this.AbstractDict
objects and without a method to return the whole thing, we are mandating that these be copied rather than referenced. It seems silly to have the objects with all the behavior we want just sitting there and not expose them.metadatakeys
and colmetadatakeys
methods since it is ambiguous whether object metadata is invalid (should be an error) or simply empty in a particular case.So, in summary, yes, this PR looks good to me.
I don't really want to remove the
default
methods in current Parquet2.jl master
Parquet2.jl can have more definitions than DataAPI.jl - this is OK. Let us just decide if default
should be positional or kwarg as commended in #52, so that when we later add something to DataAPI.jl there is less friction.
It seems silly to have the objects with all the behavior we want just sitting there and not expose them.
Then I would for now define metadata(obj)
and colmetadata(obj, col)
for now (without style
kwarg) and require them to return an AbstractDict
with values that are (value, style)
and add the comment in the docstring that they are indented as low-level API. The reason for not supporting the style
kwarg is simplicity. Note that in DataFrames.jl we would need to always return a copy for performance reasons (so this is a different case than Parquet2.jl).
I think we should remove the
metadatakeys
andcolmetadatakeys
methods since it is ambiguous whether object metadata is invalid (should be an error) or simply empty
@nalimilan - what do you think. My original thinking was that every object supports metadata, but just for some objects there is no way to set metadata with e.g. metadata!
so such objects just always have an empty set of metadata keys. But we can switch to the definition proposed by @ExpandingMan that some objects do not support metadata (and that this is different from having no metadata).
@Tokazama - can you please also comment on this point, as you are creating a more general metadata mechanisms.
It would be preferable to have metadata(obj, key, default)
instead of a keyword argument. In my experience with LoopVectorization, we found some instances where managing keywords would do some funny stuff to performance sensitive code.
Then I would for now define
metadata(obj)
andcolmetadata(obj, col)
for now (withoutstyle
kwarg) and require them to return anAbstractDict
with values that are(value, style)
and add the comment in the docstring that they are indented as low-level API. The reason for not supporting thestyle
kwarg is simplicity. Note that in DataFrames.jl we would need to always return a copy for performance reasons (so this is a different case than Parquet2.jl).
This is at the heart of why I'm hesitant to just return the underlying metadata storage type. There are going to be different approaches for how style
is stored. We currently only require style can be returned as part of a tuple and that :default
is somehow supported. There are instances where the style is the same for each value and you may just want to have something that stores that alongside your dictionary. I'm also approaching some of this with metadata backed by NamedTuple
and subtypes of MetaStyle
so that the current API is met doing (md[key], MetaStyle(md[key])
.
@Tokazama - so what would be your recommendation? I understand that:
default
a positional argument (like in Parquet2.jl now) - I accept your reasoning and we can agree to have it this way (we can add it to DataAPI.jl at any time, but Parquet2.jl can already add it now)metadata(obj)
and colmetadata(obj, col)
in DataAPI.jl as different types can internally support storing of metadata in different ways (Parquet2.jl can define it for its own purposes as it does not conflict with DataAPI.jl but probably should document this as an extension that potentially might change in the future - though this is unlikely that we will have a conflict there)@Tokazama - and what do you think about the default behavior for metadatakeys
? should it be MethodError
or ()
should be returned?
- do not provide
metadata(obj)
andcolmetadata(obj, col)
in DataAPI.jl as different types can internally support storing of metadata in different ways (Parquet2.jl can define it for its own purposes as it does not conflict with DataAPI.jl but probably should document this as an extension that potentially might change in the future - though this is unlikely that we will have a conflict there)
I can make metadata(obj)
work (with a little work on top of PropertyDicts.jl). I just wanted to point out some reasons why we may not have decided to do this originally. Even this seemingly innocuous solution could have some sharp edges.
@Tokazama - and what do you think about the default behavior for
metadatakeys
? should it beMethodError
or()
should be returned?
I can see how there's a bit of awkwardness around defining default keys, especially when we don't have default metadata elsewhere. It does leave us without any generic way of checking if an object has metadata though.
Having something like metadata(obj)
would solve #51, so perhaps that's the way to go.
I think it would be worth having MetadataStyle
and MetadataDefault <: MetadataStyle
here at some point, but I've been trying to work on some compelling examples before fully proposing that.
I think we should remove the
metadatakeys
andcolmetadatakeys
methods since it is ambiguous whether object metadata is invalid (should be an error) or simply empty@nalimilan - what do you think. My original thinking was that every object supports metadata, but just for some objects there is no way to set metadata with e.g.
metadata!
so such objects just always have an empty set of metadata keys. But we can switch to the definition proposed by @ExpandingMan that some objects do not support metadata (and that this is different from having no metadata).
I agree it's useful to have fallbacks returning ()
, as otherwise as @Tokazama noted there's no way to check whether an object contains metadata (other than try... catch
, which shouldn't be used for this). If we dropped these we would have to add hasmetadata
returning a Boolean. We discussed this when adding these functions. Another option I mentioned was to have the fallbacks return nothing
. But it's not clear what's the point of being able to differentiate objects that don't support metadata from objects that support it but don't have any.
BTW I realize we don't provide a way to check whether an object supports metadata!
. Maybe there should be one?
I also agree that we'd better add the default
argument now (either positional or keyword) if Parquet implements it, to ensure API consistency.
BTW I realize we don't provide a way to check whether an object supports
metadata!
.
This was intentional. I did not see a use case of querying it. I.e. in what situation would you want to add metadata without knowing if some object supports it?
I will add default
as a positional argument as @Tokazama proposed.
This was intentional. I did not see a use case of querying it. I.e. in what situation would you want to add metadata without knowing if some object supports it?
I was thinking about some package which would accept any Tables.jl table and attach metadata to it if possible.
I was thinking about some package which would accept any Tables.jl table and attach metadata to it if possible.
But what would be the use of it if after the operation you would not be sure if metadata was added?
But maybe it makes sense. How would you call these functions ismetadatamutable
and iscolmetadatamutable
?
PS. I have added default
.
But what would be the use of it if after the operation you would not be sure if metadata was added?
I mean that the object would be returned to the caller, which would presumably choose a type that supports metadata if they care about it. For example the package could set column labels or units, which are nice but not strictly necessary to use the data.
But maybe it makes sense. How would you call these functions
ismetadatamutable
andiscolmetadatamutable
?
Yeah, that's a reasonable choice.
What I did:
metadatasuport
and colmetadatasupport
that return a named tuple indicating if metadata is supported for a given type for reading and writing (two fields); this is done for type, not for value as I think we should have a requirement that this information should be possible to be resolved statically during compile time; @Tokazama - in your use case (if I understand your approach correctly this means that once you load a package that allows setting and getting metadata for some type you need to register this type by adding methods for one or both of these functions)metadatakeys
and colmetadatakeys
implementations - by default they error now (as @ExpandingMan proposed); this is OK as we add the *support
methodsmetadata(obj)
exposing a dict for performance reasons. I think we do not need it for performance reasons as metadatakeys(obj)
should have almost zero cost, and later users should query metadata with metadata(obj, key)
. (of course you can add these to Parque2.jl if you want but I think using the API without this convenience function should not lead to performance regressions, and as @Tokazama noted sometimes requiring to return such an object might be problematic)I think that after this change (and comments) all that was expected is changed (but please comment of course, as usual maybe function naming will need discussion).
The implemented changes are slightly breaking (as we stop returning ()
for metadatakeys
and colmetadatakeys
and instead throw an error, but I think it is acceptable; all people using metadata are in this thread + I will add an appropriate note in release notes)
if I understand your approach correctly this means that once you load a package that allows setting and getting metadata for some type you need to register this type by adding methods for one or both of these functions
Yeah, I think that should suffice.
I've looked through the changes and they seem okay, but I haven't had time to meditate on them much. It's difficult to predict the implications of each decision, so the work put in on your end is definitely appreciated.
It's difficult to predict the implications of each decision
Agreed. That is why initially we wanted with @nalimilan to define a minimal set of functions to give us most flexibility later. Maybe if @ExpandingMan comments that the changes are also OK then this is a good sign (as we have a real test in Parquet2.jl where it is tested).
@Tokazama + @ExpandingMan - is it OK to merge this PR as-is? (an this would be a released API that hopefully we would not change in DataAPI.jl soon)
This looks good to me, it seems to improve on most of what I commented on in my original issue.. In general, the only thing I'd ever get worried about are changes that for whatever reason lock us out of adding methods in the future, and I don't think any of this does that.
Note also that I have not tagged Parquet2.jl yet so that'll give us a chance to run tests on that again.
Is the default MethodError
the correct way to handle errors when we have an object without any metadata that could otherwise have some? In other words, if we call metadata(df::DataFrame, key)
and df
has nothing
for its metadata, should we be throwing a KeyError(key)
or MethodError(metadata, (df, key))
?
The method error is not explicitly thrown, it is only thrown if you call a function on an object it doesn't have a method for. Therefore you won't get this with DataFrame
, you'll get a KeyError
. The problem with before this PR was that you would get ArgumentError
even on objects that did not implement any metadata methods which not only makes no sense but can lead to method ambiguities.
The method error is not explicitly thrown, it is only thrown if you call a function on an object it doesn't have a method for. Therefore you won't get this with
DataFrame
, you'll get aKeyError
. The problem with before this PR was that you would getArgumentError
even on objects that did not implement any metadata methods which not only makes no sense but can lead to method ambiguities.
I understand functionally what is happening here. I'm just wondering if there's a correct error to produce on my end. If there is no metadata collection attached (but it could have metadata attached), should it still be a KeyError
? It's not terribly important. I'm just trying to do this correctly.
I'm just wondering if there's a correct error to produce on my end.
The current state of this PR says that we should throw an error in this case without specifying error type.
Different packages can use different strategies what exact error type to return if metadata is missing. KeyError
is most natural most of the cases. MethodError
should only be returned if some object does not support metadata.
However, for example in DataFrames.jl with @nalimilan we decided to throw ArgumentError
instead of KeyError
if metadata key is missing. (although my initial implementation was throwing KeyError
). The reason is easier error handling as we return ArgumentError
consistently in other places in DataFrames.jl.
In general AFAICT we usually in Julia follow https://www.tensorflow.org/guide/versions rules for exceptions which, in particular state that changing the type of exception thrown is not breaking unless you documented the type of exception thrown (this is exactly now the docs do not specify exception type, but just state that exception is thrown).
Note also that I have not tagged Parquet2.jl yet so that'll give us a chance to run tests on that again.
Thank you. I think the only thing that needs adding is metadatasupport
and colmetadatasupport
functions in Parquet2.jl.
You define the convenience metadata(obj)
and colmetadata(obj, col)
that expose the dict, but I think it is not a problem (maybe just add a note in Parquet2.jl that this is an extension to the standard API from DataAPI.jl?)
@ExpandingMan - I have created an account 😄 and commented in https://gitlab.com/ExpandingMan/Parquet2.jl/-/issues/17 to keep track of todo things.
If there will be no additional comments on this PR I will merge it on Monday, Oct 3, 2022 and make a 1.12.0 release. OK?
I think the only thing that needs adding is metadatasupport and colmetadatasupport functions in Parquet2.jl. You define the convenience metadata(obj) and colmetadata(obj, col) that expose the dict, but I think it is not a problem (maybe just add a note in Parquet2.jl that this is an extension to the standard API from DataAPI.jl?)
I should comment on that as that may be the one thing that I don't feel is addressed in this PR. In the case of Parquet2.jl, exposing those methods is the only way to get the full metadata dict without copying it. I'm not sure I followed all of the above, but I still think there are likely to be a large number of cases like Parquet2.jl in which the metadata dict object we want is already "just sitting around" and not exposing a method to get it forces it to be copied.
I don't think this needs to be addressed now as long as the door is open, but thought I'd mention it.
I don't think this needs to be addressed now as long as the door is open, but thought I'd mention it.
Yeah, I don't think it's something set in stone. We're just trying to move forward conservatively so that we get work done now without requiring huge breaking changes down the road.
Exactly - I think these methods in the future can be added to the main DataAPI.jl specification, but since this is only a performance optimization issue it can be left for later. I would just recommend to add a documentation of these two extra methods in Parquet2.jl.
Thank you! I will release DataAPI.jl 1.12.0 soon.
Fixes https://github.com/JuliaData/DataAPI.jl/issues/52
In this PR I implement conclusions from #52 discussion.
Currently I implemented dropping of default methods. The question is - do we also want to drop
metadatakeys
andcolmetadatakeys
. I kept them returning()
by default for now. This is to make it easier to write generic functions (i.e. i check if there are some metadata keys, and if some type does not support metadata I get an empty tuple so I know there are no metadata without having to catch an error)