JuliaData / DataAPI.jl

A data-focused namespace for packages to share functions
Other
33 stars 13 forks source link

Update metadata API #53

Closed bkamins closed 2 years ago

bkamins commented 2 years ago

Fixes https://github.com/JuliaData/DataAPI.jl/issues/52

In this PR I implement conclusions from #52 discussion.

Currently I implemented dropping of default methods. The question is - do we also want to drop metadatakeys and colmetadatakeys. I kept them returning () by default for now. This is to make it easier to write generic functions (i.e. i check if there are some metadata keys, and if some type does not support metadata I get an empty tuple so I know there are no metadata without having to catch an error)

codecov[bot] commented 2 years ago

Codecov Report

Base: 96.55% // Head: 95.34% // Decreases project coverage by -1.20% :warning:

Coverage data is based on head (1dd4887) compared to base (32ef840). Patch coverage: 100.00% of modified lines in pull request are covered.

Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #53 +/- ## ========================================== - Coverage 96.55% 95.34% -1.21% ========================================== Files 1 1 Lines 58 43 -15 ========================================== - Hits 56 41 -15 Misses 2 2 ``` | [Impacted Files](https://codecov.io/gh/JuliaData/DataAPI.jl/pull/53?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=JuliaData) | Coverage Δ | | |---|---|---| | [src/DataAPI.jl](https://codecov.io/gh/JuliaData/DataAPI.jl/pull/53/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=JuliaData#diff-c3JjL0RhdGFBUEkuamw=) | `95.34% <100.00%> (-1.21%)` | :arrow_down: | Help us with your feedback. Take ten seconds to tell us [how you rate us](https://about.codecov.io/nps?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=JuliaData). Have a feature suggestion? [Share it here.](https://app.codecov.io/gh/feedback/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=JuliaData)

:umbrella: View full report at Codecov.
:loudspeaker: Do you have feedback about the report comment? Let us know in this issue.

bkamins commented 2 years ago

I also changed the requirement from specific exception types to just throwing errors in the docstrings.

bkamins commented 2 years ago

@ExpandingMan + @nalimilan

I would propose to try to keep the momentum in this discussion (of course respecting your other duties 😄). I think that it is pretty clear what we need to decide:

The issue of column consisting of several chunks should be resolved in Parquet2.jl I think (as I have commented in #52).

Thank you!

ExpandingMan commented 2 years ago

I also commented in #52 .

In response to your questions above (respectively):

So, in summary, yes, this PR looks good to me.

bkamins commented 2 years ago

I don't really want to remove the default methods in current Parquet2.jl master

Parquet2.jl can have more definitions than DataAPI.jl - this is OK. Let us just decide if default should be positional or kwarg as commended in #52, so that when we later add something to DataAPI.jl there is less friction.

It seems silly to have the objects with all the behavior we want just sitting there and not expose them.

Then I would for now define metadata(obj) and colmetadata(obj, col) for now (without style kwarg) and require them to return an AbstractDict with values that are (value, style) and add the comment in the docstring that they are indented as low-level API. The reason for not supporting the style kwarg is simplicity. Note that in DataFrames.jl we would need to always return a copy for performance reasons (so this is a different case than Parquet2.jl).

I think we should remove the metadatakeys and colmetadatakeys methods since it is ambiguous whether object metadata is invalid (should be an error) or simply empty

@nalimilan - what do you think. My original thinking was that every object supports metadata, but just for some objects there is no way to set metadata with e.g. metadata! so such objects just always have an empty set of metadata keys. But we can switch to the definition proposed by @ExpandingMan that some objects do not support metadata (and that this is different from having no metadata).

@Tokazama - can you please also comment on this point, as you are creating a more general metadata mechanisms.

Tokazama commented 2 years ago

It would be preferable to have metadata(obj, key, default) instead of a keyword argument. In my experience with LoopVectorization, we found some instances where managing keywords would do some funny stuff to performance sensitive code.

Then I would for now define metadata(obj) and colmetadata(obj, col) for now (without style kwarg) and require them to return an AbstractDict with values that are (value, style) and add the comment in the docstring that they are indented as low-level API. The reason for not supporting the style kwarg is simplicity. Note that in DataFrames.jl we would need to always return a copy for performance reasons (so this is a different case than Parquet2.jl).

This is at the heart of why I'm hesitant to just return the underlying metadata storage type. There are going to be different approaches for how style is stored. We currently only require style can be returned as part of a tuple and that :default is somehow supported. There are instances where the style is the same for each value and you may just want to have something that stores that alongside your dictionary. I'm also approaching some of this with metadata backed by NamedTuple and subtypes of MetaStyle so that the current API is met doing (md[key], MetaStyle(md[key]).

bkamins commented 2 years ago

@Tokazama - so what would be your recommendation? I understand that:

@Tokazama - and what do you think about the default behavior for metadatakeys? should it be MethodError or () should be returned?

Tokazama commented 2 years ago
  • do not provide metadata(obj) and colmetadata(obj, col) in DataAPI.jl as different types can internally support storing of metadata in different ways (Parquet2.jl can define it for its own purposes as it does not conflict with DataAPI.jl but probably should document this as an extension that potentially might change in the future - though this is unlikely that we will have a conflict there)

I can make metadata(obj) work (with a little work on top of PropertyDicts.jl). I just wanted to point out some reasons why we may not have decided to do this originally. Even this seemingly innocuous solution could have some sharp edges.

@Tokazama - and what do you think about the default behavior for metadatakeys? should it be MethodError or () should be returned?

I can see how there's a bit of awkwardness around defining default keys, especially when we don't have default metadata elsewhere. It does leave us without any generic way of checking if an object has metadata though.

Tokazama commented 2 years ago

Having something like metadata(obj) would solve #51, so perhaps that's the way to go.

I think it would be worth having MetadataStyle and MetadataDefault <: MetadataStyle here at some point, but I've been trying to work on some compelling examples before fully proposing that.

nalimilan commented 2 years ago

I think we should remove the metadatakeys and colmetadatakeys methods since it is ambiguous whether object metadata is invalid (should be an error) or simply empty

@nalimilan - what do you think. My original thinking was that every object supports metadata, but just for some objects there is no way to set metadata with e.g. metadata! so such objects just always have an empty set of metadata keys. But we can switch to the definition proposed by @ExpandingMan that some objects do not support metadata (and that this is different from having no metadata).

I agree it's useful to have fallbacks returning (), as otherwise as @Tokazama noted there's no way to check whether an object contains metadata (other than try... catch, which shouldn't be used for this). If we dropped these we would have to add hasmetadata returning a Boolean. We discussed this when adding these functions. Another option I mentioned was to have the fallbacks return nothing. But it's not clear what's the point of being able to differentiate objects that don't support metadata from objects that support it but don't have any.

BTW I realize we don't provide a way to check whether an object supports metadata!. Maybe there should be one?

I also agree that we'd better add the default argument now (either positional or keyword) if Parquet implements it, to ensure API consistency.

bkamins commented 2 years ago

BTW I realize we don't provide a way to check whether an object supports metadata!.

This was intentional. I did not see a use case of querying it. I.e. in what situation would you want to add metadata without knowing if some object supports it?

I will add default as a positional argument as @Tokazama proposed.

nalimilan commented 2 years ago

This was intentional. I did not see a use case of querying it. I.e. in what situation would you want to add metadata without knowing if some object supports it?

I was thinking about some package which would accept any Tables.jl table and attach metadata to it if possible.

bkamins commented 2 years ago

I was thinking about some package which would accept any Tables.jl table and attach metadata to it if possible.

But what would be the use of it if after the operation you would not be sure if metadata was added?

But maybe it makes sense. How would you call these functions ismetadatamutable and iscolmetadatamutable?

PS. I have added default.

nalimilan commented 2 years ago

But what would be the use of it if after the operation you would not be sure if metadata was added?

I mean that the object would be returned to the caller, which would presumably choose a type that supports metadata if they care about it. For example the package could set column labels or units, which are nice but not strictly necessary to use the data.

But maybe it makes sense. How would you call these functions ismetadatamutable and iscolmetadatamutable?

Yeah, that's a reasonable choice.

bkamins commented 2 years ago

What I did:

  1. add metadatasuport and colmetadatasupport that return a named tuple indicating if metadata is supported for a given type for reading and writing (two fields); this is done for type, not for value as I think we should have a requirement that this information should be possible to be resolved statically during compile time; @Tokazama - in your use case (if I understand your approach correctly this means that once you load a package that allows setting and getting metadata for some type you need to register this type by adding methods for one or both of these functions)
  2. I removed metadatakeys and colmetadatakeys implementations - by default they error now (as @ExpandingMan proposed); this is OK as we add the *support methods
  3. @ExpandingMan - I have thought if we need metadata(obj) exposing a dict for performance reasons. I think we do not need it for performance reasons as metadatakeys(obj) should have almost zero cost, and later users should query metadata with metadata(obj, key). (of course you can add these to Parque2.jl if you want but I think using the API without this convenience function should not lead to performance regressions, and as @Tokazama noted sometimes requiring to return such an object might be problematic)

I think that after this change (and comments) all that was expected is changed (but please comment of course, as usual maybe function naming will need discussion).

The implemented changes are slightly breaking (as we stop returning () for metadatakeys and colmetadatakeys and instead throw an error, but I think it is acceptable; all people using metadata are in this thread + I will add an appropriate note in release notes)

Tokazama commented 2 years ago

if I understand your approach correctly this means that once you load a package that allows setting and getting metadata for some type you need to register this type by adding methods for one or both of these functions

Yeah, I think that should suffice.

I've looked through the changes and they seem okay, but I haven't had time to meditate on them much. It's difficult to predict the implications of each decision, so the work put in on your end is definitely appreciated.

bkamins commented 2 years ago

It's difficult to predict the implications of each decision

Agreed. That is why initially we wanted with @nalimilan to define a minimal set of functions to give us most flexibility later. Maybe if @ExpandingMan comments that the changes are also OK then this is a good sign (as we have a real test in Parquet2.jl where it is tested).

bkamins commented 2 years ago

@Tokazama + @ExpandingMan - is it OK to merge this PR as-is? (an this would be a released API that hopefully we would not change in DataAPI.jl soon)

ExpandingMan commented 2 years ago

This looks good to me, it seems to improve on most of what I commented on in my original issue.. In general, the only thing I'd ever get worried about are changes that for whatever reason lock us out of adding methods in the future, and I don't think any of this does that.

Note also that I have not tagged Parquet2.jl yet so that'll give us a chance to run tests on that again.

Tokazama commented 2 years ago

Is the default MethodError the correct way to handle errors when we have an object without any metadata that could otherwise have some? In other words, if we call metadata(df::DataFrame, key) and df has nothing for its metadata, should we be throwing a KeyError(key) or MethodError(metadata, (df, key))?

ExpandingMan commented 2 years ago

The method error is not explicitly thrown, it is only thrown if you call a function on an object it doesn't have a method for. Therefore you won't get this with DataFrame, you'll get a KeyError. The problem with before this PR was that you would get ArgumentError even on objects that did not implement any metadata methods which not only makes no sense but can lead to method ambiguities.

Tokazama commented 2 years ago

The method error is not explicitly thrown, it is only thrown if you call a function on an object it doesn't have a method for. Therefore you won't get this with DataFrame, you'll get a KeyError. The problem with before this PR was that you would get ArgumentError even on objects that did not implement any metadata methods which not only makes no sense but can lead to method ambiguities.

I understand functionally what is happening here. I'm just wondering if there's a correct error to produce on my end. If there is no metadata collection attached (but it could have metadata attached), should it still be a KeyError? It's not terribly important. I'm just trying to do this correctly.

bkamins commented 2 years ago

I'm just wondering if there's a correct error to produce on my end.

The current state of this PR says that we should throw an error in this case without specifying error type. Different packages can use different strategies what exact error type to return if metadata is missing. KeyError is most natural most of the cases. MethodError should only be returned if some object does not support metadata.

However, for example in DataFrames.jl with @nalimilan we decided to throw ArgumentError instead of KeyError if metadata key is missing. (although my initial implementation was throwing KeyError). The reason is easier error handling as we return ArgumentError consistently in other places in DataFrames.jl.

In general AFAICT we usually in Julia follow https://www.tensorflow.org/guide/versions rules for exceptions which, in particular state that changing the type of exception thrown is not breaking unless you documented the type of exception thrown (this is exactly now the docs do not specify exception type, but just state that exception is thrown).

bkamins commented 2 years ago

Note also that I have not tagged Parquet2.jl yet so that'll give us a chance to run tests on that again.

Thank you. I think the only thing that needs adding is metadatasupport and colmetadatasupport functions in Parquet2.jl. You define the convenience metadata(obj) and colmetadata(obj, col) that expose the dict, but I think it is not a problem (maybe just add a note in Parquet2.jl that this is an extension to the standard API from DataAPI.jl?)

bkamins commented 2 years ago

@ExpandingMan - I have created an account 😄 and commented in https://gitlab.com/ExpandingMan/Parquet2.jl/-/issues/17 to keep track of todo things.

If there will be no additional comments on this PR I will merge it on Monday, Oct 3, 2022 and make a 1.12.0 release. OK?

ExpandingMan commented 2 years ago

I think the only thing that needs adding is metadatasupport and colmetadatasupport functions in Parquet2.jl. You define the convenience metadata(obj) and colmetadata(obj, col) that expose the dict, but I think it is not a problem (maybe just add a note in Parquet2.jl that this is an extension to the standard API from DataAPI.jl?)

I should comment on that as that may be the one thing that I don't feel is addressed in this PR. In the case of Parquet2.jl, exposing those methods is the only way to get the full metadata dict without copying it. I'm not sure I followed all of the above, but I still think there are likely to be a large number of cases like Parquet2.jl in which the metadata dict object we want is already "just sitting around" and not exposing a method to get it forces it to be copied.

I don't think this needs to be addressed now as long as the door is open, but thought I'd mention it.

Tokazama commented 2 years ago

I don't think this needs to be addressed now as long as the door is open, but thought I'd mention it.

Yeah, I don't think it's something set in stone. We're just trying to move forward conservatively so that we get work done now without requiring huge breaking changes down the road.

bkamins commented 2 years ago

Exactly - I think these methods in the future can be added to the main DataAPI.jl specification, but since this is only a performance optimization issue it can be left for later. I would just recommend to add a documentation of these two extra methods in Parquet2.jl.

bkamins commented 2 years ago

Thank you! I will release DataAPI.jl 1.12.0 soon.