JuliaData / DataAPI.jl

A data-focused namespace for packages to share functions
Other
33 stars 13 forks source link

Implement `hasmetadata(x)` #50

Closed Tokazama closed 2 years ago

Tokazama commented 2 years ago

Simple method for checking if an object has metadata attached.

codecov[bot] commented 2 years ago

Codecov Report

Base: 96.55% // Head: 96.61% // Increases project coverage by +0.05% :tada:

Coverage data is based on head (f93cf74) compared to base (32ef840). Patch coverage: 100.00% of modified lines in pull request are covered.

Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #50 +/- ## ========================================== + Coverage 96.55% 96.61% +0.05% ========================================== Files 1 1 Lines 58 59 +1 ========================================== + Hits 56 57 +1 Misses 2 2 ``` | [Impacted Files](https://codecov.io/gh/JuliaData/DataAPI.jl/pull/50?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=JuliaData) | Coverage Δ | | |---|---|---| | [src/DataAPI.jl](https://codecov.io/gh/JuliaData/DataAPI.jl/pull/50/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=JuliaData#diff-c3JjL0RhdGFBUEkuamw=) | `96.61% <100.00%> (+0.05%)` | :arrow_up: | Help us with your feedback. Take ten seconds to tell us [how you rate us](https://about.codecov.io/nps?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=JuliaData). Have a feature suggestion? [Share it here.](https://app.codecov.io/gh/feedback/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=JuliaData)

:umbrella: View full report at Codecov.
:loudspeaker: Do you have feedback about the report comment? Let us know in this issue.

bkamins commented 2 years ago

Two questions:

  1. Should this function mean "this object may have metadata" or
  2. This object actually has metadata

The idea behind the initial design was that point 2 is handled via:

isempty(metadatakeys(obj))

If you want 1. you essentially ask if object has metadata! setter defined. I understand you would find it useful to avoid having to do try-catch blocks around metadata!?

Tokazama commented 2 years ago

Two questions:

  1. Should this function mean "this object may have metadata" or
  2. This object actually has metadata

The idea behind the initial design was that point 2 is handled via:

isempty(metadatakeys(obj))

If you want 1. you essentially ask if object has metadata! setter defined. I understand you would find it useful to avoid having to do try-catch blocks around metadata!?

I basically want to reduce any potential overhead that isempty(metadatakeys(obj)) might have when all I'm doing is querying for the existence of metadata. For Dict this isn't an issue, but if someone decides to use something like ImmutableDict then we have to iterate the entire structure first then look up the key to avoid the try-catch block for a simple get. I know this seems trivial when working with tables, but when working with image metadata it can be hard on constant propagation and inference.

Perhaps a better alternative would be to just go straight to the point and have getmeta(obj, key, default). That and metadatakeys can get you pretty far.

bkamins commented 2 years ago

Do you mean you want to avoid this one allocation:

julia> d = Base.ImmutableDict([i => i for i in 1:10^4]...);

julia> @time isempty(keys(d));
  0.000005 seconds (1 allocation: 16 bytes)

? Doing isempty directly:

julia> @time isempty(d);
  0.000005 seconds

has roughly the same time but does one allocation less.


For a reference let me explain the design in DataFrames.jl that we use. Metadata is stored as Union{Nothing, Dict}. We always make sure that if there is no metadata the relevant field stores nothing. In this way if there is no metadata the query is non allocating, as we first check if this value is nothing and in this case just return ().

Tokazama commented 2 years ago

The performance related bit is mostly motivated by the fact that we don't enforce that the underlying metadata is stored in a physical dictionary. Worst case scenario is that the collection of keys has to be created at run time. I admit that this isn't terribly likely, and if that were the case it's unlikely that runtime performance needs this sort of micro-optimization.

Is the concern the extra overhead of adding new methods here or the implementation? I'm mostly trying to put together tools for propagating metadata in a structured way and figured this was a simple and agreeable place to start.

bkamins commented 2 years ago

Is the concern the extra overhead of adding new methods

This is the concern. I thought to have a minimal set of methods that need to be defined. If we have too many methods there is a risk that packages that implement metadata will not do it correctly.

I am not opposed to adding hasmetadata as I also thought about it, but I do not want to rush with it 😄.

The point is that when you write:

we don't enforce that the underlying metadata is stored in a physical dictionary. Worst case scenario is that the collection of keys has to be created at run time.

This is true and this is what e.g. happens in SubDataFrame in DataFrames.jl (this is a view and it dynamically discovers metadata from the parent DataFrame). My thinking was that:

In summary: what I want is to carefully think if we need this method. I understand that the rationale for adding it is performance only, so let us make sure that in practice it is indeed the case that some packages will need it for performance (i.e. do you have a concrete case where you already know it is needed)?

CC @nalimilan

nalimilan commented 2 years ago

Maybe we can wait until we find an actual implementation where the current reliance on isempty(metadatakeys(x)) isn't efficient enough? Anyway we can easily define hasmetadata(x) = isempty(metadatakeys(x)) at that point and it will be backward-compatible. (Defining hasmetadata(x) = false would be breaking, as existing implementations like DataFrames.jl don't define it even though they have metadata.)

Tokazama commented 2 years ago
  • in general (and this is the case in DataFrames.jl) if we add hasmetadata to SubDataFrame it will still be an O(n) cost - i.e. if it is not possible to list keys cheaply it is not possible to check if there are any keys cheaply (of course there might be cases when the former is expensive and the latter is cheap, but, for instance, in DataFrames.jl this is not the case)

I didn't consider this situation, which would pretty much defeat the point of having something that could be a generic and performant solution. I've skimmed through a lot of those changes in DataFrames.jl but I should probably look at SubDataFrame so I have some more context.

Maybe we can wait until we find an actual implementation where the current reliance on isempty(metadatakeys(x)) isn't efficient enough? Anyway we can easily define hasmetadata(x) = isempty(metadatakeys(x)) at that point and it will be backward-compatible. (Defining hasmetadata(x) = false would be breaking, as existing implementations like DataFrames.jl don't define it even though they have metadata.)

Also a good point. I've been hesitant to dump a bunch of code in a PR or issue, because it gets a bit difficult to see the forest from the trees with all the moving pieces. I was hoping this would help provide further common ground for building some more sophisticated methods, but it might be more damaging than helpful at this point.

Thanks for taking the time to provide a thoughtful review here. Should I open an issue for some of the metadata methods I think are going to be necessary or would it be preferable that I wait until I have a more solid implementation for a PR?

bkamins commented 2 years ago

I think it is better to discuss your proposals now (even without implementation) on a design level, so that when you implement things the design is settled.