Closed Tokazama closed 2 years ago
Base: 96.55% // Head: 96.61% // Increases project coverage by +0.05%
:tada:
Coverage data is based on head (
f93cf74
) compared to base (32ef840
). Patch coverage: 100.00% of modified lines in pull request are covered.
:umbrella: View full report at Codecov.
:loudspeaker: Do you have feedback about the report comment? Let us know in this issue.
Two questions:
The idea behind the initial design was that point 2 is handled via:
isempty(metadatakeys(obj))
If you want 1. you essentially ask if object has metadata!
setter defined. I understand you would find it useful to avoid having to do try
-catch
blocks around metadata!
?
Two questions:
- Should this function mean "this object may have metadata" or
- This object actually has metadata
The idea behind the initial design was that point 2 is handled via:
isempty(metadatakeys(obj))
If you want 1. you essentially ask if object has
metadata!
setter defined. I understand you would find it useful to avoid having to dotry
-catch
blocks aroundmetadata!
?
I basically want to reduce any potential overhead that isempty(metadatakeys(obj))
might have when all I'm doing is querying for the existence of metadata. For Dict
this isn't an issue, but if someone decides to use something like ImmutableDict
then we have to iterate the entire structure first then look up the key to avoid the try
-catch
block for a simple get. I know this seems trivial when working with tables, but when working with image metadata it can be hard on constant propagation and inference.
Perhaps a better alternative would be to just go straight to the point and have getmeta(obj, key, default)
. That and metadatakeys
can get you pretty far.
Do you mean you want to avoid this one allocation:
julia> d = Base.ImmutableDict([i => i for i in 1:10^4]...);
julia> @time isempty(keys(d));
0.000005 seconds (1 allocation: 16 bytes)
?
Doing isempty
directly:
julia> @time isempty(d);
0.000005 seconds
has roughly the same time but does one allocation less.
For a reference let me explain the design in DataFrames.jl that we use.
Metadata is stored as Union{Nothing, Dict}
. We always make sure that if there is no metadata the relevant field stores nothing
. In this way if there is no metadata the query is non allocating, as we first check if this value is nothing
and in this case just return ()
.
The performance related bit is mostly motivated by the fact that we don't enforce that the underlying metadata is stored in a physical dictionary. Worst case scenario is that the collection of keys has to be created at run time. I admit that this isn't terribly likely, and if that were the case it's unlikely that runtime performance needs this sort of micro-optimization.
Is the concern the extra overhead of adding new methods here or the implementation? I'm mostly trying to put together tools for propagating metadata in a structured way and figured this was a simple and agreeable place to start.
Is the concern the extra overhead of adding new methods
This is the concern. I thought to have a minimal set of methods that need to be defined. If we have too many methods there is a risk that packages that implement metadata will not do it correctly.
I am not opposed to adding hasmetadata
as I also thought about it, but I do not want to rush with it 😄.
The point is that when you write:
we don't enforce that the underlying metadata is stored in a physical dictionary. Worst case scenario is that the collection of keys has to be created at run time.
This is true and this is what e.g. happens in SubDataFrame
in DataFrames.jl (this is a view and it dynamically discovers metadata from the parent DataFrame
). My thinking was that:
hasmetadata
to SubDataFrame
it will still be an O(n) cost - i.e. if it is not possible to list keys cheaply it is not possible to check if there are any keys cheaply (of course there might be cases when the former is expensive and the latter is cheap, but, for instance, in DataFrames.jl this is not the case)In summary: what I want is to carefully think if we need this method. I understand that the rationale for adding it is performance only, so let us make sure that in practice it is indeed the case that some packages will need it for performance (i.e. do you have a concrete case where you already know it is needed)?
CC @nalimilan
Maybe we can wait until we find an actual implementation where the current reliance on isempty(metadatakeys(x))
isn't efficient enough? Anyway we can easily define hasmetadata(x) = isempty(metadatakeys(x))
at that point and it will be backward-compatible. (Defining hasmetadata(x) = false
would be breaking, as existing implementations like DataFrames.jl don't define it even though they have metadata.)
- in general (and this is the case in DataFrames.jl) if we add
hasmetadata
toSubDataFrame
it will still be an O(n) cost - i.e. if it is not possible to list keys cheaply it is not possible to check if there are any keys cheaply (of course there might be cases when the former is expensive and the latter is cheap, but, for instance, in DataFrames.jl this is not the case)
I didn't consider this situation, which would pretty much defeat the point of having something that could be a generic and performant solution. I've skimmed through a lot of those changes in DataFrames.jl but I should probably look at SubDataFrame so I have some more context.
Maybe we can wait until we find an actual implementation where the current reliance on
isempty(metadatakeys(x))
isn't efficient enough? Anyway we can easily definehasmetadata(x) = isempty(metadatakeys(x))
at that point and it will be backward-compatible. (Defininghasmetadata(x) = false
would be breaking, as existing implementations like DataFrames.jl don't define it even though they have metadata.)
Also a good point. I've been hesitant to dump a bunch of code in a PR or issue, because it gets a bit difficult to see the forest from the trees with all the moving pieces. I was hoping this would help provide further common ground for building some more sophisticated methods, but it might be more damaging than helpful at this point.
Thanks for taking the time to provide a thoughtful review here. Should I open an issue for some of the metadata methods I think are going to be necessary or would it be preferable that I wait until I have a more solid implementation for a PR?
I think it is better to discuss your proposals now (even without implementation) on a design level, so that when you implement things the design is settled.
Simple method for checking if an object has metadata attached.