ehogan opened this issue 2 years ago
@ehogan Under the hood, for metadata `attributes` dictionary comparisons we use python-xxhash (see xxHash).

The fact that you can put scalar numpy arrays into the dictionary is the issue here. Although `np.float64(89.5)` is equivalent to `89.5`, they are not the same thing, and we've chosen to discriminate based on that. This is why `assert cube1.metadata == cube2.metadata` fails, and the `difference` method is consistent with that behaviour: it reports that there is indeed a difference between the `attributes` dictionaries.
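For instance, a quick illustration of "equivalent but not the same thing" (this snippet is not from the original report):

```python
import numpy as np

x = np.float64(89.5)
y = 89.5

print(x == y)              # True  -> the values are equivalent
print(type(x) is type(y))  # False -> but they are not the same type
```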
Perhaps it might be more helpful to the user if the `difference` method highlighted this explicitly :thinking:
You might reasonably ask why, oh why, do we care about such a difference? One concrete example is cube arithmetic, where the metadata of two operand cubes is combined into one resultant cube. Given your example above, which version of `attributes["value1"]` should be used in the resultant cube's metadata? The decision is not clear, hence the difference does kinda matter.

Note that `cube1.attributes == cube2.attributes` is a simple Python dictionary comparison... and in this case gives a different result, because the Python built-in comparison being performed is different.
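For example (again, just an illustration, not from the original report):

```python
import numpy as np

# A plain dict comparison defers to the values' own __eq__, so equivalent
# values of different types compare equal:
print({"value1": np.float64(89.5)} == {"value1": 89.5})  # True
```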
I can't imagine that we'll subclass a `Dict` and specialise the `__eq__` behaviour to align with the metadata `__eq__`.

At the very least this behaviour should be highlighted and clarified... but do note that you are highlighting a special case here of using a numpy scalar array. Your example doesn't extend to non-scalar numpy arrays, e.g.,
```python
>>> import numpy as np
>>> a = dict(values=np.arange(3))
>>> b = dict(values=[0, 1, 2])
>>> a == b
Traceback (most recent call last):
  File "<input>", line 1, in <module>
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
```
Also, note that the metadata `combine`, `compare` and `difference` behaviours avoid this type of `ValueError`, due to the approach we've adopted for the special treatment of `attributes` dictionaries.
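A small sketch of what that means in practice (the cube construction and the attribute values here are illustrative, not from the original report):

```python
import numpy as np
import iris.cube

cube_a = iris.cube.Cube(np.zeros(1), attributes={"values": np.arange(3)})
cube_b = iris.cube.Cube(np.zeros(1), attributes={"values": [0, 1, 2]})

# Per the comment above, the special handling of the attributes dictionary
# means this does not raise the ambiguous-truth-value ValueError that a
# naive element-wise comparison of the array values would:
print(cube_a.metadata.difference(cube_b.metadata))
```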
@ehogan Note also that this change in behaviour was introduced in iris 3.0.0, and was one of the many reasons for a major release of iris, i.e., it breaks backwards compatibility for some behaviours. Reference #3785
And here is a view from someone who all too frequently (to say the least...) has to deal with all sorts of metadata (and data) imperfections. Model data that is supposed to be one homogeneous set, but was produced in different batches, on different machines, quality controlled by different people, using different [versions of] software tools, and goodness knows what else lies behind the imperfections. In the best of worlds such problems would be non-existent, or the data simply rejected/fixed by the producers. But I am afraid that I do not live in that world.

Last year we processed something like 10,000-15,000 files from a large online database, and something like 60% had metadata (or data) imperfections. Most of these were easy to spot and could be fixed with a few steps of preprocessing. That left something like 500 files making up about 30-50 supposedly homogeneous dataset units; homogeneous in the sense that `my_cube_list.concatenate_cube()` should go down just like a sip of coffee. As it was impossible to guess where these imperfections were, in which file, in which metadata attribute, or which file(s) had coordinates that differed in the LSB, trial and error was the method. Not preferred, but at hand. A lot of coffee was consumed before that smooth sip finally went down.
So, what conclusions to draw from this with any bearing on iris? Making it clear in the output exactly what a value is, e.g. `np.float64(89.5)` or `89.5`? I think that steps in this direction will be immensely useful for all and everyone dealing with data from various sources.
@larsbarring have you looked at cube_helper? I've not used it myself, but I think it was written for CMIP ensembles. I think we're all agreed that Iris's own `merge` and `concatenate` functions should be more lenient, though.
Thanks @bjlittle :) I'm happy with the discrimination between `np.float64(89.5)` and `89.5`. I do think it would be super helpful if the output showed the types of the values, as it is not clear what the difference is in the output (the values look the same to me!). So instead of:

```
CubeMetadata(attributes=({'value1': 89.5}, {'value1': 89.5}))
```

could the output be:

```
CubeMetadata(attributes=({'value1': np.float64(89.5)}, {'value1': 89.5}))
```

or similar?
Also, would it be worth adding to the documentation that `assert cube1.attributes == cube2.attributes` doesn't distinguish between types in the same way as `cube1.metadata.difference(cube2.metadata)`? If this is something you think is worth doing, I'd be happy to have a go at a PR? Or, alternatively, I'd be happy to review something? :)
In order to maintain a backlog of relevant issues, we automatically label them as stale after 500 days of inactivity.
If this issue is still important to you, then please comment on this issue and the stale label will be removed.
Otherwise this issue will be automatically closed in 28 days time.
This issue is still important to me :)
## 🐛 Bug Report
I have two cubes. When I use `cube1.metadata.difference(cube2.metadata)`, my metadata shows differences, even though the values look the same. However, when I use `assert cube1.attributes == cube2.attributes` I get no `AssertionError`.

I discovered this issue because:

- with the previous default environment (`scitools/default-previous`) I can run `assert cube1.metadata == cube2.metadata` and I get no `AssertionError`;
- with the current default environment (`scitools/default-current`) I get an `AssertionError` when I run `assert cube1.metadata == cube2.metadata`.
## How To Reproduce
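(The original reproduction snippet is not shown here; the following is a minimal sketch of the kind of set-up described above, assuming two cubes whose only difference is that `value1` is a numpy `float64` scalar in one and a plain Python `float` in the other.)

```python
# Hypothetical sketch, not the original snippet: two cubes whose "value1"
# attribute holds equivalent values of different types.
import numpy as np
import iris.cube

cube1 = iris.cube.Cube(np.zeros(1), attributes={"value1": np.float64(89.5)})
cube2 = iris.cube.Cube(np.zeros(1), attributes={"value1": 89.5})

# Plain dict comparison: the values compare equal, so this passes.
assert cube1.attributes == cube2.attributes

# Metadata comparison discriminates on type (Iris >= 3.0), so this is
# expected to fail:
# assert cube1.metadata == cube2.metadata

# ...and the difference is reported, although the repr makes both sides
# look identical, roughly:
#   CubeMetadata(attributes=({'value1': 89.5}, {'value1': 89.5}))
print(cube1.metadata.difference(cube2.metadata))
```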
## Expected behaviour
I'm not sure what would be best?

Either:

- `cube1.metadata.difference(cube2.metadata)` shows the types in the difference, so it's clear where the difference is (because the values look the same and this took longer than I care to admit to figure out) :)

Or:

- `assert cube1.attributes == cube2.attributes` fails

Or both?