farindk opened 5 months ago
Another alternative would be to require `cmin` to come after `ispe`. We already do that for transformative properties. That way you are always guaranteed that you can de-normalize when you're parsing.
> Another alternative would be to require `cmin` to come after `ispe`. We already do that for transformative properties. That way you are always guaranteed that you can de-normalize when you're parsing.
In my view, this would not help much because relying on a specific ordering and passing around the previously decoded boxes would make the implementation rather awkward. It would also somehow negate the intention of having the focal length resolution-independent.

Such an ordering constraint will also be difficult to enforce, as there might be software out there that does not care about `cmin` and will reorganize or "optimize" the property boxes, thereby shuffling the ordering.
We can solve anything in software. Also the current definition is no showstopper. We can make it work. But a different definition would make things cleaner and less error-prone.
> Another alternative would be to require `cmin` to come after `ispe`. We already do that for transformative properties. That way you are always guaranteed that you can de-normalize when you're parsing.
Yes, I think that `cmin` should be restricted to come after `ispe` to follow the principle that these can be processed in that order. For `cmex`, that is not required, but it should still come before the transformative properties, as `irot` will imply a modification of `cmex` if I understand this correctly.
Good comments, I definitely think this should be clarified in the text.
Gut feelings on `cmin`:

- `cmin` comes after `ispe`.
- The intrinsics are linked to the sensor, so it makes most sense for it to be linked to the image prior to any transformative properties. The text already says that the fields are normalized by the `ispe` fields (irrespective of any transforms), so I think this is already covered but could be explicitly called out.
- If an image item has `cmin` and transformative properties like rotation, a parser may need to apply those to the intrinsics as well when surfacing the metadata to the caller. A 90 degree rotation will flip fx/fy and cx/cy for example.

Gut feelings on `cmex`:
I think the interactions of `cmex` and `irot`/`imir` need to be carefully considered before we impose any restrictions. My immediate thought is that specifying it after transforms makes the most sense, at least in some use-cases.
If we start with how current smartphones work, taking a photo in LandscapeLeft orientation or LandscapeRight orientation makes no difference. The images are captured in sensor orientation but get the appropriate `irot` added so that they display the same.
If the smartphone has multiple sensors and could capture stereo video or stills, I would expect the same to hold. Images are captured in sensor orientation, with the appropriate `irot` applied so that the images are always displayed the same regardless of whether LandscapeLeft or LandscapeRight orientation was used. If we name our sensors A and B, I would expect the stereo pair group to be [A,B] in one orientation and [B,A] in the other. I would also expect the extrinsics to match the information of the stereo pair group. In other words, if the group says [A,B], the extrinsics say that A is to the left of B, and vice versa. This to me would indicate that the extrinsics should come after `irot`.
The counterargument would be that you don't want to use this for stereo capture at all and simply want to document your sensor setup. In that case it probably makes most sense to specify it before transformative properties.
> - The intrinsics are linked to the sensor, so it makes most sense for it to be linked to the image prior to any transformative properties. The text already says that the fields are normalized by the `ispe` fields (irrespective of any transforms), so I think this is already covered but could be explicitly called out.
Yes, that makes sense, and I also would favor this, but see below.
> - If an image item has `cmin` and transformative properties like rotation, a parser may need to apply those to the intrinsics as well when surfacing the metadata to the caller. A 90 degree rotation will flip fx/fy and cx/cy for example.
Yes, if the `cmin` refers to the original sensor, all transformative properties have to update the intrinsics accordingly.
However, this is much more complicated.
> A 90 degree rotation will flip fx/fy and cx/cy for example.
No, this is wrong.
An `irot` 90-degree image rotation corresponds to a transformation matrix of the form (with a translation $t_x$, $t_y$ that keeps the rotated image in the positive quadrant):

$$T = \begin{pmatrix} 0 & -1 & t_x \\ 1 & 0 & t_y \\ 0 & 0 & 1 \end{pmatrix}$$
Hence, we have to left-multiply this onto the transform chain ($E$ is the extrinsic matrix, $C$ the intrinsic matrix):

$$p' = T\,C\,E\,p$$
Now if we combine the `irot` and intrinsic matrices (writing $t_x$, $t_y$ for the translation part of the rotation), we get this funny result:

$$T\,C = \begin{pmatrix} 0 & -f_y & t_x - c_y \\ f_x & s & c_x + t_y \\ 0 & 0 & 1 \end{pmatrix}$$
Which is clearly not a proper intrinsic matrix anymore.
The right way to handle `irot` instead would be to multiply the extrinsic matrix by a 90-degree camera rotation.
However, on second thought, this is also much more complicated because the `irot` rotation is not around the optical axis (going through the principal point), but around the plain image center. And the whole thing also interacts with `clap`. It is way too complicated for me to dare writing it down here. In the end, `irot` will induce a very complicated update of both `cmex` and `cmin` parameters.
`clap` and `imir` are relatively easy to integrate into the intrinsic matrix in the decoder, but `irot` is a nightmare.
As disallowing any `irot` transformation is probably also no option, I think the only viable option is to specify `cmin` and `cmex` on the final image after all image transformations have been applied. That would remove the burden of calculating correct parameter updates from all the applications, which seems really important to me because I cannot see this ever being implemented consistently.
If we would say that `cmin` and `cmex` are the parameters after all image transformations, having them normalized to the image size would also finally make sense. Otherwise, if the matrices are specified before the transformations, the very first thing we have to do during decoding is to denormalize them in order to be able to apply the transformations.
Or: one could say that `cmin` and `cmex` are on the original image prior to all transformations, and any application working with those parameters has to work with the original image and not the transformed one.

However, that would imply that for each image item, the decoder API must be able to output the image either with or without any transformative properties.
@farindk you're right, I didn't fully think through how a rotation would affect the matrix outside of the trivial case of a centered principal point.
In general in HEIF the output of an image item is the input + any transformations. Surfacing image items with orientation not applied so that it can be applied at display time is really a perf optimization and not what the spec tells you should happen. This means that it can never be wrong for a parser to only surface the final output with all transforms applied, which then also strongly indicates that the metadata should describe that output, not some potentially never-surfaced input.
Some more internal discussion with relevant teams also agrees with this. I'll try to write up a contribution for the next meeting, but I think a summary of the discussion would then be:
- `cmin` and `cmex` should (or shall?) come after all transform properties. I think it makes sense to make it a shall for `irot`, `imir` and `clap`, but you never know what other kind of transform properties get added in the future, so I'm slightly wary of making it a hard shall.
- `cmin` is normalized by the image dimensions. If it comes after the transform properties, it is normalized by the image item output dimensions, not the `ispe` dimensions.
- … `cmin` and `cmex` may need to be updated in order to still be correct.
> - `cmin` and `cmex` should (or shall?)
The problem with 'should' is that I find it non-trivial how to adapt the parameters. If the reading application is expected to modify the parameters, there should at least be a note describing how that should be done.
I spent some time trying to figure out how to change the parameters based on an `irot`.
This is what I came up with:
If $E$ is the extrinsic matrix, $C$ the intrinsic matrix, and $T$ the matrix describing the `irot` by left-multiplying onto the display pipeline (as indicated in my comment above), we have this total transformation:
$p'=T C E p$
As indicated above, we cannot simply left-multiply $T$ onto $C$, as the result is not an upper-triangular matrix.
Instead, we could right-multiply C with $T' = C^{-1} T C$.
That gives $p' = C T' E p$.
Now, in order to distribute $T'$ onto $C$ and $E$, we have to do an RQ-decomposition of $T'$ and multiply the R part (upper triangular) to $C$ and the Q part (rotation) to E.
That's quite complex and error-prone. Thus, I would prefer 'shall' to remove the burden from the reader.
If there is an easier way to modify the matrices that I overlooked, please let us know.
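To illustrate the RQ-based redistribution sketched above, here is a small numerical check (made-up intrinsics, not normative; `scipy.linalg.rq` returns `R, Q` with `T' = R @ Q`):

```python
import numpy as np
from scipy.linalg import rq

# C is a hypothetical intrinsic matrix, T a 90-degree image rotation
# (translation chosen for an assumed 1280x720 image so coordinates stay
# positive).
fx, fy, cx, cy, s = 800.0, 820.0, 640.0, 360.0, 2.0
C = np.array([[fx,  s, cx],
              [0., fy, cy],
              [0., 0., 1.]])
T = np.array([[0., -1., 720.],
              [1.,  0.,   0.],
              [0.,  0.,   1.]])

# Left-multiplying T onto C destroys the upper-triangular structure:
TC = T @ C
assert not np.allclose(TC, np.triu(TC))

# Move the rotation to the right of C instead: T' = C^-1 T C, so T C = C T'.
Tp = np.linalg.inv(C) @ T @ C
assert np.allclose(T @ C, C @ Tp)

# RQ-decompose T' = R Q: the upper-triangular R folds into the intrinsics,
# the orthogonal Q folds into the extrinsics.
R, Q = rq(Tp)
C_new = C @ R                              # updated "intrinsic" matrix
assert np.allclose(C_new, np.triu(C_new))  # upper triangular again
assert np.allclose(Q @ Q.T, np.eye(3))     # Q is orthogonal
assert np.allclose(T @ C, C_new @ Q)       # total transform is preserved
```

Note that nothing guarantees `C_new` looks like a plausible calibration matrix (diagonal signs may flip, the skew changes), which underlines how error-prone it would be to leave this update to every reader.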
This could not be changed in the same HEIF amendment that introduces the properties since that's already too far along for changes.
The clarifications have instead been added to 3ed AMD1 (latest text here).
Main changes:

- … `ispe` if no transform properties are present)
- … `cmin`/`cmex` shall come after transform properties due to clause 6.5.1 which explicitly forbids this. A note has been added to the text to call this out.

One other thing to note here just so everyone is on the same page now that this is under discussion and getting implemented:
The `cmex` property describes the camera pose/viewport (i.e. camera-to-world) rather than the world-to-camera matrix. While world-to-camera tends to be used more with computer-vision use-cases, we felt that camera-to-world made more intuitive sense when you're trying to describe how a number of cameras are positioned relative to each other.

In other words, `cmex` assumes you do translation before rotation.
> In other words, `cmex` assumes you do translation before rotation.
Can you specify this accurately mathematically?

A verbal description can be interpreted in many different ways. For example, I would interpret 'translation before rotation' to mean that the camera, which is initially at the origin, is first translated to another position and then rotated around the origin, moving it to a completely different location. This is probably not what you are doing, right?
> In other words, `cmex` assumes you do translation before rotation.
>
> Can you specify this accurately mathematically? A verbal description can be interpreted in many different ways. For example, I would interpret 'translation before rotation' to mean that the camera, which is initially at the origin, is first translated to another position and then rotated around the origin, moving it to a completely different location. This is probably not what you are doing, right?
No. I think the text covers the mathematics pretty clearly, I just wanted to call out that `cmex` is camera-to-world rather than world-to-camera, since that is something that has been frequently asked.
The latest public document I can find is this, but it should be very similar/identical to the final text.
In particular these two snippets should cover the mathematics:
Ok, so according to the equations, the "translation before rotation" means that we take the camera at its position in the world coordinate system, move it to the origin by the provided translation vector, and then rotate it at the origin, which makes sense.
So, in fact, you are sending the inverse of the extrinsic matrix.
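A minimal sketch of that relationship (illustrative numbers only, not the actual `cmex` syntax): the signalled camera-to-world pose and the computer-vision world-to-camera extrinsic matrix are inverses of each other.

```python
import numpy as np

# Camera-to-world (pose): p_world = R_cw @ p_cam + c, with c the camera
# center in world coordinates. World-to-camera (classic extrinsics):
# p_cam = R_cw.T @ p_world - R_cw.T @ c. All numbers below are made up.
theta = np.deg2rad(30.0)
R_cw = np.array([[np.cos(theta), -np.sin(theta), 0.],
                 [np.sin(theta),  np.cos(theta), 0.],
                 [0., 0., 1.]])
c = np.array([0.5, -0.2, 1.0])

def to_hom(R, t):
    """Pack a rotation and translation into a 4x4 homogeneous matrix."""
    M = np.eye(4)
    M[:3, :3] = R
    M[:3, 3] = t
    return M

pose = to_hom(R_cw, c)                   # camera-to-world
extrinsic = to_hom(R_cw.T, -R_cw.T @ c)  # world-to-camera, its inverse

# The two matrices are inverses of each other:
assert np.allclose(pose @ extrinsic, np.eye(4))
# A world point at the camera center lands at the camera origin:
assert np.allclose((extrinsic @ np.append(c, 1.0))[:3], 0.0)
```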
In w22940, the definition of the camera intrinsic matrix property box (`cmin`) includes the focal length, separated into $f_x$ and $f_y$ and normalized by the image size. Moreover, transmission of $f_y$ is optional, and if it is not transmitted, $f_y = f_x$ should be used (Note 1). This is problematic as detailed below.
As background, the focal length specifies the distance of the image plane (sensor) from the focal center. A general form of the intrinsic matrix is

$$C = \begin{pmatrix} f & s & p_x \\ 0 & a f & p_y \\ 0 & 0 & 1 \end{pmatrix}$$

where $f$ denotes the focal length and $a$ the aspect ratio of pixels, which is 1 for square pixels. (Principal point $p_x$, $p_y$, and skew $s$ are not relevant here.) There is no such thing as a horizontal and vertical focal length, but one can sometimes still come across the $f_x$, $f_y$ naming scheme when $a \cdot f$ is combined into an $f_y$.
It makes sense to specify the focal length in pixels (unlike the typical 'mm' used in photography) and to normalize it to the image size to keep it independent from the image resolution.
However, the problem is that when the `cmin` box is read and `(flags & 1)==false`, one should set $f_y=f_x$. However, this is not possible without knowing the image size. If a reader wants to output the focal length in unnormalized pixel units, it is clear that it needs to know the image size, but even if the reader wants to output a normalized focal length (originally meant to be independent of the image size), it still has to know the image size to compute `focal_length_y` according to $f_y=f_x$ as `focal_length_y = focal_length_x * image_width / image_height`. This means that a `cmin` box cannot be interpreted independently from an `ispe` box.

There might be two options to change this:
- …
- The `cmin` box instead sends $f$ and $a$. $f$ can be normalized to any image dimension (e.g. image width), and $a$ would simply be 1 if not transmitted in the case `(flags & 1)==false`.

Both of these options would make it possible to read the `cmin` box without referencing the `ispe` box, and they also keep the focal length independent of the image resolution.

I assume that the case `(flags & 1)==true`, where $f_y$ and skew are transmitted, is currently not used much, because they are only relevant when there are non-square or skewed pixels. Thus, a non-breaking last-minute change to the definition might still be possible.
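The coupling can be sketched in a few lines (hypothetical field names, assuming $f_x$ is normalized by the image width and $f_y$ by the height, which is what the formula above implies):

```python
# Sketch of the dependency described above; this is NOT actual cmin parsing
# code. With (flags & 1) == 0, only focal_length_x is transmitted, yet
# recovering focal_length_y -- even in normalized form -- already requires
# the image dimensions from the ispe box.

def denormalize_cmin(focal_length_x, image_width, image_height,
                     flags=0, focal_length_y=None):
    """Return (f_x in pixels, f_y in pixels, normalized f_y)."""
    fx_px = focal_length_x * image_width           # f_x normalized by width
    if flags & 1:
        fy_px = focal_length_y * image_height      # f_y normalized by height
    else:
        # f_y = f_x holds in *pixel* units, so the normalized value
        # depends on the image aspect ratio:
        fy_px = fx_px
        focal_length_y = focal_length_x * image_width / image_height
    return fx_px, fy_px, focal_length_y

fx_px, fy_px, fy_norm = denormalize_cmin(0.8, 1920, 1080)
assert fx_px == fy_px == 0.8 * 1920
assert fy_norm == 0.8 * 1920 / 1080
```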