Definition of `cmin` property box

MPEGGroup / FileFormat

MPEG file format discussions

24 stars 0 forks source link

Definition of `cmin` property box #102

Open farindk opened 5 months ago

farindk commented 5 months ago

In w22940, the definition of the camera intrinsic matrix property box (cmin) includes the focal length, separated into $f_x$ and $f_y$ and normalized by the image size:

f_x = focal_length_x × image_width / denominator
f_y = focal_length_y × image_height / denominator

Moreover, transmission of f_y is optional and if it is not transmitted, f_y = f_x should be used (Note 1). This is problematic as detailed below.

As a background, the focal length specifies the distance of the image plane (sensor) from the focal center. A general form of the intrinsic matrix is Where $f$ denotes the focal length and $a$ the aspect ratio of pixels, which is 1 for square pixels. (Principal point $p_x$, $p_y$, and skew $s$ are not relevant here.) There is no such thing like a horizontal and vertical focal length, but one can sometimes still come across the f_x, f_y naming scheme when $a\cdot y$ is combined to a $f_y$.

It makes sense to specify the focal length in pixels (unlike the typical 'mm' used in photography) and to normalize it to the image size to keep it independent from the image resolution.

However, the problem is that when the cmin box is read and (flags & 1)==false, one should set f_y=f_x. However, this is not possible without knowing the image size. If a reader wants to output the focal length in unnormalized pixel units, it is clear that it needs to know the image size, but even if the reader wants to output a normalized focal length (originally meant to be independent of the image size), it still has to know the image size to compute focal_length_y according to $f_y=f_x$ as focal_length_y = focal_length_x * image_width / image_height. This means that a cmin box cannot be interpreted independently from a ispe box.

There might be two options to change this:

f_x and f_y could both be normalized to the same image dimension, e.g. image_width. Since f_x, f_y have nothing to do with the image size itself, this should not be a problem.
Instead of sending f_x and f_y, the cmin box instead sends $f$ and $a$. $f$ can be normalized to any image dimension (e.g. image width) and $a$ would simply be 1 if not transmitted in the case (flags & 1)==false.

Both of these options would make it possible to read the cmin box without referencing the ispe box and they also keep the focal length independent of the image resolution.

I assume that the case (flags & 1)==true where f_y and skew and transmitted, is currently not used much, because they are only relevant when there are non-square or skewed pixels. Thus, a non-breaking last-minute change to the definition might still be possible.

leo-barnes commented 5 months ago

Another alternative would be to require cmin to come after ispe. We already do that for transformative properties. That way you are always guaranteed that you can de-normalize when you're parsing.

farindk commented 5 months ago

Another alternative would be to require cmin to come after ispe. We already do that for transformative properties. That way you are always guaranteed that you can de-normalize when you're parsing.

In my view, this would not help much because relying on a specific ordering and passing around the previously decoded boxes would make the implementation rather awkward. It would also somehow negate the intention of having the focal length resolution independent. Such an ordering constraint will also be difficult to enforce as there might be software out there that does not care about cmin and will reorganize or "optimize" the property boxes and thereby shuffle the ordering.

We can solve anything in software. Also the current definition is no showstopper. We can make it work. But a different definition would make things cleaner and less error-prone.

farindk commented 4 months ago

Another alternative would be to require cmin to come after ispe. We already do that for transformative properties. That way you are always guaranteed that you can de-normalize when you're parsing.

Yes, I think that cmin should be restricted to come after ispe to follow the principle that these can be processed in that order. For cmex, that is not required, but it should still come before the transformative properties as irot will imply a modification of cmex if I understand this correctly.

leo-barnes commented 4 months ago

Good comments, I definitely think this should be clarified in the text.

Gut feelings on cmin:

We should add a SHALL that cmin comes after ispe.
The intrinsics are linked to the sensor, so it make most sense for it to be linked to the image prior to any transformative properties. The text already says that the fields are normalized by the ispe fields (irrespective of any transforms), so I think this is already covered but could be explicitly called out.
If an image item has cmin and transformative properties like rotation, a parser may need to apply those to the intrinsics as well when surfacing the metadata to the caller. A 90 degree rotation will flip fx/fy and cx/cy for example.

Gut feelings on cmex: I think the interactions of cmex and irot/imir need to be carefully considered before we impose any restrictions. My immediate thought is that specifying it after transforms makes the most sense at least in some use-cases.

If we start with how current smartphones work, taking a photo in LandscapeLeft orientation or LandscapeRight orientation makes no difference. The images are captured in sensor orientation but get the appropriate irot added so that they display the same.

If the smartphone has multiple sensors and could capture stereo video or stills, I would expect the same to hold. Images are captured in sensor orientation, with the appropriate irot applied so that the images are always displayed the same regardless of if LandscapeLeft or LandscapeRight orientation was used. If we name our sensors A and B, I would expect the stereo pair group to be [A,B] in one orientation and [B,A] in the other. I would also expect the extrinsics to match the information of the stereo pair group. In other words if the group says [A,B], the extrinsics say that A is to the left of B and vice versa. This to me would indicate that the extrinsics should come after irot.

The counterargument would be that you don't want to use this for stereo capture at all and simply want to document your sensor setup. In that case it probably makes most sense to specify it before transformative properties.

farindk commented 4 months ago

The intrinsics are linked to the sensor, so it make most sense for it to be linked to the image prior to any transformative properties. The text already says that the fields are normalized by the ispe fields (irrespective of any transforms), so I think this is already covered but could be explicitly called out.

Yes, that makes sense, and I also would favor this, but see below.

If an image item has cmin and transformative properties like rotation, a parser may need to apply those to the intrinsics as well when surfacing the metadata to the caller. A 90 degree rotation will flip fx/fy and cx/cy for example.

Yes, if the cmin refers to the original sensor, all transformative properties have to update the intrinsics accordingly. However, this is much more complicated.

A 90 degree rotation will flip fx/fy and cx/cy for example.

No, this is wrong.

An imir 90-degree image rotation corresponds to a transformation matrix like this:

Hence, we have to left-multiply this onto the transform chain (E is the extrinsic matrix):

Now if we combine the irot and intrinsic matrices, we get this funny result: Which is clearly not a proper intrinsic matrix anymore.

The right way to handle irot instead would be to multiply the extrinsic matrix by a 90 degree camera rotation. However, on second thought, this is also much more complicated because the irot rotation is not around the optical axis (going through the principal point), but through the plain image center. And the whole thing also interacts with clap. It is way too complicated that I dare to try writing it down here. In the end, irot will induce a very complicated update of both cmex and cmin parameters.

clap and imir are relatively easy to integrate into the intrinsic matrix in the decoder, but irot is a nightmare.

As disallowing any irot transformation is probably also no option, I think the only viable option is to specify cmin and cmex on the final image after all image transformations have been applied. That would remove the burned to calculate correct parameter updates from all the applications, which seems really important to me because I cannot see that this would ever be implemented consistently.

If we would say that cmin and cmex are the parameters after all image transformations, having them normalized to the image size also would finally make sense. Otherwise, if the matrices are specified before the transformations, the very first thing we have to do during decoding is to denormalize them in order to be able to apply the transformations.

farindk commented 4 months ago

Or: one could say that cmin and cmex are on the original image prior to all transformations, and any application working with those parameters has to work with the original image and not the transformed one. However, that would imply that for each image item, the decoder API must be able to output the image either with or without any transformative properties.

leo-barnes commented 4 months ago

@farindk you're right, I didn't fully think through how a rotation would affect the matrix outside of the trivial case of a centered principal point.

In general in HEIF the output of an image item is the input + any transformations. Surfacing image items with orientation not applied so that it can be applied at display time is really a perf optimization and not what the spec tells you should happen. This means that it can never be wrong for a parser to only surface the final output with all transforms applied, which then also strongly indicates that the metadata should describe that output, not some potentially never-surfaced input.

Some more internal discussion with relevant teams also agrees with this. I'll try to write up a contribution for the next meeting, but I think a summary of the discussion would then be:

cmin and cmex should (or shall?) come after all transform properties. I think it makes sense to make it a shall for irot, imir and clap, but you never know what other kind of transform properties get added in the future, so I'm slightly wary with making it a hard shall.
cmin is normalized by the image dimensions. If it comes after the transform properties it is normalized by the image item output dimensions, not the ispe dimensions.
An editor that applies any kind of transform properties to an existing image needs to take care that cmin and cmex may need to be updated in order to still be correct.

farindk commented 4 months ago

cmin and cmex should (or shall?)

The problem with 'should' is that I find it non-trivial how to adapt the parameters. If it is expected by the reading application to modify the parameters, there should at least be a note describing how that should be done.

I spent some time trying to figure out how to change the parameters based on an irot. This is what I came up with: If $E$ is the extrinsic matrix, $C$ the intrinsic matrix, and $T$ the matrix describing the irot by left multiplying onto the display pipeline (as indicated in my comment above), we have this total transformation: $p'=T C E p$ As indicated above, we cannot simply left-multiply T onto C as the result is no upper-triangular matrix. Instead, we could right-multiply C with $T' = C^{-1} T C$. That gives $p' = C T' E p$. Now, in order to distribute $T'$ onto $C$ and $E$, we have to do an RQ-decomposition of $T'$ and multiply the R part (upper triangular) to $C$ and the Q part (rotation) to E.

That's quite complex and error-prone. Thus, I would prefer 'shall' to remove the burden from the reader.

If there is an easier way to modify the matrices that I overlooked, please let us know.

leo-barnes commented 3 months ago

This could not be changed in the same HEIF amendment that introduces the properties since that's already too far along for changes.

The clarifications have instead been added to 3ed AMD1 (latest text here).

Main changes:

Clarify that extrinsics and intrinsics document the output image item (i.e. after orientation is applied)
Clarify that intrinsics are normalized by output image dimensions (which corresponds to ispe if no transform properties are present)
Unfortunately we can't say that cmin/cmex shall come after transform properties due to clause 6.5.1 which explicitly forbids this. A note has been added to the text to call this out.

leo-barnes commented 3 months ago

One other thing to note here just so everyone is on the same page now that this is under discussion and getting implemented:

The cmex property describes the camera pose/viewport (i.e. camera-to-world) rather than the world-to-camera matrix.

While world-to-camera tends to be used more with computer vision use-cases, we felt that camera-to-world made more intuitive sense when you're trying to describe how a number of cameras are positioned relative to each other.

In other words, cmex assumes you do translation before rotation.

farindk commented 3 months ago

In other words, cmex assumes you do translation before rotation.

You can specify this accurately mathematically?

A verbal description can be interpreted in many different ways. For example, I would interpret 'translation before rotation' that the camera, which is initially at the origin is first translated to another position and then rotated around the origin, moving it to a completely different location. This is probably not what you are doing, right?

leo-barnes commented 3 months ago

In other words, cmex assumes you do translation before rotation.

You can specify this accurately mathematically?

A verbal description can be interpreted in many different ways. For example, I would interpret 'translation before rotation' that the camera, which is initially at the origin is first translated to another position and then rotated around the origin, moving it to a completely different location. This is probably not what you are doing, right?

No. I think the text covers the mathematics pretty clearly, I just wanted to call out that cmex is camera-to-world rather than world-to-camera since that is something that has been frequently asked.

The latest public document I can find is this, but it should be very similar/identical to the final text.

In particular these two snippets should cover the mathematics:

farindk commented 3 months ago

Ok, so according to the equations, the "translation before rotation" means that we take the camera at its position in the world coordinate system, move it to the origin by the provided translation vector, and then rotate it at the origin, which makes sense.

So, in fact, you are sending the inverse of the extrinsic matrix.