blairmacintyre opened this issue 4 years ago (status: Open)
Correct - for the initial prototype, we wanted to explore the best way to surface the required information across processes internally, to see if we run into any issues and to check what constraints we have. The API shape is still very much TBD (there are other issues with it as well, related to the texture's lifetime). I took a look at the article, and I really like the idea of using an Anchor to work around the asynchronous access issue. Let me see how to best incorporate it in the proposal.
HoloLens 2's native support for this kind of thing generally has the developer start from the media APIs to capture photos/videos, and we then annotate individual frames with the view/projection matrices.
The front-facing photo/video camera on HoloLens 2 happens to have the same frame rate as the app's render rate, though there is no promise that camera frames arrive synchronized with the time at which we wake up the app. In fact, because of forward head prediction, the target pose of the views in any given frame will never exactly match the pose at which a camera frame is being captured, since the head is not there yet!
We have an interesting design problem here to align an API across two scenarios: camera images that align with a given XRFrame (as on AR phones/tablets), and camera images captured asynchronously through the media capture APIs (as on headsets like HoloLens 2). For the latter, I would expect to augment something like ImageCapture.grabFrame(), perhaps with a WebXR method that lets the app get a view transform and projection matrix for the resulting ImageBitmap and its implicit timestamp:
    partial interface XRSession {
      XRCameraPose? getCameraPose(ImageBitmap bitmap);
    };

    [SecureContext, Exposed=Window]
    interface XRCameraPose {
      readonly attribute Float32Array projectionMatrix;
      [SameObject] readonly attribute XRRigidTransform transform;
    };
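To make that shape concrete, usage could look roughly like the sketch below. Only getUserMedia() and ImageCapture are existing Media Capture APIs; getCameraPose() is the proposal above, and session and runComputerVision() are assumed stand-ins.

```js
// Sketch only: getCameraPose() is the proposed method above, not shipped API.
// `session` is assumed to be an active immersive XRSession and
// runComputerVision() is a hypothetical app callback.
async function processOneCameraFrame(session) {
  const stream = await navigator.mediaDevices.getUserMedia({ video: true });
  const imageCapture = new ImageCapture(stream.getVideoTracks()[0]);

  const bitmap = await imageCapture.grabFrame();     // asynchronous camera frame
  const cameraPose = session.getCameraPose(bitmap);  // proposed WebXR lookup
  if (cameraPose) {
    // Use the camera-specific matrices rather than the primary view's:
    runComputerVision(bitmap, cameraPose.transform.matrix, cameraPose.projectionMatrix);
  }
}
```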
We would likely need an additional mechanism when the app spins up that MediaStream to turn camera poses on, so that WebXR can start flowing camera poses to/from the WebXR process for that MediaStream.
Do we think it would be possible to come up with an API shape that caters well to both synchronous and asynchronous scenarios? It seems to me that the synchronous scenario could fit into the asynchronous model but would be weakened by it (in the synchronous model we deliver a frame within the rAFcb, and the app could use it when rendering the scene since the frame would be animated).
In general, it seems that we have 3 cases regarding frame rate:
1. the camera frame rate matches the XR frame rate,
2. the camera frame rate is higher than the XR frame rate,
3. the camera frame rate is lower than the XR frame rate.
We also have to consider the time delta between the camera and animation frame for which that camera frame is relevant:
delta = frame.time - cameraImage.time
Can that delta ever be negative, i.e. is it ever possible for the animation frame to be delivered before we would be able to deliver the camera image to the app? If "no", then converting from the asynchronous model to a synchronous API shape seems to be possible - we would always deliver the most recent camera image that we have to the rAFcb. For case 2) it means we're dropping some camera frames, and for case 3) it means some rAFcbs won't get a camera image. We will also need to somehow communicate the delta to the app; in the synchronous scenario it would be 0.
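As a sketch of what that synchronous-shaped delivery might look like from the app's side (frame.camera, its time attribute, and the staleness threshold are all hypothetical, following the naming used above):

```js
// Hypothetical rAF callback where the UA attaches the most recent camera image
// it has, plus its capture time; delta would be 0 in the fully synchronous case.
const MAX_ACCEPTABLE_STALENESS_MS = 33;     // arbitrary app-chosen threshold

function onXRFrame(time, frame) {
  frame.session.requestAnimationFrame(onXRFrame);
  const cameraImage = frame.camera;         // hypothetical; may be absent in case 3)
  if (cameraImage) {
    const delta = time - cameraImage.time;  // frame.time - cameraImage.time
    if (delta <= MAX_ACCEPTABLE_STALENESS_MS) {
      processCameraImage(cameraImage);      // hypothetical app callback
    }
  }
}
```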
If "no", (...)
Based on today's IW call, it looks like that's actually a "yes" - I did not understand your comment about forward head prediction.
Pinging @manishearth, @blairmacintyre and @cabanier who may have thoughts here...
There are two key kinds of scenarios I know of here for these images on AR devices:

- rendering effects that incorporate the camera image (only meaningful for views with an environment blend mode of "alpha-blend")
- computer vision processing of the camera image

In practice, apps doing rendering effects really need to ensure they condition those effects to only kick in for views with an environment blend mode of "alpha-blend" (i.e. the primary view on AR phones/tablets, the secondary MRC view on HoloLens), and for best performance will need some way to signal to the UA that they have fully rendered the background already, and so no UA composition of the background image is required.
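For example, an app could condition its background effect pass roughly like this; XRSession.environmentBlendMode is existing WebXR API, while the render helpers are assumed app-side names (and the opt-out signal mentioned above does not exist yet):

```js
// Only run our own camera-based background effect when the UA alpha-blends the
// rendered content over a camera image (AR phones/tablets, MRC-style views).
function renderFrame(session, views) {
  const cameraBacked = session.environmentBlendMode === "alpha-blend";
  for (const view of views) {
    if (cameraBacked) {
      renderCameraBackgroundEffect(view);   // hypothetical app effect pass
    }
    renderScene(view);                      // hypothetical app scene pass
  }
}
```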
Given that, perhaps it's not so bad for rendering effects and computer vision to be handled two different ways in the API?
- For computer vision: session.getCameraPose(imageCapture.grabFrame()) (after opting in to have your MediaStream be "WebXR-compatible" or such)
- For rendering effects: binding.getCameraImage(view.camera)
Each scenario is then consistent across the devices where it applies.
Thoughts?
Is it fair to say that the main problem with the currently proposed API is that it may force some implementations to deliver camera images that are slightly out of date, but since they are available on XRFrame, the apps could fall into a trap & assume that those images align with the viewer pose / views? The goal here was to provide a minimal API shape that would enable the smartphone use case, but I was hoping that the introduced interfaces could be extended to make the pose / timing mismatch explicit. Strawperson idea:
    partial interface XRFrame {
      // Contains all the cameras, including the ones that are already
      // exposed on XRViews (those would align exactly with the XRViews).
      readonly attribute FrozenArray<XRCamera> cameras;

      XRCameraPose getCameraPose(XRCamera camera, XRReferenceSpace space);
    };
    interface XRCameraPose : XRPose {
      // Inherits all the goodness from XRPose, including velocities.
      // For cameras exposed on the views, the pose relative to viewer space
      // would be identity, and `time` matches the XRFrame time.
      readonly attribute DOMHighResTimeStamp time;
    };
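A sketch of how an app might consume this strawperson (everything below is the proposed, unshipped shape; referenceSpace and processCameraImage() are assumed app-side names):

```js
// Hypothetical: enumerate all cameras on the frame, including ones that are not
// aligned with any XRView, and account for how stale each image is.
function onXRFrame(time, frame) {
  frame.session.requestAnimationFrame(onXRFrame);
  for (const camera of frame.cameras) {
    const cameraPose = frame.getCameraPose(camera, referenceSpace);
    const delta = time - cameraPose.time;   // 0 for view-aligned cameras
    processCameraImage(camera, cameraPose.transform, delta);
  }
}
```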
This would be a non-breaking change - the existing users of the raw camera access API would use the XRView variant that offers stronger guarantees about timings (the XRCameraPose.time would match XRFrame.time) and poses & projection matrices (the camera image matches the XRView), while also hopefully catering for both scenarios. The problem that I see here is that there may be cases where we'll be delivering camera images a bit later than we potentially could (but if I understand the comment about forward head pose prediction correctly, that may not actually be a problem for HoloLens).
> Is it fair to say that the main problem with the currently proposed API is that it may force some implementations to deliver camera images that are slightly out of date, but since they are available on XRFrame, the apps could fall into a trap & assume that those images align with the viewer pose / views?
On headsets, it would be wrong to use the primary view's view and projection matrices - the app must use not just the camera-specific view pose, but also a camera-specific projection matrix.
For example, HoloLens 2 has a wildly different photo/video camera FOV vs. primary view FOV. If the app does any rendering or CV on the camera image using the primary view's FOV, it will be completely wrong, and so the app would also need to grab a projection matrix from XRCameraPose.
For devices where the system's XR render cadence and the camera's capture cadence accidentally align, the approach you suggest above could work if we add projectionMatrix to XRCameraPose. However, there are still some gotchas:
One interesting thing to note about your latest proposal here is that it still offers apps two alternate paths to the XRCameraPose type, one through a given XRView and one through separate enumeration of an XRCamera. If we are comfortable offering two paths, perhaps it is OK for them to differ a bit more given the differences noted above:
    //////
    // For temporally and spatially view-aligned camera images:
    //////

    partial interface XRView {
      // Non-null iff there exists an associated camera that perfectly aligns with the view:
      [SameObject] readonly attribute XRCamera? camera;
    };

    interface XRCamera {
      // Dimensions of the camera image:
      readonly attribute long width;
      readonly attribute long height;
    };

    partial interface XRWebGLBinding {
      // Access to the camera texture itself:
      WebGLTexture? getCameraImage(XRCamera camera);
    };

    // TBD mechanism for app to opt out of automatic environment underlay
    // if app is rendering a full-screen effect already

    //////
    // For asynchronous media-based camera images:
    //////

    // Mechanism for app to opt into XR hologram overlay and/or XR poses
    // for a given XRSession during MediaDevices.getUserMedia(constraints):
    dictionary XRMediaTrackConstraints : XRMediaTrackConstraintSet {
      sequence<XRMediaTrackConstraintSet> advanced;
    };

    dictionary XRMediaTrackConstraintSet {
      // Enable secondary view for XRSession to enable system-composited hologram overlay:
      ConstrainBoolean xrSecondaryViewOverlay;

      // Enable getting per-frame camera view/projection matrices:
      ConstrainBoolean xrPoses;
    };

    partial interface XRSession {
      XRCameraPose? getCameraPose(ImageBitmap bitmap);
    };

    interface XRCameraPose {
      readonly attribute Float32Array projectionMatrix;
      [SameObject] readonly attribute XRRigidTransform transform;
    };
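To make the first, view-aligned path concrete, per-frame usage might look roughly like this (view.camera and getCameraImage() are the proposed shape; XRWebGLBinding is existing WebXR Layers API, and gl, referenceSpace, and the render helpers are assumed app-side names):

```js
// Hypothetical per-frame use of the view-aligned camera path (AR phones/tablets).
const glBinding = new XRWebGLBinding(session, gl);

function onXRFrame(time, frame) {
  frame.session.requestAnimationFrame(onXRFrame);
  const viewerPose = frame.getViewerPose(referenceSpace);
  if (!viewerPose) return;

  for (const view of viewerPose.views) {
    if (view.camera) {                       // non-null only if perfectly aligned
      const cameraTexture = glBinding.getCameraImage(view.camera);
      if (cameraTexture) {
        // The image aligns with this view's transform and projectionMatrix:
        renderCameraEffect(cameraTexture, view, view.camera.width, view.camera.height);
      }
    }
    renderScene(view);
  }
}
```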
With either your proposal or this proposal, an app that just powers ahead to fetch view.camera will only work on smartphones and tablets anyway, and so we'd still have sites that use this API failing to fetch camera images on headsets. With your proposal, we could encourage engines to generally ignore the per-view approach and enumerate frame.cameras instead, which would support headsets too. However, let's dig in on whether that would be a good or bad thing for app compat across devices:
1. Rendering effects: these only apply where view.camera is non-null.
2. Computer vision: apps could instead use the MediaDevices.getUserMedia approach to reduce latency on headsets. I am not clear whether apps doing computer vision rather than primary view rendering otherwise benefit from the association of a given camera image with a given XRFrame.
3. Photo/video capture: apps would use MediaDevices.getUserMedia to configure the camera settings and capture that photo or video. Manually encoding a video with overlaid holograms from per-frame WebXR camera images and a separately captured audio track is likely to be highly non-trivial. The easiest path to synchronized video and audio with overlaid holograms is likely to just have the app opt into some xrSecondaryViewOverlay media track constraint that would then enable a WebXR secondary view for that camera, with the system then taking care of the hologram overlay. (See the HoloLens Mixed Reality Capture opt-in docs for more info about the MixedRealityCaptureVideoEffect apps can add to a Media Foundation stream to record a video with correctly-positioned hologram overlays.)

Generally, I'm a huge fan of us fully unifying a given WebXR scenario across phones, tablets and headsets to enable maximum content compatibility! However, across the three scenarios above, only scenario 1 benefits from the per-XRFrame design, and that scenario is not relevant for headsets (unless we supported it for secondary views for optional render effects during scenario 3). For scenarios 2 and 3, the simplest and most robust path that is most aligned with non-XR computer vision would be the xrSecondaryViewOverlay and/or xrPoses constraints that can be used with MediaDevices.getUserMedia, which UAs could then support across phones, tablets and headsets.
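To illustrate that media-based path, here is a sketch combining the proposed constraints with standard getUserMedia()/ImageCapture (the xr* constraints and getCameraPose() are the proposed shape only; runComputerVision() is an assumed app callback):

```js
// Hypothetical: opt the camera stream into WebXR poses and system-composited
// hologram overlay, then associate grabbed frames with camera poses.
async function startXrAwareCapture(session) {
  const stream = await navigator.mediaDevices.getUserMedia({
    video: {
      xrPoses: true,                 // proposed: flow per-frame camera poses
      xrSecondaryViewOverlay: true,  // proposed: enable secondary view + overlay
    },
  });
  const imageCapture = new ImageCapture(stream.getVideoTracks()[0]);

  const bitmap = await imageCapture.grabFrame();
  const cameraPose = session.getCameraPose(bitmap);   // proposed
  if (cameraPose) {
    runComputerVision(bitmap, cameraPose.transform, cameraPose.projectionMatrix);
  }
}
```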
(hi @thetuvix @bialpio!)
At one point in the distant past, we talked about having an async API for delivering frames, but explicitly guaranteeing that if the frames were synchronous with the view (smartphones), they would be delivered before the rAFcb... each frame would have "all the info" (timestamp, view, projection, the ability to request a GL texture or bytes). There would likely need to be a way for the app to determine this was happening (e.g., a capability or property?).
It has its downsides, but it might simplify things.
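One way such an async shape could look, purely to make the idea concrete (every name below is hypothetical):

```js
// Hypothetical callback-based camera frame delivery. On devices where camera
// frames are synchronous with rendering, the UA would invoke this before the
// matching rAF callback; elsewhere frames arrive on the camera's own cadence.
session.requestCameraFrame(cameraFrame => {
  const { timestamp, viewTransform, projectionMatrix } = cameraFrame; // "all the info"
  const texture = cameraFrame.getTexture(gl);   // or cameraFrame.getBytes()
  runComputerVision(texture, viewTransform, projectionMatrix, timestamp);
});

// A capability/property would let the app detect the synchronous guarantee:
const camerasAreSynchronous = session.cameraFramesAlignWithAnimationFrames === true;
```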
Given how many variables the UA and the author need to account for (frame rate, FOV, camera frame rate, camera characteristics, etc.), wouldn't it be better to not spend too much time on raw camera access but instead build out native APIs for computer vision? I worry that, given all the variables, authors will just test on popular devices, and this might put smaller vendors at a disadvantage.
In addition, the number one complaint with WebXR is performance. If we push CV and sync compositing to JavaScript, I doubt that authors can build good experiences on immersive devices that are built on mobile technologies.
Based on the discussion, I think the best way to proceed would be to plan on having 2 distinct API shapes that are split based on the scenarios they are solving. In Chrome for Android, we will pursue the per-XRFrame API in a way that I have roughly outlined in the explainer, and I will create an initial specification draft that covers only this variant. This will hopefully avoid the feedback about having a specification with no implementations (as the gUM() variant would not be implemented by anyone). Additionally, for smartphone AR, we may be able to provide the camera access API via the gUM() route at a later time as well, depending on the technical difficulty and the bandwidth we have available.
> (...) wouldn't it be better to not spend too much time on raw camera access but instead build out native APIs for computer vision?
I think this was largely covered during the last IW call - the need for a way of accessing the camera is there, and the consumers of the API seem to be willing to pay the performance cost for this. That is not to say that we can't work on the native APIs for CV, but the problem with that approach is that without the camera access API, we're preventing people from innovating and trying various things that we may not even think about now (I seem to recall "shoe detection" being brought up? :smiley:). As a bonus, we could influence what we're working on based on what solutions crop up in the wild - if it turns out some scenarios / algorithms are very common, we can try to make them a part of the platform.
I see. So is this just an API for experimentation and not meant to ship out to the general public?
It is meant to be shipped to the general public, who will then amaze us with what they are able to come up with, and hopefully give us a chance to learn from those experiments & influence what CV algorithms we can then attempt to standardize. At this point, based on what @nbutko shared during the last IW call, I'm worried that we're hurting developers by not giving them access to the camera pixels in a way that could be decorated with information coming from WebXR. And, given that we prioritized & launched the privacy-preserving APIs first, developers should be able to pick those, and only choose the raw camera access API once they are forced to.
Now that the discussion has slowed down a bit and the dust has settled, I think this is a good moment to archive this repository and move future discussions to the https://github.com/immersive-web/raw-camera-access/ repo.
Devices that do not have a 1:1 mapping of camera frames to XR frames cannot implement this synchronous API approach, so it's only implementable on handheld AR (e.g., phones, tablets).
VR and AR HMDs that have cameras do not run the camera at the same frame rate as the XR API.
See, for example, the discussion of the implementation we did 2 years ago of an async API in the WebXR Viewer (https://blog.mozvr.com/experimenting-with-computer-vision-in-webxr/).
Also, see the discussion of why asynchronous access is needed in the CV repo you link to in the explainer. There is nothing to prevent promises from resolving immediately on a handheld platform, but any WebXR camera API needs to support all devices.