blairmacintyre opened this issue 4 years ago (status: Open)
Correct - for the initial prototype, we wanted to explore the best way to surface the required information across processes internally, to see if we run into any issues and to check what constraints we have. The API shape is still very much TBD (there are other issues with it as well, related to the texture's lifetime). I took a look at the article, and I really like the idea of using an Anchor to work around the asynchronous access issue. Let me see how to best incorporate it in the proposal.
HoloLens 2's native support for this kind of thing generally has the developer start from the media APIs to capture photos/videos, and we then annotate individual frames with the view/projection matrices.
The front-facing photo/video camera on HoloLens 2 happens to have the same frame rate as the app's render rate, though there is no promise that camera frames arrive synchronized with the time at which we wake up the app. In fact, because of forward head prediction, the target pose of the views in any given frame will never exactly match the pose at which a camera frame is being captured, since the head is not there yet!
We have an interesting design problem here to align an API across two scenarios: camera images that align with a given XRFrame (as on AR phones/tablets), and camera images captured asynchronously through the media capture APIs (as on headsets like HoloLens 2). For the latter, I would expect to augment something like ImageCapture.grabFrame(), perhaps with a WebXR method that lets the app get a view transform and projection matrix for the resulting ImageBitmap and its implicit timestamp:
    partial interface XRSession {
      XRCameraPose? getCameraPose(ImageBitmap bitmap);
    };

    [SecureContext, Exposed=Window]
    interface XRCameraPose {
      readonly attribute Float32Array projectionMatrix;
      [SameObject] readonly attribute XRRigidTransform transform;
    };
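To make that shape concrete, usage could look roughly like the sketch below. Only getUserMedia() and ImageCapture are existing Media Capture APIs; getCameraPose() is the proposal above, and session and runComputerVision() are assumed stand-ins.

```js
// Sketch only: getCameraPose() is the proposed method above, not shipped API.
// `session` is assumed to be an active immersive XRSession and
// runComputerVision() is a hypothetical app callback.
async function processOneCameraFrame(session) {
  const stream = await navigator.mediaDevices.getUserMedia({ video: true });
  const imageCapture = new ImageCapture(stream.getVideoTracks()[0]);

  const bitmap = await imageCapture.grabFrame();     // asynchronous camera frame
  const cameraPose = session.getCameraPose(bitmap);  // proposed WebXR lookup
  if (cameraPose) {
    // Use the camera-specific matrices rather than the primary view's:
    runComputerVision(bitmap, cameraPose.transform.matrix, cameraPose.projectionMatrix);
  }
}
```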
We would likely need an additional mechanism when the app spins up that MediaStream to turn camera poses on, so that WebXR can start flowing camera poses to/from the WebXR process for that MediaStream.
Do we think it would be possible to come up with an API shape that caters well to both synchronous and asynchronous scenarios? It seems to me that the synchronous scenario could fit into the asynchronous model but would be weakened by it (in the synchronous model we deliver a frame within the rAFcb, and the app could use it when rendering the scene since the frame would be animated).
In general, it seems that we have 3 cases regarding frame rate:
1. the camera frame rate matches the XR frame rate,
2. the camera frame rate is higher than the XR frame rate,
3. the camera frame rate is lower than the XR frame rate.
We also have to consider the time delta between the camera and animation frame for which that camera frame is relevant:
delta = frame.time - cameraImage.time
Can that delta ever be negative, i.e. is it ever possible for the animation frame to be delivered before we would be able to deliver the camera image to the app? If "no", then converting from the asynchronous model to a synchronous API shape seems to be possible - we would always deliver the most recent camera image that we have to the rAFcb. For case 2) it means we're dropping some camera frames, and for case 3) it means some rAFcbs won't get a camera image. We will also need to somehow communicate the delta to the app; in the synchronous scenario it would be 0.
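As a sketch of what that synchronous-shaped delivery might look like from the app's side (frame.camera, its time attribute, and the staleness threshold are all hypothetical, following the naming used above):

```js
// Hypothetical rAF callback where the UA attaches the most recent camera image
// it has, plus its capture time; delta would be 0 in the fully synchronous case.
const MAX_ACCEPTABLE_STALENESS_MS = 33;     // arbitrary app-chosen threshold

function onXRFrame(time, frame) {
  frame.session.requestAnimationFrame(onXRFrame);
  const cameraImage = frame.camera;         // hypothetical; may be absent in case 3)
  if (cameraImage) {
    const delta = time - cameraImage.time;  // frame.time - cameraImage.time
    if (delta <= MAX_ACCEPTABLE_STALENESS_MS) {
      processCameraImage(cameraImage);      // hypothetical app callback
    }
  }
}
```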
If "no", (...)
Based on today's IW call, it looks like that's actually a "yes" - I did not understand your comment about forward head prediction.
Pinging @manishearth, @blairmacintyre and @cabanier who may have thoughts here...
There are two key kinds of scenarios I know of here for these images on AR devices:

- rendering effects that incorporate the camera image (only meaningful for views with an environment blend mode of "alpha-blend")
- computer vision processing of the camera image

In practice, apps doing rendering effects really need to ensure they condition those effects to only kick in for views with an environment blend mode of "alpha-blend" (i.e. the primary view on AR phones/tablets, the secondary MRC view on HoloLens), and for best performance will need some way to signal to the UA that they have fully rendered the background already, and so no UA composition of the background image is required.
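For example, an app could condition its background effect pass roughly like this; XRSession.environmentBlendMode is existing WebXR API, while the render helpers are assumed app-side names (and the opt-out signal mentioned above does not exist yet):

```js
// Only run our own camera-based background effect when the UA alpha-blends the
// rendered content over a camera image (AR phones/tablets, MRC-style views).
function renderFrame(session, views) {
  const cameraBacked = session.environmentBlendMode === "alpha-blend";
  for (const view of views) {
    if (cameraBacked) {
      renderCameraBackgroundEffect(view);   // hypothetical app effect pass
    }
    renderScene(view);                      // hypothetical app scene pass
  }
}
```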
Given that, perhaps it's not so bad for rendering effects and computer vision to be handled two different ways in the API?
- For computer vision: session.getCameraPose(imageCapture.grabFrame()) (after opting in to have your MediaStream be "WebXR-compatible" or such)
- For rendering effects: binding.getCameraImage(view.camera)
Each scenario is then consistent across the devices where it applies.
Thoughts?
Is it fair to say that the main problem with the currently proposed API is that it may force some implementations to deliver camera images that are slightly out of date, but since they are available on XRFrame, the apps could fall into a trap & assume that those images align with the viewer pose / views? The goal here was to provide a minimal API shape that would enable the smartphone use case, but I was hoping that the introduced interfaces could be extended to make the pose / timing mismatch explicit. Strawperson idea:
    partial interface XRFrame {
      // Contains all the cameras, including the ones that are already
      // exposed on XRViews (those would align exactly with the XRViews).
      readonly attribute FrozenArray<XRCamera> cameras;

      XRCameraPose getCameraPose(XRCamera camera, XRReferenceSpace space);
    };
    interface XRCameraPose : XRPose {
      // Inherits all the goodness from XRPose, including velocities.
      // For cameras exposed on the views, the pose relative to viewer space
      // would be identity, and `time` matches the XRFrame time.
      readonly attribute DOMHighResTimeStamp time;
    };
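A sketch of how an app might consume this strawperson (everything below is the proposed, unshipped shape; referenceSpace and processCameraImage() are assumed app-side names):

```js
// Hypothetical: enumerate all cameras on the frame, including ones that are not
// aligned with any XRView, and account for how stale each image is.
function onXRFrame(time, frame) {
  frame.session.requestAnimationFrame(onXRFrame);
  for (const camera of frame.cameras) {
    const cameraPose = frame.getCameraPose(camera, referenceSpace);
    const delta = time - cameraPose.time;   // 0 for view-aligned cameras
    processCameraImage(camera, cameraPose.transform, delta);
  }
}
```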
This would be a non-breaking change - the existing users of the raw camera access API would use the XRView variant that offers stronger guarantees about timings (the XRCameraPose.time would match XRFrame.time) and poses & projection matrices (the camera image matches the XRView), while also hopefully catering for both scenarios. The problem that I see here is that there may be cases where we'll be delivering camera images a bit later than we potentially could (but if I understand the comment about forward head pose prediction correctly, that may not actually be a problem for HoloLens).
> Is it fair to say that the main problem with the currently proposed API is that it may force some implementations to deliver camera images that are slightly out of date, but since they are available on XRFrame, the apps could fall into a trap & assume that those images align with the viewer pose / views?
On headsets, it would be wrong to use the primary view's view and projection matrices - the app must use not just the camera-specific view pose, but also a camera-specific projection matrix.
For example, HoloLens 2 has a wildly different photo/video camera FOV vs. primary view FOV. If the app does any rendering or CV on the camera image using the primary view's FOV, it will be completely wrong, and so the app would also need to grab a projection matrix from XRCameraPose.
For devices where the system's XR render cadence and the camera's capture cadence accidentally align, the approach you suggest above could work if we add projectionMatrix to XRCameraPose. However, there are still some gotchas:
One interesting thing to note about your latest proposal here is that it still offers apps two alternate paths to the XRCameraPose type, one through a given XRView and one through separate enumeration of an XRCamera. If we are comfortable offering two paths, perhaps it is OK for them to differ a bit more given the differences noted above:
    //////
    // For temporally and spatially view-aligned camera images:
    //////

    partial interface XRView {
      // Non-null iff there exists an associated camera that perfectly aligns with the view:
      [SameObject] readonly attribute XRCamera? camera;
    };

    interface XRCamera {
      // Dimensions of the camera image:
      readonly attribute long width;
      readonly attribute long height;
    };

    partial interface XRWebGLBinding {
      // Access to the camera texture itself:
      WebGLTexture? getCameraImage(XRCamera camera);
    };

    // TBD mechanism for app to opt out of automatic environment underlay
    // if app is rendering a full-screen effect already

    //////
    // For asynchronous media-based camera images:
    //////

    // Mechanism for app to opt into XR hologram overlay and/or XR poses
    // for a given XRSession during MediaDevices.getUserMedia(constraints):
    dictionary XRMediaTrackConstraints : XRMediaTrackConstraintSet {
      sequence<XRMediaTrackConstraintSet> advanced;
    };

    dictionary XRMediaTrackConstraintSet {
      // Enable secondary view for XRSession to enable system-composited hologram overlay:
      ConstrainBoolean xrSecondaryViewOverlay;

      // Enable getting per-frame camera view/projection matrices:
      ConstrainBoolean xrPoses;
    };

    partial interface XRSession {
      XRCameraPose? getCameraPose(ImageBitmap bitmap);
    };

    interface XRCameraPose {
      readonly attribute Float32Array projectionMatrix;
      [SameObject] readonly attribute XRRigidTransform transform;
    };
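To make the first, view-aligned path concrete, per-frame usage might look roughly like this (view.camera and getCameraImage() are the proposed shape; XRWebGLBinding is existing WebXR Layers API, and gl, referenceSpace, and the render helpers are assumed app-side names):

```js
// Hypothetical per-frame use of the view-aligned camera path (AR phones/tablets).
const glBinding = new XRWebGLBinding(session, gl);

function onXRFrame(time, frame) {
  frame.session.requestAnimationFrame(onXRFrame);
  const viewerPose = frame.getViewerPose(referenceSpace);
  if (!viewerPose) return;

  for (const view of viewerPose.views) {
    if (view.camera) {                       // non-null only if perfectly aligned
      const cameraTexture = glBinding.getCameraImage(view.camera);
      if (cameraTexture) {
        // The image aligns with this view's transform and projectionMatrix:
        renderCameraEffect(cameraTexture, view, view.camera.width, view.camera.height);
      }
    }
    renderScene(view);
  }
}
```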
With either your proposal or this proposal, an app that just powers ahead to fetch view.camera will only work on smartphones and tablets anyway, and so we'd still have sites that use this API failing to fetch camera images on headsets. With your proposal, we could encourage engines to generally ignore the per-view approach and enumerate frame.cameras instead, which would support headsets too. However, let's dig in on whether that would be a good or bad thing for app compat across devices:
1. Rendering effects: these only apply where view.camera is non-null.
2. Computer vision: apps could instead use the MediaDevices.getUserMedia approach to reduce latency on headsets. I am not clear whether apps doing computer vision rather than primary view rendering otherwise benefit from the association of a given camera image with a given XRFrame.
3. Photo/video capture: apps would use MediaDevices.getUserMedia to configure the camera settings and capture that photo or video. Manually encoding a video with overlaid holograms from per-frame WebXR camera images and a separately captured audio track is likely to be highly non-trivial. The easiest path to synchronized video and audio with overlaid holograms is likely to just have the app opt into some xrSecondaryViewOverlay media track constraint that would then enable a WebXR secondary view for that camera, with the system then taking care of the hologram overlay. (See the HoloLens Mixed Reality Capture opt-in docs for more info about the MixedRealityCaptureVideoEffect apps can add to a Media Foundation stream to record a video with correctly-positioned hologram overlays.)

Generally, I'm a huge fan of us fully unifying a given WebXR scenario across phones, tablets and headsets to enable maximum content compatibility! However, across the three scenarios above, only scenario 1 benefits from the per-XRFrame design, and that scenario is not relevant for headsets (unless we supported it for secondary views for optional render effects during scenario 3). For scenarios 2 and 3, the simplest and most robust path that is most aligned with non-XR computer vision would be the xrSecondaryViewOverlay and/or xrPoses constraints that can be used with MediaDevices.getUserMedia, which UAs could then support across phones, tablets and headsets.
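To illustrate that media-based path, here is a sketch combining the proposed constraints with standard getUserMedia()/ImageCapture (the xr* constraints and getCameraPose() are the proposed shape only; runComputerVision() is an assumed app callback):

```js
// Hypothetical: opt the camera stream into WebXR poses and system-composited
// hologram overlay, then associate grabbed frames with camera poses.
async function startXrAwareCapture(session) {
  const stream = await navigator.mediaDevices.getUserMedia({
    video: {
      xrPoses: true,                 // proposed: flow per-frame camera poses
      xrSecondaryViewOverlay: true,  // proposed: enable secondary view + overlay
    },
  });
  const imageCapture = new ImageCapture(stream.getVideoTracks()[0]);

  const bitmap = await imageCapture.grabFrame();
  const cameraPose = session.getCameraPose(bitmap);   // proposed
  if (cameraPose) {
    runComputerVision(bitmap, cameraPose.transform, cameraPose.projectionMatrix);
  }
}
```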
(hi @thetuvix @bialpio!)
At one point in the distant past, we talked about having an async API for delivering frames, but explicitly guaranteeing that if the frames were synchronous with the view (smartphones), they would be delivered before the rAFcb... each frame would have "all the info" (timestamp, view, projection, the ability to request a GL texture or bytes). There would likely need to be a way for the app to determine this was happening (e.g., a capability or property?).
It has its downsides, but it might simplify things.
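One way such an async shape could look, purely to make the idea concrete (every name below is hypothetical):

```js
// Hypothetical callback-based camera frame delivery. On devices where camera
// frames are synchronous with rendering, the UA would invoke this before the
// matching rAF callback; elsewhere frames arrive on the camera's own cadence.
session.requestCameraFrame(cameraFrame => {
  const { timestamp, viewTransform, projectionMatrix } = cameraFrame; // "all the info"
  const texture = cameraFrame.getTexture(gl);   // or cameraFrame.getBytes()
  runComputerVision(texture, viewTransform, projectionMatrix, timestamp);
});

// A capability/property would let the app detect the synchronous guarantee:
const camerasAreSynchronous = session.cameraFramesAlignWithAnimationFrames === true;
```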
Given how many variables the UA and the author need to account for (frame rate, FOV, camera frame rate, camera characteristics, etc.), wouldn't it be better to not spend too much time on raw camera access but instead build out native APIs for computer vision? I worry that, given all the variables, authors will just test on popular devices, and this might put smaller vendors at a disadvantage.
In addition, the number one complaint with WebXR is performance. If we push CV and sync compositing to JavaScript, I doubt that authors can build good experiences on immersive devices that are built on mobile technologies.
Based on the discussion, I think the best way to proceed would be to plan on having 2 distinct API shapes that are split based on the scenarios they are solving. In Chrome for Android, we will pursue the per-XRFrame API in a way that I have roughly outlined in the explainer, and I will create an initial specification draft that covers only this variant. This will hopefully avoid the feedback about having a specification with no implementations (as the gUM() variant would not be implemented by anyone). Additionally, for smartphone AR, we may be able to provide the camera access API via the gUM() route at a later time as well, depending on the technical difficulty and the bandwidth we have available.
> (...) wouldn't it be better to not spend too much time on raw camera access but instead build out native APIs for computer vision?
I think this was largely covered during the last IW call - the need for a way of accessing the camera is there, and the consumers of the API seem to be willing to pay the performance cost for this. That is not to say that we can't work on the native APIs for CV, but the problem with that approach is that without the camera access API, we're preventing people from innovating and trying various things that we may not even think about now (I seem to recall "shoe detection" being brought up? :smiley:). As a bonus, we could influence what we're working on based on what solutions crop up in the wild - if it turns out some scenarios / algorithms are very common, we can try to make them a part of the platform.
I see. So is this just an API for experimentation and not meant to ship out to the general public?
It is meant to be shipped to the general public, who will then amaze us with what they are able to come up with, and hopefully give us a chance to learn from those experiments & influence what CV algorithms we can then attempt to standardize. At this point, based on what @nbutko shared during the last IW call, I'm worried that we're hurting developers by not giving them access to the camera pixels in a way that could be decorated with information coming from WebXR. And, given that we prioritized & launched the privacy-preserving APIs first, developers should be able to pick those, and only choose the raw camera access API once they are forced to.
Now that the discussion has slowed down a bit and the dust has settled, I think this is a good moment to archive this repository and move future discussions to the https://github.com/immersive-web/raw-camera-access/ repo.
Devices that do not have a 1:1 mapping of camera frames to XR frames cannot implement this synchronous API approach, so it's only implementable on handheld AR (e.g., phones, tablets).
VR and AR HMDs that have cameras do not run the camera at the same frame rate as the XR API.
See, for example, the discussion of the implementation we did 2 years ago of an async API in the WebXR Viewer (https://blog.mozvr.com/experimenting-with-computer-vision-in-webxr/).
Also, see the discussion of why asynchronous access is needed in the CV repo you link to in the explainer. There is nothing to prevent promises from resolving immediately on a handheld platform, but any WebXR camera API needs to support all devices.