androidx / media

Jetpack Media3 support libraries for media use cases, including ExoPlayer, an extensible media player for Android
https://developer.android.com/media/media3
Apache License 2.0

ExoPlayer Previews the Contents of a transformer.Composition #1014

Open jadoblrb opened 7 months ago

jadoblrb commented 7 months ago

Use case description

Ability to set an instance of androidx.media3.transformer.Composition on an instance of androidx.media3.exoplayer.ExoPlayer so that the ExoPlayer instance can play back, loop, and seek through an audio/video preview of the contents of the Composition without first requiring that an androidx.media3.transformer.Transformer generate an output video file. Robust video editing applications generally do not require that an output file be rendered from the entire editing timeline before the user can preview the results of adjustments made on an interactive video-editing timeline control.

Understanding that the Composition is part of androidx.media3.transformer and ExoPlayer is part of androidx.media3.exoplayer, I hope that something similar to my feature request, at least in spirit, is possible.

Proposed solution

Similar to ExoPlayer.setMediaSource(), or ExoPlayer.setMediaItem(), an app could call ExoPlayer.setComposition(). Alternatively, perhaps a Composition could generate something like a MediaSource that could be set into the ExoPlayer. AVFoundation, on iOS and MacOS, offers exactly this functionality; an app can create an instance of AVComposition and then set it into an instance of AVPlayer, which can then provide a preview of the contents of the composition.
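
To make the shape of the request concrete, here is a rough sketch of how we imagine the call site. `setComposition()` is purely hypothetical (it does not exist on ExoPlayer today), and the `Composition`/`EditedMediaItemSequence` builder signatures shown are based on the current transformer API and may differ between Media3 versions:

```kotlin
import androidx.media3.common.MediaItem
import androidx.media3.exoplayer.ExoPlayer
import androidx.media3.transformer.Composition
import androidx.media3.transformer.EditedMediaItem
import androidx.media3.transformer.EditedMediaItemSequence

// Build a composition from a simple two-clip sequence using the existing transformer types.
val composition = Composition.Builder(
    listOf(
        EditedMediaItemSequence(
            listOf(
                EditedMediaItem.Builder(MediaItem.fromUri("file:///sdcard/clip1.mp4")).build(),
                EditedMediaItem.Builder(MediaItem.fromUri("file:///sdcard/clip2.mp4")).build()
            )
        )
    )
).build()

val player = ExoPlayer.Builder(context).build()
player.setComposition(composition) // Hypothetical method, analogous to setMediaItem()/setMediaSource().
player.prepare()
player.play()
```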

Alternatives considered

Create my own video editing framework for Android. Especially when considering transitions such as cross dissolves, live-previewing the contents of something as complex as a full-featured video-editing timeline cannot be handled by existing ExoPlayer APIs such as ExoPlayer.setVideoEffects() or ExoPlayer.setMediaSources().

droid-girl commented 7 months ago

Thank you for the detailed use case description. We are working on enabling preview for a Composition and will gradually expand the APIs to support preview of more complex video compositions.
Could you describe your use case in more detail: what types of compositions would you want to create?

  1. Sequential videos/images with no cross fades
  2. Media items (video/image) in a sequence with a background audio track (sketched below)
  3. Multiple sequences of media items with overlapping or PiP compositions
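
For reference, option 2 might map onto the existing transformer types roughly like this. This is only a sketch: the looping-audio sequence constructor and the exact builder signatures are assumptions that may differ between Media3 versions.

```kotlin
import androidx.media3.common.MediaItem
import androidx.media3.transformer.Composition
import androidx.media3.transformer.EditedMediaItem
import androidx.media3.transformer.EditedMediaItemSequence

// A video/image sequence plus a second, looping background-audio sequence.
val videoSequence = EditedMediaItemSequence(
    listOf(
        EditedMediaItem.Builder(MediaItem.fromUri("file:///sdcard/clip.mp4")).build(),
        EditedMediaItem.Builder(MediaItem.fromUri("file:///sdcard/photo.jpg"))
            .setDurationUs(3_000_000) // 3-second still image
            .setFrameRate(30)
            .build()
    )
)
val backgroundAudio = EditedMediaItemSequence(
    listOf(EditedMediaItem.Builder(MediaItem.fromUri("file:///sdcard/music.mp3")).build()),
    /* isLooping= */ true // Assumed constructor flag for looping background audio.
)
val composition = Composition.Builder(listOf(videoSequence, backgroundAudio)).build()
```
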
jadoblrb commented 7 months ago

Thank you for your interest in our use case. We greatly appreciate the effort that your team is putting into the new Media3 framework!

The following is a list of specifics that we would like the Composition to support. I will describe these specifics in terms of the transformer and exported files. However, as indicated by my initial post, we would also very much like to have similar functionality available during composition preview, via something like ExoPlayer playing back/scrubbing/looping through a composition while displaying the video via a GL Surface (and the audio via whatever). Additionally, I know that some of these specifics may already be fully supported by the transformer, but I am listing them here for completeness:

Visual assets:

We would like to be able to commingle, within a ‘video track’ of the composition, SDR and HDR video files of a variety of different frame rates, frame sizes, bit depths, and ‘color spaces’ (primaries, transfer functions, range, etc), and then be able to specify to the transformer the video format for the exported file as well as the encoded format. The video format would include things like output frame rate, frame size, bit depth, and color space. The encoded format would at least include things like codec type (we’d be happy with HEVC and AVC), IDR cadence, bitrate.

In the case where we specify a bit depth / color space for the exported file that does not match each and every source video that exists in the composition, we would like to be able to optionally play a role in the color conversion phase of the render stack in case the render stack is not handling the conversion properly. This likely means that we should be able to optionally inject YUV->RGB and RGB->YUV shaders as necessary (perhaps, for the sake of sanity, such custom YUV->RGB or RGB->YUV shaders would only deal with NV12 and P010 formats, with all other format conversions not being customizable?).

In the case where the frame sizes of source files do not match the format specified for the exported file, we would like to have high-level options like ‘scale to fit’ and ‘scale to fill’ so as to size the source video frame within the frame of the exported file. We would also like to alternatively have direct control over the sizing of a given video frame/image within the frame of the exported file.

In the case where we specify an explicit frame rate for the exported file, and in the case where this frame rate is inferred by the transformer, we would like the transformer to export a video file with a single, constant frame rate whose video-frame timestamps are as evenly spaced as possible. In my evaluations of the transformer I have been able to export MP4 files from compositions that contained multiple video assets, each at a different frame rate, and the ‘stts’ atom/box of the exported MP4 file contained sample-duration values that varied from one another far more than I expected, especially as I expected the output to be a constant-frame-rate file. If one should not expect a video file exported by the transformer to have a constant frame rate, and instead each segment of an exported video plays at the rate of its source, then things like PiP, multiple concurrent tracks of possibly semi-transparent video, and transitions become problematic, as all of these concepts can result in composited video frames built from the frames of several source videos whose frame rates differ from one another.

We would like to be able to add still-images to the composition in a way that is similar to videos. These still images may exist in an HDR format. I have found APIs that I believe should be used to insert a still image into a composition. However, I have not yet been successful in getting a still image into the composited output of a transformer using these APIs, and I am wondering if this is possible (or is inhibited by a bug?).

We would like to be able to set an IN and OUT point for each video. We’d like to be able to add a duration for each still-image.

We would like to be able to override the actual frame rate of a given source video file and provide a floating-point value by which the frame rate of the source file is scaled. I.e., if the exported file is to be 30fps and within the video track there is a source file that is 60fps, we could pass a value of 0.5 for this particular asset, which would slow the rate of the asset from 60fps down to 30fps, thereby doubling the wall-clock duration of the asset and slowing its effective play rate by 2x. We are presently happy to perform such alteration of frame rate without the expectation of anything fancy like frame blending or optical flow. We are happy to drop frames (eg: when the exported file is 30fps and the effective rate of the source asset is 60fps) and we are also happy to have frame duplication (eg: when the exported file is 60fps and the effective rate of the source asset is 30fps).
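
For illustration, here is a rough sketch of how this might be approximated with pieces that exist today, assuming SpeedChangeEffect and SonicAudioProcessor behave as their documentation describes. This is only an approximation, not the per-asset frame-rate override we are asking for:

```kotlin
import androidx.media3.common.MediaItem
import androidx.media3.common.audio.SonicAudioProcessor
import androidx.media3.effect.SpeedChangeEffect
import androidx.media3.transformer.EditedMediaItem
import androidx.media3.transformer.Effects

// Play a 60 fps clip at half speed: video timestamps are stretched 2x,
// and the clip's audio is slowed to match.
val halfSpeedAudio = SonicAudioProcessor().apply { setSpeed(0.5f) }
val slowedClip = EditedMediaItem.Builder(MediaItem.fromUri("file:///sdcard/clip_60fps.mp4"))
    .setEffects(
        Effects(
            /* audioProcessors= */ listOf(halfSpeedAudio),
            /* videoEffects= */ listOf(SpeedChangeEffect(0.5f))
        )
    )
    .build()
```
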

We would like to be able to add a transition between subsequent assets in a given video track. Such a transition could be canned (like a cross fade) or could be custom in a way that would likely be handled similarly to the PeriodicVignetteShaderProgram from the transformer demo app (yet, this transition program would take two inputs…)

Our specific use case would not likely require us to need more than one track of video/images, so long as transitions can occur between two subsequent assets. However…

We would like to be able to add titles on top of the video track, which admittedly may require that more than one video track be supported, unless there is a special-case ‘title track’ that can be added to the composition (though such a special case would probably infringe upon the design of the framework…)

We would like to be able to inject our own effect shaders (eg: PeriodicVignetteShaderProgram, which already exists, woohoo!)
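
For context, this is roughly how we would expect to wire a custom shader program into a single clip. MyVignetteShaderProgram is a placeholder for an app-defined GlShaderProgram along the lines of the demo app’s PeriodicVignetteShaderProgram; the rest uses existing types:

```kotlin
import androidx.media3.common.MediaItem
import androidx.media3.effect.Contrast
import androidx.media3.effect.GlEffect
import androidx.media3.transformer.EditedMediaItem
import androidx.media3.transformer.Effects

// Attach a built-in effect plus a custom shader program to one clip.
// MyVignetteShaderProgram is a hypothetical, app-defined GlShaderProgram.
val editedItem = EditedMediaItem.Builder(MediaItem.fromUri("file:///sdcard/clip.mp4"))
    .setEffects(
        Effects(
            /* audioProcessors= */ listOf(),
            /* videoEffects= */ listOf(
                Contrast(0.2f),
                GlEffect { context, useHdr -> MyVignetteShaderProgram(context, useHdr) }
            )
        )
    )
    .build()
```
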

We would like to be able to specify the color of the ‘canvas’. This ‘canvas’ is the solidly colored background of the composited frame that shows through in the event that no pixels from a video or still-image occlude this background. This ‘canvas’ is also effectively what shows through whenever visual assets need to be letter-boxed or pillar-boxed.

We would like to be able to have all relevant parameters of video assets be key-framable (perhaps limited to keyframe interpolation options like linear or ease-in/ease-out), eg: video-frame/image scale; video-frame/image X and Y offset; video frame opacity. We would like to have similar key-framable parameters that apply to the entire composition as a whole.

We are happy to specify the contents of the video tracks of a composition via ‘packed-format’ containers like lists or sequences. However, we would like the ability to include within a video track areas of empty space of a specific duration, where no video or still-image asset is present and thus only the canvas or background is visible within the composition. This could be achieved by adding something like a SolidColor asset to the composition.

Audible assets:

We would like to be able to commingle, within ‘audio tracks’ of the composition, the audio component of video files as well as arbitrary standalone audio files (see below...), all of which may be of a variety of sample rates, channel configurations, and bit depths, and then be able to specify to the transformer the audio format for the exported file as well as the encoded format (we’d be happy with AAC). Fine control over audio is less important to us than fine control over video/imagery. For audio we would like to have a high-level API that tells the transformer to take all the varied audio input and do its best to mix to either mono or stereo (likely always stereo). Perhaps it would be nice to have an option to either discard the audio channels ‘above’ stereo, from within a format like 5.1, or to keep these extra channels and mix them into stereo.
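
To illustrate the stereo mix-down, here is a sketch against the current audio-processor types. It assumes composition-level Effects are the right hook and reuses sequences like those sketched earlier (videoSequence, musicSequence); ChannelMixingMatrix.create only covers simple channel-count pairs, so down-mixing something like 5.1 would presumably need an explicitly constructed matrix:

```kotlin
import androidx.media3.common.audio.ChannelMixingAudioProcessor
import androidx.media3.common.audio.ChannelMixingMatrix
import androidx.media3.transformer.Composition
import androidx.media3.transformer.Effects

// Ask for a stereo mix of the composition's audio. Default matrices are added
// for the simple cases (mono->stereo, stereo->stereo).
val toStereo = ChannelMixingAudioProcessor().apply {
    putChannelMixingMatrix(ChannelMixingMatrix.create(/* inputChannelCount= */ 1, /* outputChannelCount= */ 2))
    putChannelMixingMatrix(ChannelMixingMatrix.create(/* inputChannelCount= */ 2, /* outputChannelCount= */ 2))
}
val composition = Composition.Builder(listOf(videoSequence, musicSequence))
    .setEffects(Effects(/* audioProcessors= */ listOf(toStereo), /* videoEffects= */ listOf()))
    .build()
```
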

We would like to have the audio components of the video files exist on an audio track within the composition that is separate from the audio track that contains the standalone audio files. We would like virtually all of the temporal properties of the audio assets which are tied to video files to inherit the properties of the respective video assets within the composition.

The standalone audio track that we would add to a composition would likely always contain an audio asset that represents a music file.

We would like for the audio component of video files to be trimmed to the duration of the video component of the video files such that there are no black video frames introduced, either at the beginning of the asset or the end of the asset, nor is there unexpected audio trailing into a subsequent video file of the composition, when the video and audio components of a video file do not temporally match. That said, MP4 files captured from Android phones, in particular, often have an ‘elst’ atom/box within the video track of the file which indicates that there is an ‘empty edit’ before the actual video stream of the file begins. I assume that this occurs in an attempt to sync the audio and the video of the file (however, this often causes non-android-ecosystem players/transcoders to inject a black frame before the video starts, as the only other option, in absence of special-case handling, is to duplicate the first frame). In the event of such an ‘empty edit’ existing in a video file we would request that the video/audio be trimmed such that this ‘empty edit’ be altogether removed from the asset when it exists within a composition. Alternatively, if the audio component of a video file ends before the video component, then we would like to have this audio component padded out using silence so that the duration of the audio component matches the duration of the video component when these two components become what is effectively a single asset within a composition.

We would like to be able to control via a key-framable parameter the effective volume of each and every one of the audio assets (either those audio assets that are tied to the video of a video file or those that are standalone). A use case for such key-framable control would be to either ‘duck’ a music track whenever we wished to focus on the audio component of video, or to fade audio in/out at the beginning and ending of a composition. We would also like a key-framable parameter that controls the effective volume of the entire composition.

We would like to have transitions be available for audio assets, similar to transitions for video assets. We’d likely be happy to use only simple cross fade transitions for audio assets. Whenever a transition is applied to a video asset, the audio-component of this video asset should have a cross fade transition applied between it and any subsequent audio asset.

We are happy to specify the contents of the audio tracks of a composition via ‘packed-format’ containers like lists or sequences. However, we would like the ability to include within an audio track areas of silence of a specific duration, where no audio is heard within the composition. This could be achieved by adding something like a Silence asset to the composition.

General:

We would like to be able to load certain GLES assets, like textures, in a way that they are effectively global to the transformer. We would like to load these only once, at initialization, and keep them alive for the lifespan of the transformer. We would like these textures to be available within our own custom effects (similar in kind to the PeriodicVignetteShaderProgram from the transformer demo app).

Thanks again!

droid-girl commented 7 months ago

Hi! Thank you for such detailed feedback. I will start adding answers to some of your questions:

The following is a list of specifics that we would like the Composition to support. I will describe these specifics in terms of the transformer and exported files. However, as indicated by my initial post, we would also very much like to have similar functionality available during composition preview, via something like ExoPlayer playing back/scrubbing/looping through a composition while displaying the video via a GL Surface (and the audio via whatever). Additionally, I know that some of these specifics may already be fully supported by the transformer, but I am listing them here for completeness:

The team is working on adding composition preview functionality and we are looking at providing feature parity between preview and export of a Composition.

Visual assets: We would like to be able to commingle, within a ‘video track’ of the composition, SDR and HDR video files of a variety of different frame rates, frame sizes, bit depths, and ‘color spaces’ (primaries, transfer functions, range, etc),

This is all possible already, with some caveats.

  1. For now, if SDR+HDR content is commingled, SDR must be output (e.g. via tone-mapping). Media3 could fix this by implementing SDR->HDR tone-mapping.

  2. While these can be commingled, I think we don’t handle bit depths and color range well, or at all, when GlEffects are applied to them.

and then be able to specify to the transformer the video format for the exported file as well as the encoded format.

This should be supported already

The video format would include things like output frame rate, frame size, bit depth, and color space.

Most of it should be supported, but not setting bit depth and color space, though color space can be partially controlled (to output HDR vs SDR) via hdrMode.

The encoded format would at least include things like codec type (we’d be happy with HEVC and AVC), IDR cadence, bitrate.

I believe we support HEVC+AVC+bitrate. I don’t know what IDR cadence is. See ExportResult (internal link) for all fields.
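
For reference, a sketch of the encoder-side knobs that exist today (method and constant names as currently published; a key-frame/IDR interval is not directly exposed here, and `context` is an Android Context):

```kotlin
import androidx.media3.common.MimeTypes
import androidx.media3.transformer.DefaultEncoderFactory
import androidx.media3.transformer.Transformer
import androidx.media3.transformer.VideoEncoderSettings

// Request HEVC output at roughly 10 Mbps.
val encoderFactory = DefaultEncoderFactory.Builder(context)
    .setRequestedVideoEncoderSettings(
        VideoEncoderSettings.Builder()
            .setBitrate(10_000_000)
            .build()
    )
    .build()
val transformer = Transformer.Builder(context)
    .setVideoMimeType(MimeTypes.VIDEO_H265) // Use MimeTypes.VIDEO_H264 for AVC.
    .setEncoderFactory(encoderFactory)
    .build()
```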

In the case where we specify a bit depth / color space for the exported file that does not match each and every source video that exists in the composition, we would like to be able to optionally play a role in the color conversion phase of the render stack in case the render stack is not handling the conversion properly. This likely means that we should be able to optionally inject YUV->RGB and RGB->YUV shaders as necessary (perhaps, for the sake of sanity, such custom YUV->RGB or RGB->YUV shaders would only deal with NV12 and P010 formats, with all other format conversions not being customizable?)

We only support relatively simple color control for apps now, where you can specify an hdrMode (internal link). For the rest of this, you could already inject these shaders, but color signaling isn’t great, so it may be confusing which color primary you’re currently operating on if you inject custom shaders. Media3 can fix this by implementing “color management”.
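
For example, tone-mapping HDR input down to SDR output can be requested roughly like this (a sketch; videoSequence stands for an EditedMediaItemSequence built elsewhere):

```kotlin
import androidx.media3.transformer.Composition

// Tone-map any HDR input down to SDR output for the whole composition.
val composition = Composition.Builder(listOf(videoSequence))
    .setHdrMode(Composition.HDR_MODE_TONE_MAP_HDR_TO_SDR_USING_OPEN_GL)
    .build()
```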

droid-girl commented 7 months ago

Continuing:

In the case where the frame sizes of source files do not match the format specified for the exported file, we would like to have high-level options like ‘scale to fit’ and ‘scale to fill’ so as to size the source video frame within the frame of the exported file. We would also like to alternatively have direct control over the sizing of a given video frame/image within the frame of the exported file.

This case should already be enabled by using the Presentation effect and transformation settings.
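
For example, a sketch of the two high-level options, assuming a 1920x1080 output frame:

```kotlin
import androidx.media3.common.MediaItem
import androidx.media3.effect.Presentation
import androidx.media3.transformer.EditedMediaItem
import androidx.media3.transformer.Effects

// 'Scale to fit' letter/pillar-boxes the source inside 1920x1080;
// 'scale to fill' crops it to cover the full output frame.
val scaleToFit =
    Presentation.createForWidthAndHeight(1920, 1080, Presentation.LAYOUT_SCALE_TO_FIT)
val scaleToFill =
    Presentation.createForWidthAndHeight(1920, 1080, Presentation.LAYOUT_SCALE_TO_FIT_WITH_CROP)

val fittedClip = EditedMediaItem.Builder(MediaItem.fromUri("file:///sdcard/clip.mp4"))
    .setEffects(Effects(/* audioProcessors= */ listOf(), /* videoEffects= */ listOf(scaleToFit)))
    .build()
```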

In the case where we specify an explicit frame rate for the exported file, and in the case where this frame rate is inferred by the transformer, we would like the transformer to export a video file with a single, constant frame rate whose video-frame timestamps are as evenly spaced as possible. In my evaluations of the transformer I have been able to export MP4 files from compositions that contained multiple video assets, each at a different frame rate, and the ‘stts’ atom/box of the exported MP4 file contained sample-duration values whose variation from one another bordered on the absurd for what I expected to be a constant-frame-rate file. If one should not expect a video file exported by the transformer to have a constant frame rate, and instead each segment of an exported video plays at the rate of its source, then things like PiP, multiple concurrent tracks of possibly semi-transparent video, and transitions become problematic, as all of these concepts can result in composited video frames built from the frames of several source videos whose frame rates differ from one another.

I think there are two issues discussed here:

  1. In a sequence, if we have a 30fps and a 60fps video, the output frame rate should be constant. I’m not really sure if this is the case or not…
  2. When compositing, we should be able to have a constant frame rate.

For Compositor, the frame rate currently depends on the frame rate of the primary stream, which may have a variable frame rate. This was discussed, and we had a proposed extension to enforce a minimum frame rate, but we ended up not prioritizing it for the moment.

We would like to be able to add still-images to the composition in a way that is similar to videos. These still images may exist in an HDR format.

There is support for still-images in compositions, however HDR format is not yet supported. Adding HDR support to image transcoding or BitmapOverlays is on our roadmap.

We would like to be able to set an IN and OUT point for each video.

It is not yet possible to set an IN and OUT point for each video, but it is on our roadmap.

We’d like to be able to add a duration for each still-image.

It is possible to add a duration for each still-image by setting durationUs and frameRate when building an EditedMediaItem.
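
For example, a sketch showing a still image displayed for 3 seconds, rendered at 30 fps:

```kotlin
import androidx.media3.common.MediaItem
import androidx.media3.transformer.EditedMediaItem

// A still image shown for 3 seconds at 30 fps (durationUs is in microseconds).
val imageItem = EditedMediaItem.Builder(MediaItem.fromUri("file:///sdcard/photo.jpg"))
    .setDurationUs(3_000_000)
    .setFrameRate(30)
    .build()
```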

andrewlewis commented 7 months ago

we should be able to optionally inject YUV->RGB and RGB->YUV shaders as necessary (perhaps, for the sake of sanity, such custom YUV->RGB or RGB->YUV shaders would only deal with NV12 and P010 formats, with all other format conversions not being customizable?)

I'm curious what the use case for handling YUV, P010 and NV12 formats is here. We have some early plans around color management (basically: signaling compatible formats and optionally doing automatic conversions between effects in the chain), but YUV and subsampled color input to shaders doesn't seem particularly useful. I can imagine they might be desirable if you want to connect a decoder directly to an encoder as an optimized path, though, so I am wondering if that's why you bring these up?

In my evaluations of the transformer I have been able to export MP4 files from compositions that contained multiple video assets, each at a different frame rate, and the ‘stts’ atom/box of the exported MP4 file contained sample-duration values that varied from one another far more than I expected, especially as I expected the output to be a constant-frame-rate file.

Currently output timestamps should correspond to the input file(s) (specifically, frame durations match, though the timestamps from each stream obviously need to be offset by the cumulative duration of what came before). You should get an even output frame rate if the input frame durations are evenly spaced and the input file durations are correct. When the input has variable frame rates, we can drop frames or change the speed (which will make audio sound worse), but doing frame interpolation to generate evenly spaced output frames is unlikely to be possible in general for a while (and even when phones support this, it's likely to cause a video quality drop because the interpolation algorithm won't be perfect).