Vanilagy / mp4-muxer

MP4 multiplexer in pure TypeScript with support for WebCodecs API, video & audio.
https://vanilagy.github.io/mp4-muxer/demo
MIT License

Audio and Video Sync Issue #17

Closed matchingCube closed 7 months ago

matchingCube commented 1 year ago

Thanks for your great module. Up to 480p, video recording works fine, but when I record 720p video, an audio & video sync problem occurs: the audio is longer than the video. Can you guide me on how to resolve this issue? Or is it a bug in this module?

Vanilagy commented 1 year ago

Could very well be a bug, although I thought I had already fixed this bug in the past. Could I have a bit more context, maybe the code you use to add chunks to the muxer, and/or a demo output file with the desync present?

matchingCube commented 1 year ago

Thanks for your reply. Can you share your email or whatsapp or telegram info so that I can share my project and the issue via the video made using my app?

matchingCube commented 1 year ago

Excuse me, can we continue discussing?

Vanilagy commented 1 year ago

Oh I'm sorry, I forgot to reply.

I'd be willing to help you privately only for a charge, as I'm also busy with other things. I think the much better solution would be for you to post the relevant code (with any sensitive parts removed) and a demo file right here on GitHub, so it's publicly available and is there to help people in the future.

matchingCube commented 1 year ago

Unfortunately, I can't share my project. If you can fix it, I will pay for your work.

Vanilagy commented 1 year ago

Is it not possible to create a demo video that reveals no big secrets about your project? Or a code snippet with sensitive parts stripped out?

matchingCube commented 1 year ago

https://drive.google.com/file/d/18VKursQo1f3Mo70NbPP6_khivLA4al3D/view?usp=drive_link Please request access to this file so that you can check the test video with the A/V sync issue.

And I invited you to my git repo of the project so that you can run it yourself and check the code.

Vanilagy commented 1 year ago

I'll happily inspect your issue and help you out in private - I charge 48 USD an hour. You can send me this money over my Ko-fi which you find here: https://ko-fi.com/vanilagy

After you do, I'll look into the problem as soon as possible, so likely during the evenings (CEST).

UtkuBulkan commented 1 year ago

Hi @Vanilagy, same issue here. Can I use Ko-fi?

Vanilagy commented 1 year ago

@UtkuBulkan You can, but I'd urge you to try and send a demo file here that shows this desync - or show me the code that produced the file, with sensitive data stripped out. This way, we can all benefit from the solution :)

Mat-thieu commented 10 months ago

I've had this issue (the audio would gradually get out of sync because it played more slowly) and managed to resolve it. For me, it came down to making sure that the sampleRate of the audioEncoder is the same as the sampleRate of the file you decode.

    // audioEncoder is an already-constructed AudioEncoder
    const numberOfChannels = 1; // only channel 0 is encoded below
    let audioContext = new AudioContext();
    let audioBuffer = await audioContext.decodeAudioData(await (await fetch('./assets/BigBuckBunny.mp4')).arrayBuffer());
    audioEncoder.configure({
        codec: 'mp4a.40.2',
        sampleRate: audioBuffer.sampleRate, // the important bit, which was a static value for me before
        numberOfChannels,
        bitrate: 128000,
    });
    let duration = audioBuffer.duration; // seconds of audio to encode
    let dataLength = Math.floor(duration * audioBuffer.sampleRate);
    let data = new Float32Array(dataLength);

    const channel0 = audioBuffer.getChannelData(0).subarray(0, dataLength);
    data.set(channel0);

    let audioData = new AudioData({
        format: 'f32-planar',
        sampleRate: audioBuffer.sampleRate,
        numberOfFrames: dataLength,
        numberOfChannels,
        timestamp: 0,
        data
    });
    audioEncoder.encode(audioData);
    audioData.close();

The code above may not work entirely, it's just showing the gist of it; I had to cut out a bunch of stuff like combining channels and adding silence frames.

Your problem might be slightly different, it's hard to say, I hope this helps.

Vanilagy commented 10 months ago

@Mat-thieu That's interesting and somewhat unexpected. I would expect that when the encoder sample rate doesn't match the input buffer's sample rate, it automatically resamples the input before encoding.

Mat-thieu commented 10 months ago

Yeah, I expected (and hoped 😄) that as well. But unfortunately resampling has to be a separate, additional step; from some shallow research I couldn't find any evidence of AudioEncoder doing any resampling, but maybe there's a way?
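
If resampling is needed, I believe rendering the decoded buffer through an OfflineAudioContext at the encoder's sample rate should do it — untested sketch, with targetSampleRate being whatever you configured the AudioEncoder with:

    // Resample an AudioBuffer to targetSampleRate by rendering it offline.
    // The promise resolves with a new AudioBuffer whose sampleRate is targetSampleRate.
    async function resampleAudioBuffer(audioBuffer, targetSampleRate) {
        const offlineContext = new OfflineAudioContext(
            audioBuffer.numberOfChannels,
            Math.ceil(audioBuffer.duration * targetSampleRate),
            targetSampleRate
        );

        const source = offlineContext.createBufferSource();
        source.buffer = audioBuffer;
        source.connect(offlineContext.destination);
        source.start();

        return offlineContext.startRendering();
    }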

Also, I haven't dived into this part of the code, but from what I can see, the Muxer options "audio.sampleRate" and "audio.numberOfChannels" don't seem to matter for the output; instead, they will be the same as the AudioEncoder's.

Vanilagy commented 10 months ago

Keep me updated if you find anything else that looks like a desync. Sadly, the other people in this thread have stopped responding :(

Vanilagy commented 7 months ago

I'll close this issue for now as it's stale. If y'all still need help with something, feel free to reopen it :)

LiubomyrB commented 6 months ago

Hi @Vanilagy. I have the same issue when there is high CPU usage: under high CPU load, setInterval's callback is fired less often (not exactly 30 times per second). I prepared the following demo that will help you reproduce the issue. Recording works fine for the first 10 seconds; then the CPU is slowed down. You can copy and paste this code into the console of the mp4-muxer demo. Also, here is the full video of the issue; and here is the result video where the desync starts after the 10th second.

function drawStopwatch(seconds) {
    const canvas = document.querySelector('canvas');
    const ctx = canvas.getContext('2d');

    ctx.clearRect(0, 0, canvas.width, canvas.height);

    ctx.beginPath();
    ctx.arc(canvas.width / 2, canvas.height / 2, 80, 0, 2 * Math.PI);
    ctx.stroke();

    ctx.font = '20px Arial';
    ctx.textAlign = 'center';
    ctx.fillText('Stopwatch', canvas.width / 2, 30);

    ctx.font = '30px Arial';
    ctx.fillText(formatTime(seconds), canvas.width / 2, canvas.height / 2 + 10);
}

function formatTime(seconds) {
    const hours = Math.floor(seconds / 3600);
    const minutes = Math.floor((seconds % 3600) / 60);
    const remainingSeconds = seconds % 60;

    return pad(hours) + ':' + pad(minutes) + ':' + pad(remainingSeconds);
}

function pad(value) {
    return value < 10 ? '0' + value : value;
}

function cpuIntensiveTask() {
    let sum = 0;
    for (let i = 0; i < 150000000; i++) { // Adjust the loop count
        sum += i;
    }
}

let seconds = -1;
const interv = setInterval(() => {
    drawStopwatch(++seconds);
}, 1000);

setInterval(function() {
    let duration = (document.timeline.currentTime - startTime)
    console.log('fps', framesGenerated / (duration / 1000))
}, 1000)

// Call the CPU-intensive task repeatedly after 10 seconds
setTimeout(function() {
    window.throttleInterval = setInterval(cpuIntensiveTask, 100);
    setTimeout(function() {
        clearInterval(window.throttleInterval);
    }, 10000) // turn throttling off
}, 10000)

LiubomyrB commented 6 months ago

Does anyone have any ideas on how to fix this?

Vanilagy commented 6 months ago

@LiubomyrB Sorry for the late response!

Yes, so the problem here stems from the fact that a new video frame is generated every time the setInterval fires (which, as you correctly asserted, slows down when the CPU is blocked), but the timestamp of that frame is based on a constant increment/formula instead. This explains the desync that occurs.

const encodeVideoFrame = () => {
    let elapsedTime = document.timeline.currentTime - startTime;
    let frame = new VideoFrame(canvas, {
        timestamp: framesGenerated * 1e6 / 30,
        duration: 1e6 / 30
    });
    framesGenerated++; // <-- this thing here
...

Using a fixed increment to determine the timestamp, but using a setInterval to call this function, is actually not the right thing to do. I've done a lot of game dev which also has a lot of loops and fixed timestep stuff, and in a way, this is a "rookie" mistake. I just kept the demo like this for simplicity purposes. The fix depends on what your requirements are and what you use this library for.

const updateRate = 30;
const updatePeriod = 1000 / updateRate;

let lastTickTime: number | null = null;
function tick() {
  const now = document.timeline.currentTime;

  if (lastTickTime === null) {
    encodeVideoFrame();
    lastTickTime = now;

    return;
  }

  // This loop now encodes as many frames as necessary to "catch up" with now again
  while (now - lastTickTime >= updatePeriod) {
    encodeVideoFrame();
    lastTickTime += updatePeriod;
  }
}

Now, you call tick as often as you want. You should call it at least updateRate times per second, but calling it more often does not hurt. You can call it via setInterval(tick, 0), which will invoke it at roughly 250 Hz (browsers clamp the interval to about 4 ms).

For this to make sense, encodeVideoFrame needs to mathematically determine its timestamp, which is basically what the demo does with framesGenerated++.
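
Roughly like this (untested sketch; canvas and videoEncoder come from the demo, updateRate from the snippet above):

let framesGenerated = 0;

const encodeVideoFrame = () => {
    let frame = new VideoFrame(canvas, {
        // Timestamp (in microseconds) derived purely from the frame count, not from wall-clock time
        timestamp: framesGenerated * 1e6 / updateRate,
        duration: 1e6 / updateRate
    });
    framesGenerated++;

    videoEncoder.encode(frame);
    frame.close();
};

// Drive the loop much faster than updateRate; tick() decides how many frames to actually emit
setInterval(tick, 0);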

LiubomyrB commented 6 months ago

@Vanilagy Thank you, it works!

Vanilagy commented 6 months ago

Awesome!

LiubomyrB commented 5 months ago

It's better, but I still have some issues. I also tried another way to record video+audio. I have a canvas where I draw frames from different sources (camera, video file), like in OBS. Then I do the following:

  1. let stream = canvas.captureStream(30);
  2. stream.addTrack(mixedAudio); //audio from all sources are mixed into one audio track
  3. let videoTrack = new MediaStreamTrackProcessor(stream.getVideoTracks()[0]);
  4. let audioTrack = new MediaStreamTrackProcessor(stream.getAudioTracks()[0]);
  5. then I pass the readable streams of those processors to the web worker and then to the video and audio encoders, which send chunks to Mp4Muxer (roughly as in the sketch below).
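
Roughly, the setup looks like this (simplified sketch; canvas, mixedAudio and worker come from the rest of my app):

let stream = canvas.captureStream(30);
stream.addTrack(mixedAudio); // audio from all sources, mixed into one track

// Spec form of the constructor: pass an init dictionary with the track
let videoProcessor = new MediaStreamTrackProcessor({ track: stream.getVideoTracks()[0] });
let audioProcessor = new MediaStreamTrackProcessor({ track: stream.getAudioTracks()[0] });

// The readable streams yield VideoFrame / AudioData objects and are transferable,
// so they go to the worker that runs the encoders and Mp4Muxer
worker.postMessage(
    { videoStream: videoProcessor.readable, audioStream: audioProcessor.readable },
    [videoProcessor.readable, audioProcessor.readable]
);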

Expected result: I thought it would work the same way as when I record video from the canvas in WebM format using new MediaRecorder(stream) - audio and video stay synced with each other even when there is a high CPU load (if there is a high CPU load, then both audio and video start freezing synchronously). I thought that frames and audio data would have more or less the same timestamps, so there would be no sync issues.

Actual behavior: video is slower than audio when there is high CPU usage - e.g. when a user starts browsing files on their PC (tested on an i5-7200U, 2 cores, 4 threads, 8 GB RAM).

The question is: how are video and audio mixed while recording with mp4-muxer? Do they wait for each other, for example, when audio is normal but video frames are created more slowly? Does an AudioData wait for a VideoFrame with the same/similar timestamp? Or are they mixed on the fly - immediately when data is passed to the muxer? In the README you pointed out that the muxer needs to wait for the chunks from both media to finalize any given fragment (while recording fragmented MP4). Does it work the same way while recording regular MP4?

Vanilagy commented 5 months ago

@LiubomyrB That's an interesting question! Also, excuse the delayed response.

For regular, non-fragmented MP4 files, there is no need to "wait" for the other track's chunks to write into the file. That's because the timestamps will be put into the header of the file later on anyway, and will then play back correctly, assuming video and audio both have the same duration. For fragmented MP4, I actually need to interleave the chunks (the waiting you referred to), because once a segment is finalized, I can't change it anymore, so I need the chunks from both tracks.

I'm not exactly sure what you're building, but I can tell you that you do NOT need to use captureStream for a canvas. captureStream feels more hacky and makes sense when you need to pull things out of some live media, like a microphone, but for a canvas, you can simply use the VideoFrame constructor, like I do in my demo.

You should log the timestamps of the encoded chunks of both tracks in the case of high CPU load. It could be that one of the tracks is counting time differently than the other, even though this would be strange if both use captureStream, since you'd expect the timestamps to stay roughly in sync.
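
Something like this in the encoder output callbacks would already tell you a lot (sketch; muxer being your already-configured Muxer from mp4-muxer):

// Chunk timestamps are in microseconds; log them in seconds next to the muxer calls
const videoEncoder = new VideoEncoder({
    output: (chunk, meta) => {
        console.log('video chunk @', chunk.timestamp / 1e6);
        muxer.addVideoChunk(chunk, meta);
    },
    error: e => console.error(e)
});

const audioEncoder = new AudioEncoder({
    output: (chunk, meta) => {
        console.log('audio chunk @', chunk.timestamp / 1e6);
        muxer.addAudioChunk(chunk, meta);
    },
    error: e => console.error(e)
});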

LiubomyrB commented 5 months ago

I am trying to make mp4-muxer work on slower computers without the audio/video synchronization bug. Currently, I record fragmented MP4, as it allows saving each chunk separately to OPFS so the recording can be recovered if it is stopped unexpectedly. The problem is that the first 20 seconds of the recording are fine; after that, it begins to desynchronize gradually - audio is faster than video, but it's the video that has the correct length. I built mp4-muxer with some additional logging.

Below I logged the following information while recording 1 minute of video:

1. **timestamp**: (this.#audioTrack.lastTimestamp - this.#videoTrack.lastTimestamp), this.#audioTrack.lastTimestamp, this.#videoTrack.lastTimestamp
2. **queue**: this.#audioSampleQueue.length, this.#videoSampleQueue.length

```
- timestamp 0.01242899999184921 0.9599999999918509 0.9475710000000017 - queue 1 0
- timestamp 0.011323999993697598 2.0053329999936977 1.9940090000000001 - queue 1 0
- timestamp 0.019051999987077295 3.007999999987078 2.9889480000000006 - queue 12 0
- timestamp 0.24137699999043782 3.989332999990438 3.7479560000000003 - queue 1 0
- timestamp 0.0029279999983700122 4.99199999999837 4.989072 - queue 1 0
- timestamp 0.004805999998687582 5.994665999998688 5.98986 - queue 2 0
- timestamp 0.03132799998902591 7.018665999989025 6.987337999999999 - queue 1 0
- timestamp 0.00897699999999979 8 7.991023 - queue 1 0
- timestamp 0.01063099998576611 9.002665999985766 8.992035 - queue 1 0
- timestamp 0.01442299999369645 10.005332999993698 9.990910000000001 - queue 1 0
- timestamp 0.015365999987077217 11.007999999987078 10.992634 - queue 1 0
- timestamp 0.02039699998739586 12.010665999987395 11.990269 - queue 2 0
- timestamp 0.022953999995324992 13.013332999995328 12.990379000000003 - queue 3 0
- timestamp 0.04386799998566282 14.037332999985665 13.993465000000002 - queue 2 0
- timestamp 0.027954999989026064 15.018665999989025 14.990711 - queue 2 0
- timestamp 0.032816999996954976 16.021332999996957 15.988516000000002 - queue 2 0
- timestamp 0.025374999990333436 17.023999999990338 16.998625000000004 - queue 2 0
- timestamp 0.03635499999065672 18.026665999990655 17.990311 - queue 2 0
- timestamp 0.04165399999858721 19.029332999998587 18.987679 - queue 0 1
- timestamp 0.030474000012603142 20.010665999987395 20.04114 - queue 1 0
- timestamp 0.017369999996859065 21.055999999996857 21.038629999999998 - queue 1 0
- timestamp 0.021076999997170276 22.058665999997174 22.037589000000004 - queue 1 0
- timestamp 0.020735999990549203 23.061332999990555 23.040597000000005 - queue 2 0
- timestamp 0.02187199999848133 24.063999999998487 24.042128000000005 - queue 3 0
- timestamp 0.04919399998882312 25.087999999988824 25.038806 - queue 0 1
- timestamp 0.04128400000477228 26.047999999995227 26.089284 - queue 0 1
- timestamp 0.030924000012923614 27.007999999987078 27.038924 - queue 0 3
- timestamp 0.12444600000652173 27.96799999999348 28.092446000000002 - queue 0 5
- timestamp 0.20363800000164645 28.885332999998354 29.088971 - queue 0 6
- timestamp 0.2651800000067581 29.823999999993248 30.089180000000006 - queue 0 6
- timestamp 0.2843980000033923 30.805332999996608 31.089731 - queue 0 8
- timestamp 0.3678850000130751 31.72266599998693 32.090551000000005 - queue 0 8
- timestamp 0.3846800000020991 32.703999999997905 33.088680000000004 - queue 0 9
- timestamp 0.43214600001024905 33.663999999989755 34.096146000000005 - queue 0 9
- timestamp 0.4440070000068843 34.645332999993116 35.08934 - queue 0 10
- timestamp 0.4923720000004863 35.60533299999952 36.097705000000005 - queue 0 11
- timestamp 0.5294300000071033 36.6079999999929 37.13743 - queue 0 11
- timestamp 0.5265130000006977 37.5679999999993 38.094513 - queue 0 12
- timestamp 0.5883100000118944 38.54933299998811 39.137643000000004 - queue 0 13
- timestamp 0.6116620000085291 39.53066599999147 40.142328 - queue 0 13
- timestamp 0.6259710000121075 40.51199999998789 41.137971 - queue 0 14
- timestamp 0.665404000005708 41.471999999994296 42.137404000000004 - queue 0 16
- timestamp 0.7265880000038791 42.41066599999613 43.137254000000006 - queue 0 14
- timestamp 0.8112330000059416 43.32799999999406 44.139233000000004 - queue 0 16
- timestamp 0.8307320000025769 44.30933299999742 45.140065 - queue 0 17
- timestamp 0.8246760000091982 45.3119999999908 46.136676 - queue 0 16
- timestamp 0.848970000005842 46.29333299999416 47.142303000000005 - queue 0 17
- timestamp 0.8450230000124606 47.295999999987544 48.141023000000004 - queue 0 18
- timestamp 0.8610330000091011 48.277332999990904 49.138366000000005 - queue 0 17
- timestamp 0.8838770000057394 49.258665999994264 50.142543 - queue 0 20
- timestamp 0.9496120000093171 50.23999999999069 51.189612000000004 - queue 0 19
- timestamp 0.9666880000059521 51.22133299999405 52.188021 - queue 0 20
- timestamp 0.9602380000141082 52.1813329999859 53.141571000000006 - queue 0 20
- timestamp 1.0070320000061699 53.18399999999383 54.191032 - queue 0 21
- timestamp 1.0452170000143184 54.14399999998568 55.189217 - queue 0 22
- timestamp 1.0627100000109664 55.12533299998904 56.18804300000001 - queue 0 22
- timestamp 1.2330190000015335 55.95733299999847 57.190352000000004 - queue 0 26
- timestamp 1.2949220000066362 56.895999999993364 58.190922 - queue 0 27
- timestamp 1.3136480000032833 57.877332999996725 59.19098100000001 - queue 0 28
- timestamp 1.3817320000099187 58.77333299999009 60.15506500000001 - queue 0 27
- timestamp 1.3651790000019872 59.77599999999802 61.14117900000001 - queue 0 25
- timestamp 1.4317480000001552 60.71466599999985 62.14641400000001
```

As you can see, the difference between the audio and video timestamps is 1.43 s at the end. It turned out that the AudioEncoder encodes audio more slowly than the VideoEncoder encodes video, as there is no big difference between the audio and video timestamps before audioEncoder.encode(audioData).

The question is: do we need to process audio via AudioEncoder? Is it possible to pass AudioData from MediaStreamTrackProcessor to mp4-muxer directly without AudioEncoder?

Vanilagy commented 5 months ago

I have multiple thoughts on this:

If one encoder is slower than the other, this still shouldn't lead to an incorrect/desynced file. It simply means the file takes as long to create as the slower encoder takes to finish. As long as your finalization step looks like this:

await Promise.all([audioEncoder.flush(), videoEncoder.flush()]); // Awaiting serially might also be fine
muxer.finalize();

You should be good.

The more important thing would be that the timestamps should eventually line up. That is, when both encoders are finished, look at their last chunks. Ideally, their timestamp + duration values should be very close. If the last timestamps you get look like 60.71466599999985 and 62.14641400000001, and there are no more chunks coming after that, then something is off with the way you encode media, where one medium somehow ends up longer than the other. That's also why there are 25 video chunks still in the queue: the video is already at 62 seconds, but your audio is still at 60 seconds. Can you look into what's going on there? Can you share how you determine the timestamps for both video and audio? (I assume for audio, it's coming straight out of a MediaStreamTrackProcessor.)

The question is: do we need to process audio via AudioEncoder? Is it possible to pass AudioData from MediaStreamTrackProcessor to mp4-muxer directly without AudioEncoder?

Again, I'm not sure the AudioEncoder is actually the bottleneck here, especially because encoding audio is also way faster than encoding video. But in general, I guess you can go from AudioData directly to an EncodedAudioChunk, but then you must use some raw, uncompressed format, since AudioData is uncompressed. Probably not what you want!

LiubomyrB commented 5 months ago

I fixed the synchronization bug (in this case on a 2014 Mac mini, at least) by simply specifying the latencyHint: 0.30 option (or at least "playback") when creating the AudioContext. If this option is left unspecified (the default is "interactive"), clicking sounds appear after the first 20 s of recording, which in turn causes the synchronization bug. Probably, underpowered computers just can't process audio fast enough in "interactive" mode. So, sorry for bothering you.
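
For reference, this is all it took (the numeric latencyHint is in seconds):

// 'interactive' (default), 'balanced', 'playback', or a number of seconds are all valid hints
const audioContext = new AudioContext({ latencyHint: 0.30 });
// or, more conservatively: new AudioContext({ latencyHint: 'playback' })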