pts trouble with audio from dshow

I'm recording audio from dshow and there seems to be a problem with the pts. The ffmpeg command-line tool doesn't complain about these sources, but I also suspect this might not be a bug in PyAV. I tried several input devices, and encoding always fails with an error like:

Encoder did not produce proper pts, making some up.
Traceback (most recent call last):
  File "C:/Users/User/PycharmProjects/project/audio.py", line 43, in <module>
    output_container.mux(output_audio_stream.encode(frame))
  File "av\stream.pyx", line 155, in av.stream.Stream.encode
  File "av\codec\context.pyx", line 466, in av.codec.context.CodecContext.encode
  File "av\audio\codeccontext.pyx", line 40, in av.audio.codeccontext.AudioCodecContext._prepare_frames_for_encode
  File "av\audio\resampler.pyx", line 122, in av.audio.resampler.AudioResampler.resample
ValueError: Input frame pts 980000 != expected 1000000; fix or set to None.

I can obviously just set the pts to None and let the encoder make some up. However, this seems to be deprecated and results in the following warning:

Timestamps are unset in a packet for stream 0. This is deprecated and will stop working in the future. Fix your code to set the timestamps properly
Encoder did not produce proper pts, making some up.

To debug the problem I wrote the following code, which prints the pts, time base, sample rate, number of samples, etc., and shows exactly what is going on:

import av

# doesn't work with dshow (tried different devices, and also specifying the sample rate)
input_container = av.open(format='dshow',
                          file='audio=Eingang (High Definition Audio Device)',
                          #file='audio=Mikrofon (USB2.0 MIC)',
                          options={"audio_buffer_size": "100",
                                   #"sample_rate": "44100",
                                   },
                          )

# # works
# input_container = av.open(format='lavfi', file='sine=frequency=1000:duration=5',
#                           options={"sample_rate": "44100"})

input_audio_stream = input_container.streams.audio[0]

output_container = av.open('test.mkv', mode='w')
output_audio_stream = output_container.add_stream("aac", rate=44100)

first_pts = None

next_pts = 0
for frame in input_container.decode(audio=0):

    # subtract first pts to start at zero
    if first_pts is None:
        first_pts = frame.pts
    frame.pts -= first_pts

    print(f"pts: {frame.pts}")
    print(f"time base: {frame.time_base}")
    print(f"sample rate: {frame.sample_rate}")
    print(f"samples: {frame.samples}")
    print(f"array_shape/2: {frame.to_ndarray().shape[1]/2}")  # we have stereo
    pts_per_sample = frame.time_base.denominator / frame.time_base.numerator
    pts_per_sample /= frame.sample_rate
    next_pts = frame.pts + pts_per_sample*frame.samples
    print(f"next pts: {next_pts}")
    print("---------")

    # workaround: drop the pts so the encoder makes up its own timestamps
    frame.pts = None
    output_container.mux(output_audio_stream.encode(frame))

Example output:

pts: 0
time base: 1/10000000
sample rate: 44100
samples: 4410
array_shape/2: 4410.0
next pts: 1000000.0
---------
pts: 980000
time base: 1/10000000
sample rate: 44100
samples: 4410
array_shape/2: 4410.0
next pts: 1980000.0
---------
pts: 2000000
time base: 1/10000000
sample rate: 44100
samples: 4410
array_shape/2: 4410.0
next pts: 3000000.0
---------
pts: 3010000
time base: 1/10000000
sample rate: 44100
samples: 4410
array_shape/2: 4410.0
next pts: 4010000.0
---------
pts: 4030000
time base: 1/10000000
sample rate: 44100
samples: 4410
array_shape/2: 4410.0
next pts: 5030000.0
---------
pts: 5050000
time base: 1/10000000
sample rate: 44100
samples: 4410
array_shape/2: 4410.0
next pts: 6050000.0
---------
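
To make the mismatch in the output above explicit, here is a small sketch that replays the printed values and compares each frame's pts with the value predicted from the previous frame. The pts list, time base, sample rate and frame size are copied from the output; the helper name `pts_gaps` is made up for illustration.

```python
from fractions import Fraction

# Values copied from the example output above.
TIME_BASE = Fraction(1, 10_000_000)  # 100 ns ticks, as printed
SAMPLE_RATE = 44100
SAMPLES_PER_FRAME = 4410
OBSERVED_PTS = [0, 980000, 2000000, 3010000, 4030000, 5050000]


def pts_gaps(pts_list, samples, sample_rate, time_base):
    """Return (observed, expected, difference) for every frame after the first.

    `expected` is the previous pts plus the duration of the previous frame,
    expressed in time_base ticks.
    """
    ticks_per_frame = Fraction(samples, sample_rate) / time_base
    rows = []
    for prev, cur in zip(pts_list, pts_list[1:]):
        expected = prev + ticks_per_frame
        rows.append((cur, int(expected), int(cur - expected)))
    return rows


gaps = pts_gaps(OBSERVED_PTS, SAMPLES_PER_FRAME, SAMPLE_RATE, TIME_BASE)
for observed, expected, diff in gaps:
    print(f"observed {observed:>8}  expected {expected:>8}  diff {diff:>+7}")
```

A frame of 4410 samples at 44100 Hz lasts 0.1 s, which is 1000000 ticks in this time base. The first row reproduces the numbers from the ValueError (980000 vs. 1000000), and the remaining frames are off by 10000 to 20000 ticks, i.e. 1 to 2 ms.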

How should we handle this?
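
One possible way to handle it, sketched under assumptions rather than taken from PyAV: ignore the device timestamps and derive each pts from a running sample count, in a time base of 1/sample_rate. The class name `SamplePtsCounter` is hypothetical, and only the arithmetic is shown so that the snippet runs without a capture device.

```python
from fractions import Fraction


class SamplePtsCounter:
    """Generate gap-free pts values from the number of samples seen so far.

    Hypothetical helper, not part of PyAV. The pts is expressed in a time
    base of 1/sample_rate, so it is simply the index of the first sample
    of each frame.
    """

    def __init__(self, sample_rate):
        self.time_base = Fraction(1, sample_rate)
        self.samples_seen = 0

    def next_pts(self, samples):
        """Return the pts for a frame of `samples` samples and advance."""
        pts = self.samples_seen
        self.samples_seen += samples
        return pts


counter = SamplePtsCounter(sample_rate=44100)
new_pts = [counter.next_pts(4410) for _ in range(6)]
print(new_pts)  # [0, 4410, 8820, 13230, 17640, 22050]
```

In the loop above, this would mean replacing `frame.pts = None` with `frame.pts = counter.next_pts(frame.samples)` and setting `frame.time_base = counter.time_base`. I have not verified this against a dshow device. The trade-off is that real gaps in the capture (dropped buffers) are no longer visible in the timestamps, which matters if the audio has to stay in sync with another stream.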

As a test, I ran ffmpeg in debug mode:

ffmpeg -v debug -f dshow -i "audio=Eingang (High Definition Audio Device)" -f null -

and it seems to show the same effect:

dshow passing through packet of type audio size    88200 timestamp 801524400000 orig timestamp 801524400000 graph timestamp 801529380000 diff 4980000 Eingang (High Definition Audio Device)
dshow passing through packet of type audio size    88200 timestamp 801529380000 orig timestamp 801529380000 graph timestamp 801534350000 diff 4970000 Eingang (High Definition Audio Device)
dshow passing through packet of type audio size    88200 timestamp 801534350000 orig timestamp 801534350000 graph timestamp 801539340000 diff 4990000 Eingang (High Definition Audio Device)
dshow passing through packet of type audio size    88200 timestamp 801539340000 orig timestamp 801539340000 graph timestamp 801544410000 diff 5070000 Eingang (High Definition Audio Device)
dshow passing through packet of type audio size    88200 timestamp 801544410000 orig timestamp 801544410000 graph timestamp 801549390000 diff 4980000 Eingang (High Definition Audio Device)
dshow passing through packet of type audio size    88200 timestamp 801549390000 orig timestamp 801549390000 graph timestamp 801554370000 diff 4980000 Eingang (High Definition Audio Device)
dshow passing through packet of type audio size    88200 timestamp 801554370000 orig timestamp 801554370000 graph timestamp 801559350000 diff 4980000 Eingang (High Definition Audio Device)
dshow passing through packet of type audio size    88200 timestamp 801559350000 orig timestamp 801559350000 graph timestamp 801564330000 diff 4980000 Eingang (High Definition Audio Device)
dshow passing through packet of type audio size    88200 timestamp 801564330000 orig timestamp 801564330000 graph timestamp 801569410000 diff 5080000 Eingang (High Definition Audio Device)
dshow passing through packet of type audio size    88200 timestamp 801569410000 orig timestamp 801569410000 graph timestamp 801574380000 diff 4970000 Eingang (High Definition Audio Device)
dshow passing through packet of type audio size    88200 timestamp 801574380000 orig timestamp 801574380000 graph timestamp 801579360000 diff 4980000 Eingang (High Definition Audio Device)
dshow passing through packet of type audio size    88200 timestamp 801579360000 orig timestamp 801579360000 graph timestamp 801584340000 diff 4980000 Eingang (High Definition Audio Device)
dshow passing through packet of type audio size    88200 timestamp 801584340000 orig timestamp 801584340000 graph timestamp 801589420000 diff 5080000 Eingang (High Definition Audio Device)
dshow passing through packet of type audio size    88200 timestamp 801589420000 orig timestamp 801589420000 graph timestamp 801594400000 diff 4980000 Eingang (High Definition Audio Device)
dshow passing through packet of type audio size    88200 timestamp 801594400000 orig timestamp 801594400000 graph timestamp 801599370000 diff 4970000 Eingang (High Definition Audio Device)
dshow passing through packet of type audio size    88200 timestamp 801599370000 orig timestamp 801599370000 graph timestamp 801604350000 diff 4980000 Eingang (High Definition Audio Device)
dshow passing through packet of type audio size    88200 timestamp 801604350000 orig timestamp 801604350000 graph timestamp 801609330000 diff 4980000 Eingang (High Definition Audio Device)
dshow passing through packet of type audio size    88200 timestamp 801609330000 orig timestamp 801609330000 graph timestamp 801614410000 diff 5080000 Eingang (High Definition Audio Device)
dshow passing through packet of type audio size    88200 timestamp 801614410000 orig timestamp 801614410000 graph timestamp 801619380000 diff 4970000 Eingang (High Definition Audio Device)
dshow passing through packet of type audio size    88200 timestamp 801619380000 orig timestamp 801619380000 graph timestamp 801624360000 diff 4980000 Eingang (High Definition Audio Device)
dshow passing through packet of type audio size    88200 timestamp 801624360000 orig timestamp 801624360000 graph timestamp 801629330000 diff 4970000 Eingang (High Definition Audio Device)
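
The jitter is easier to see when the gap between consecutive packet timestamps is computed from the log. The sketch below parses lines in the format shown above with a regular expression. The nominal packet duration assumes 16-bit stereo at 44100 Hz (4 bytes per sample frame), which is an assumption based on the earlier PyAV output, not something the log states. Only the first five log lines are used, trimmed after the first timestamp, and the function names are made up for illustration.

```python
import re

# First five lines of the debug log above, trimmed after the first timestamp.
LOG_LINES = [
    "dshow passing through packet of type audio size    88200 timestamp 801524400000",
    "dshow passing through packet of type audio size    88200 timestamp 801529380000",
    "dshow passing through packet of type audio size    88200 timestamp 801534350000",
    "dshow passing through packet of type audio size    88200 timestamp 801539340000",
    "dshow passing through packet of type audio size    88200 timestamp 801544410000",
]

PACKET_RE = re.compile(r"size\s+(\d+)\s+timestamp\s+(\d+)")

BYTES_PER_SAMPLE_FRAME = 4      # assumption: 16-bit samples, 2 channels
SAMPLE_RATE = 44100             # assumption: same rate as in the PyAV output
TICKS_PER_SECOND = 10_000_000   # DirectShow timestamps use 100 ns units


def parse_packets(lines):
    """Extract (size_in_bytes, timestamp) pairs from dshow debug lines."""
    packets = []
    for line in lines:
        match = PACKET_RE.search(line)
        if match:
            packets.append((int(match.group(1)), int(match.group(2))))
    return packets


def timestamp_deviations(packets):
    """Return, per consecutive pair, the actual gap minus the nominal duration."""
    deviations = []
    for (size, ts), (_, next_ts) in zip(packets, packets[1:]):
        samples = size // BYTES_PER_SAMPLE_FRAME
        nominal = samples * TICKS_PER_SECOND // SAMPLE_RATE
        deviations.append(next_ts - ts - nominal)
    return deviations


deviations = timestamp_deviations(parse_packets(LOG_LINES))
print(deviations)  # [-20000, -30000, -10000, 70000]
```

Under these assumptions each packet nominally covers 0.5 s (5000000 ticks), and the observed gaps deviate from that by up to 7 ms, the same kind of jitter the PyAV loop printed.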

I guess that means this isn't a problem of PyAV, and I should take a look at ffmpeg to see how it handles this?

I agree with your assessment: if the ffmpeg command-line tool exhibits the same behaviour, I don't think PyAV can do much about it. Feel free to keep digging, and please report back so that other users with the same issue can learn from you!

I think I figured it out: the problem occurs because, for aac, the frames get passed through the resampler and the fifo. If you choose a format that doesn't need either of these, it just works:

output_audio_stream = output_container.add_stream("pcm_s16le", rate=44100)

The real problem lies in how PyAV implements the resampler and the fifo: they aren't made to handle any pts inconsistencies. If we take a look at the ffmpeg tool, we can see that the heavy lifting for the resampling is done by the aformat filter (https://github.com/FFmpeg/FFmpeg/blob/169259d9a381a3c2132672da5c5f250fa194fb4d/fftools/ffmpeg_filter.c#L607). The fifo is implemented by the buffersink (https://github.com/FFmpeg/FFmpeg/blob/169259d9a381a3c2132672da5c5f250fa194fb4d/fftools/ffmpeg_filter.c#L1113).

I think pyav should do the same. This will simplify the code quite a bit and will also solve this problem. One problem I encountered was that any filtering would discard the time_base of a frame, but #765 should fix this. Now it shouldn't be hard to use filters for the heavy lifting.

I'll try to create a PR to match the pyav resampler behavior to the ffmpeg tool.

PyAV-Org / PyAV

pts trouble with audio from dshow #761