OpenVisualCloud / Media-Transport-Library

A real-time media transport(DPDK, AF_XDP, RDMA) stack for both raw and compressed video based on COTS hardware.
BSD 3-Clause "New" or "Revised" License

RTP timestamp not calculated correctly for audio #956

Open hogliux opened 2 weeks ago

hogliux commented 2 weeks ago

When inspecting the egress timings of RTP packets produced by MTL with Wireshark, I see that the RTP timestamp encoded in each RTP packet roughly matches the egress time of that packet:

[Screenshot from 2024-08-29 16-56-46: Wireshark capture with the packet egress time and the RTP timestamp field marked in red]

Wireshark is set to show the time as SECS.NANOS since the epoch, so the packet egress time (marked red above) in nanoseconds since the epoch is:

(1724943228*1000000000)+343008611=1724943228343008611

converted to samples (@ 48kHz):

1724943228343008611*1e-9*48000=82797274960464

Taking the lower 32 bits of that number:

82797274960464 & 0xffffffff = 3190395472

which is exactly the RTP timestamp field in that packet (also marked red above).
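
The arithmetic above can be reproduced with a minimal C sketch. The function name rtp_timestamp_from_time and its parameters are hypothetical and are not part of MTL. Seconds and nanoseconds are converted separately because multiplying the full nanosecond count by 48000 would not fit in 64 bits.

```c
#include <stdint.h>

/* Hypothetical helper (not an MTL function): convert a wall-clock time,
 * given as seconds and nanoseconds since the epoch, into a 32-bit RTP
 * timestamp at the given sampling rate. */
static uint32_t rtp_timestamp_from_time(uint64_t secs, uint32_t nanos,
                                        uint32_t sample_rate_hz) {
  /* Whole seconds contribute an exact number of samples. */
  uint64_t samples = secs * sample_rate_hz;
  /* The sub-second part is truncated to whole samples. */
  samples += ((uint64_t)nanos * sample_rate_hz) / 1000000000ULL;
  /* The RTP timestamp field carries only the lower 32 bits. */
  return (uint32_t)(samples & 0xffffffffULL);
}
```

Calling rtp_timestamp_from_time(1724943228, 343008611, 48000) returns 3190395472, the value seen in the packet.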

However, I believe this to be incorrect:

Section 7.7.2 of the ST 2110-10 standard says:

[...] the RTP Timestamp of audio RTP packets should reflect the sampling instant of the first sample of the audio signal within the audio RTP packet.

Hence, logically, the earliest an RTP packet can possibly be sent is after the last sample inside the RTP packet has been captured, i.e. 1 ms after the time indicated by the RTP timestamp field (I'm assuming a ptime of 1 ms here). With the currently faulty RTP timestamp field, a receiver seeing this packet on the wire will conclude, for example, that the last sample in the RTP packet was captured 1 ms in the future, which doesn't make sense.

Indeed, on the few hardware and virtual AES67 audio devices that we have access to, a packet is sent out at the earliest 1 ms after the time encoded in its RTP timestamp field.

This issue is causing us synchronization issues with said audio devices.

Note that this distinction does not matter for video, as RTP packets typically only carry one frame (i.e. the first and last frame are identical anyway).
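
The causality argument above can be written down as a small check. This is only a sketch: the names earliest_egress_ns and egress_is_causal are hypothetical, and all times are assumed to be nanoseconds since the epoch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: a packet cannot leave before its last sample has
 * been captured, i.e. one packet time after the sampling instant of its
 * first sample (the instant the RTP timestamp should reflect). */
static uint64_t earliest_egress_ns(uint64_t first_sample_ns, uint64_t ptime_ns) {
  return first_sample_ns + ptime_ns;
}

/* Returns true if the observed egress time is consistent with the
 * sampling instant implied by the RTP timestamp. */
static bool egress_is_causal(uint64_t first_sample_ns, uint64_t ptime_ns,
                             uint64_t egress_ns) {
  return egress_ns >= earliest_egress_ns(first_sample_ns, ptime_ns);
}
```

With a ptime of 1 ms (1000000 ns), a packet whose egress time equals the time implied by its RTP timestamp fails this check, which is the situation described in this report.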

ricmli commented 2 weeks ago

In this case, you can try this flag for the st30 session: ST30_TX_FLAG_USER_TIMESTAMP. The user should then provide the RTP timestamp (which reflects the sampling time) for each 1 ms frame in the next_frame callback via st30_tx_frame_meta.

frankdjx commented 2 weeks ago

Hence, logically, the earliest an RTP packet can possibly be sent is after the last sample inside the RTP packet has been captured, i.e. 1 ms after the time indicated by the RTP timestamp field (I'm assuming a ptime of 1 ms here). With the currently faulty RTP timestamp field, a receiver seeing this packet on the wire will conclude, for example, that the last sample in the RTP packet was captured 1 ms in the future, which doesn't make sense.

Great finding. It can easily be fixed in the function tx_audio_pacing_time_stamp in https://github.com/OpenVisualCloud/Media-Transport-Library/blob/main/lib/src/st2110/st_tx_audio_session.c#L236 by changing

uint64_t tmstamp64 = epochs * pacing->pkt_time_sampling;

to

uint64_t tmstamp64 = (epochs + 1) * pacing->pkt_time_sampling;

This change should fix the synchronization issues for your audio device.

frankdjx commented 2 weeks ago

Another enhancement would be to provide an option that lets the user customize the RTP timestamp offset relative to the epoch. MTL provides such an option for ST20 in https://github.com/OpenVisualCloud/Media-Transport-Library/blob/main/include/st20_api.h#L1200; we can apply a similar approach for audio as well.
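
A minimal sketch of what such a configurable offset might look like follows. It is not MTL code: the function audio_rtp_timestamp and the epoch_offset parameter are hypothetical. It assumes that epochs counts packet intervals since the epoch and that pkt_time_sampling is the number of samples per packet (48 for a 1 ms ptime at 48 kHz).

```c
#include <stdint.h>

/* Hypothetical helper, not part of MTL: compute the 32-bit RTP timestamp
 * for an audio packet, shifted by a user-chosen number of epochs. */
static uint32_t audio_rtp_timestamp(uint64_t epochs, uint32_t pkt_time_sampling,
                                    int32_t epoch_offset) {
  /* Unsigned wrap-around makes negative offsets subtract as expected. */
  uint64_t shifted = epochs + (uint64_t)(int64_t)epoch_offset;
  uint64_t tmstamp64 = shifted * pkt_time_sampling;
  /* Only the lower 32 bits go into the RTP header. */
  return (uint32_t)tmstamp64;
}
```

An epoch_offset of 0 corresponds to the original line quoted earlier in this thread, and an epoch_offset of 1 corresponds to the proposed (epochs + 1) change; the two results differ by exactly pkt_time_sampling samples.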

hogliux commented 2 weeks ago

Thank you @frankdjx and @ricmli for your suggestions. That's very useful to know. I've tested @frankdjx's suggested code change and it works. An option to customize the offset to the epoch would be even better, of course.