alvr-org / ALVR

Stream VR games from your PC to your headset via Wi-Fi

Suggestion: Speed up decoder by queue-ing queueInputBuffer (previous idea: BUFFER_FLAG_PARTIAL_FRAME) #2436

Open vosk opened 2 weeks ago

vosk commented 2 weeks ago

Just an idea: I've been browsing through the client decoder stream setup, and it seems a whole frame needs to be queued before the decoder can start. Ignoring any details that I have no idea about, it could be possible to retrieve the buffer from dequeueInputBuffer and dump each chunk from the socket directly into the MediaCodec buffer. I haven't explored the details (Rust is new to me), but it seems it would save a ptr::copy_nonoverlapping, as well as prime the decoder with more data as soon as it arrives. Any thoughts on why this can't work?
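A minimal sketch of the idea, where the `Decoder` trait and `recv_chunk` are invented for illustration and are not real ALVR or NDK types:

```rust
// Stand-in trait: models the two MediaCodec calls the idea relies on.
trait Decoder {
    /// Borrow an empty MediaCodec input buffer together with its index.
    fn dequeue_input_buffer(&mut self) -> (usize, &mut [u8]);
    /// Hand the filled region back to the codec for decoding.
    fn queue_input_buffer(&mut self, index: usize, offset: usize, size: usize);
}

// Write each socket chunk straight into the decoder-owned buffer instead of
// assembling the frame elsewhere and copying it in with ptr::copy_nonoverlapping.
fn feed_decoder_incrementally(
    decoder: &mut impl Decoder,
    mut recv_chunk: impl FnMut(&mut [u8]) -> Option<usize>, // fills from the socket
) {
    let (index, buffer) = decoder.dequeue_input_buffer();
    let mut filled = 0;
    // Prime the decoder with more data as soon as it arrives.
    while let Some(n) = recv_chunk(&mut buffer[filled..]) {
        filled += n;
    }
    decoder.queue_input_buffer(index, 0, filled);
}
```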

vosk commented 1 week ago

It looks like the safest way to do this is to have VideoDecoderSink produce something like a SocketBuffer, which can be fed into SocketReader, which will eventually get a BorrowedCursor (https://doc.rust-lang.org/std/io/struct.BorrowedCursor.html) through a BorrowedBuf (https://doc.rust-lang.org/std/io/struct.BorrowedBuf.html) to do a read(socket_buffer). Using the media_codec queue_input_buffer offset and size arguments, we can send the "video data" part of the buffer for processing. However, both of those APIs are still experimental.
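Roughly this shape, as a sketch on nightly Rust; the TcpStream and the buffer argument are stand-ins for whatever ALVR actually uses:

```rust
#![feature(read_buf, core_io_borrowed_buf)] // both APIs are nightly-only

use std::io::{BorrowedBuf, Read};
use std::mem::MaybeUninit;
use std::net::TcpStream;

// Read from the socket directly into an uninitialized decoder-owned buffer,
// without zero-initializing it first.
fn read_into_decoder_buffer(
    socket: &mut TcpStream,
    decoder_buffer: &mut [MaybeUninit<u8>], // e.g. a MediaCodec input buffer
) -> std::io::Result<usize> {
    let mut buf: BorrowedBuf<'_> = decoder_buffer.into();
    socket.read_buf(buf.unfilled())?;
    Ok(buf.len()) // number of bytes now initialized and filled
}
```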

vosk commented 1 week ago

Abandoning the idea since StreamSocket.recv is used for multiple streams :)

zarik5 commented 1 week ago

Sorry I haven't replied sooner. The idea actually makes sense, but not in the way you think: while we cannot do transmission and decoding in parallel for parts of the same frame, we can receive and copy the frame in parallel. This could maybe save 1-2 ms. I know because we previously managed to shave off some latency just by being careful not to make unnecessary copies of the data.

zarik5 commented 1 week ago

This would actually be the first instance of intra-frame pipelining in ALVR. Currently we show statistics as a simple bar graph, where it's impossible to show overlaps. I plan to rework this soon by switching to a hybrid scatter/timeline plot. Let's work on this after I've finished the statistics rewrite.

vosk commented 1 week ago

Thank you for taking notice :) You are talking about StreamSocket.recv, around line 563, I guess. I haven't wrapped my head around the shard logic yet, but I believe it would require "partial packet" logic around the whole thing.

zarik5 commented 1 week ago

Actually no, I didn't mean touching the StreamSocket logic. I meant sending smaller packets and receiving smaller packets, so that StreamSocket will not split the packets at all. Doing this will not take advantage of the packet reordering logic, but it doesn't matter: we can afford to wait, because the copy logic is much faster than the network transmission.

In any case, we would still need to test this; it's not guaranteed that it will reduce latency. It will increase CPU usage, because the send, recv, and copy logic needs to be repeated for each smaller packet instead of once per frame. We might need to find a compromise on the packet size. I'd say this improvement is a micro-optimization, since the most we can save is one video frame copy.
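For illustration, the send side could look something like this sketch; `CHUNK_SIZE` and `send_video_packet` are made-up names, not ALVR APIs:

```rust
// Slice one encoded frame into fixed-size chunks so the receiver can copy
// each chunk into the decoder buffer while the next one is still in flight.
const CHUNK_SIZE: usize = 16 * 1024; // assumed value; would need tuning

fn send_frame_in_chunks(frame: &[u8], mut send_video_packet: impl FnMut(&[u8])) {
    for chunk in frame.chunks(CHUNK_SIZE) {
        // Each chunk stays below the size StreamSocket would shard at,
        // so no splitting/reassembly happens on the receiving side.
        send_video_packet(chunk);
    }
}
```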

zarik5 commented 1 week ago

As suggested by @xytovl, we don't need to use BUFFER_FLAG_PARTIAL_FRAME; we can just acquire an input buffer and queue it once it has been filled. Let's keep this issue open for the copy-pipelining feature.

vosk commented 1 week ago

I see multiple ideas overlapping, so I'll try to summarize what I understand:

1. Send smaller packets to avoid splitting them up.
2. Decode partial frames.
3. Inject decoder buffers directly into the socket to avoid copies.

The obvious caveat right now, as I see it, is that we are currently guaranteed to have a whole frame before push_frame_nal.


My thoughts: there are two "weak" points in the code as I understand it: a) there is a ptr copy in push_frame_nal that copies from the socket buffer into the decoder buffer; b) there is a resize operation inside StreamSocket.

I believe b) is not really a problem, because shard_length is probably constant, so the resize is not that common. Not sure.

My suggested approach is two-fold: i) Make ReceiverData.get_shard() able to deliver partial data. This means the frame is whole, but not necessarily in a contiguous buffer, which in turn means ReceiverData must be mutable to hold internal state. Then push_frame_nal(some_object) can do:

```rust
while some_object.have_more() {
    decoder.push_partial(some_object.next());
}
```

ii) Make subscribe_to_stream() require a supplier object that can deliver buffers to the socket logic. Each buffer can be a shard (as it is now).

i + ii) means the decoder provides the buffers, they are filled with shards, and each can be allocated once and delivered in order as data comes through, allowing lost packets and skipped frames to be handled as they are now.
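Something like this sketch, where `BufferSupplier` and all method names are invented for illustration rather than existing ALVR types:

```rust
// The decoder hands empty buffers to the socket layer; the socket fills one
// buffer per shard and returns it, tagged with its index, so ordering and
// lost-packet handling keep working as they do today.
trait BufferSupplier {
    /// Hand out an empty buffer sized for one shard.
    fn acquire(&mut self) -> Vec<u8>;
    /// Take back a filled shard, tagged with its index.
    fn submit(&mut self, shard_index: usize, shard: Vec<u8>);
}

fn receive_stream(
    supplier: &mut dyn BufferSupplier,
    // Stand-in for the socket: fills the buffer and returns the shard index,
    // or None when the stream ends.
    mut recv_shard_into: impl FnMut(&mut Vec<u8>) -> Option<usize>,
) {
    loop {
        let mut buffer = supplier.acquire();
        match recv_shard_into(&mut buffer) {
            Some(index) => supplier.submit(index, buffer),
            None => break,
        }
    }
}
```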

Noob question: how do I make rust-analyzer build dependencies for Android? Where do I put --platform in VS Code?

zarik5 commented 1 week ago

I don't think we can do any better: we have one copy from the socket buffer to the decoder buffer. On the server/send side, instead, we need to prefix the data with the packet header (at least the packet index), which means a copy. So there is nothing we can save.

zarik5 commented 1 week ago

> i) Make ReceiverData.get_shard() able to deliver partial data. This means the frame is whole, but not necessarily in a contiguous buffer, which in turn means ReceiverData must be mutable to hold internal state. Then push_frame_nal(some_object) can do `while some_object.have_more() { decoder.push_partial(some_object.next()); }`

This is an interesting idea, but of limited use. We can just make it its own separate thing for video.

> Noob question: how do I make rust-analyzer build dependencies for Android? Where do I put --platform in VS Code?

You should be able to change the "check" command for rust-analyzer. I did it a few times, but changing it back and forth got old. If I need to edit Android cfg-gated files, I temporarily comment out the cfg-gate.
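For reference, a sketch of what that could look like in .vscode/settings.json; the target triple here is an assumption, use whatever ALVR's client actually targets:

```jsonc
{
  // Make rust-analyzer resolve cfg(target_os = "android") code paths.
  "rust-analyzer.cargo.target": "aarch64-linux-android",
  // Replace the default `cargo check` invocation with an Android-targeted one.
  "rust-analyzer.check.overrideCommand": [
    "cargo", "check", "--target", "aarch64-linux-android", "--message-format=json"
  ]
}
```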

vosk commented 1 week ago

> This is an interesting idea, but of limited use. We can just make it its own separate thing for video.

I agree. Also for audio, maybe?

zarik5 commented 1 week ago

I would say audio is not even a hot code path. The data size is much lower, plus we are not as sensitive to audio latency

vosk commented 1 week ago

For posterity: I tried keeping each shard in its own buffer to avoid resizing, and feeding queue_input_buffer with PARTIAL_FRAME, but the problem is that MediaCodec (depending on the codec implementation, I guess) may run out of buffers, forcing the shards to be merged into a contiguous buffer at the decoder anyway. Somewhere along the way I made StreamSocket have less jitter, but time-to-photon actually measures worse.

zarik5 commented 1 week ago

Ah OK. I think the best thing is to compromise by splitting the frame into a set number of packets, so as not to exceed the maximum number of MediaCodec input buffers.
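The arithmetic for that compromise is just a ceiling division; a sketch, with `MAX_INPUT_BUFFERS` as a made-up constant (the real limit comes from the codec at runtime):

```rust
const MAX_INPUT_BUFFERS: usize = 8; // assumed; query the codec in practice

// Smallest chunk size that keeps the chunk count at or below the
// decoder's input-buffer budget.
fn chunk_size_for_frame(frame_len: usize) -> usize {
    frame_len.div_ceil(MAX_INPUT_BUFFERS)
}
```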

vosk commented 6 days ago

I am having trouble with the decoder flags. If anyone is interested in testing on another device (just for the heck of it at this point), I can provide a branch to play around with. I am testing on a Xiaomi Note 7, and the decoder doesn't really know what to do with the partial frames. It almost looks like it does not support the flag and simply tries to render the partial frame. It's weird: the top of the image is perfect, and the rest is extremely low bitrate. It's as if the first partial frame is all that gets rendered.

zarik5 commented 6 days ago

That's not the first time we've seen vendors not supporting MediaCodec flags. It seems the best option is to not mess with MediaCodec at all.