darkdragon-001 / encrypted-mcu

Proof of concept for an encrypted audio Multipoint Control Unit (MCU).
MIT License

Use more complex codecs #1

Open darkdragon-001 opened 4 years ago

darkdragon-001 commented 4 years ago

Many codecs (e.g. Opus, whose CELT layer is based on the Modified Discrete Cosine Transform, MDCT) represent audio in the frequency domain. The hope is that mixing could then be done in the frequency domain by simple summation of the frequency samples.

Any help in this area is more than welcome!


EDIT: Summary of a discussion with @gmaxwell on #opus on irc.freenode.net. I will keep this list updated while getting new insights.

The transformation between the time and frequency domain by itself doesn't provide any compression. Different approaches need different data representations.

Discussed approaches:

Potential problems with a homomorphic MCU in general:


As an alternative to a homomorphic MCU, @gmaxwell mentioned a distributed MCU. Concept:

You take the latencies of all participants between each other and construct a minimum spanning tree with bounded degree (using Kruskal's algorithm). Each user will receive incoming audio feeds from up to two other users; they'll decompress and mix them with their own, and send the result up the tree. Once the final mix is generated, it's distributed back to everyone.
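The tree construction above could be sketched as follows. This is an illustrative assumption, not code from the project: the edge format, `MAX_DEGREE` value, and greedy degree cap are all mine. Note that a true minimum spanning tree with a hard degree bound is NP-hard in general, so a Kruskal variant that simply skips degree-violating edges is only a heuristic.

```python
MAX_DEGREE = 3  # assumption: up to two incoming feeds plus one upstream link

def bounded_kruskal(n, latencies, max_degree=MAX_DEGREE):
    """latencies: list of (delay_ms, a, b) edges between participants 0..n-1.
    Greedy Kruskal that skips edges which would exceed max_degree at either
    endpoint. Heuristic only: may fail to span adversarial inputs."""
    parent = list(range(n))

    def find(x):
        # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    degree = [0] * n
    edges = []
    for delay, a, b in sorted(latencies):
        ra, rb = find(a), find(b)
        if ra != rb and degree[a] < max_degree and degree[b] < max_degree:
            parent[ra] = rb
            degree[a] += 1
            degree[b] += 1
            edges.append((a, b))
    return edges

# example: 4 participants with hypothetical measured delays in ms
tree = bounded_kruskal(4, [(10, 0, 1), (20, 1, 2), (30, 2, 3), (40, 0, 3)])
```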

Bandwidth:

The result is 3x the bandwidth in (receive two downstream feeds to be mixed, plus one upstream feed to be passed along) and 2x the bandwidth out. So if 64 kbps Opus is acceptable quality, the resulting incoming bandwidth is only 192 kbps.

Problems:

Its main downside is latency (and maybe reliability).

Also, clients could play silly games with the audio passing through them (like muting people they don't like).

gmaxwell commented 4 years ago

Greetings. As we discussed in the opus IRC channel, although opus uses MDCT inside it, it is very much not additively homomorphic. I am extremely doubtful that any traditional codec could get good compression while being additively homomorphic.

In IRC I suggested an alternative to a homomorphic MCU, which is creating a distributed MCU. But as we discussed, that has latency and reliability downsides.

I wanted to elaborate on one compression idea that might work, and another different alternative to compression which I only thought of after you left.

I mentioned the field of compressed sensing. Let me elaborate on how I think compressed sensing might apply. The general idea is that senders send representative subsets of their data which are themselves additively homomorphic; the receivers then use fancy processing to recover the underlying audio from the sparse input (and a preprogrammed notion of what 'audio' usually looks like). Examples would be: downsampled PCM audio, irregularly sampled PCM audio, spectral band energies, or some (deterministic) random chunk of spectral lines, potentially from multiple different window sizes. Basically, any property that can be extracted from audio that is preserved under addition of the audio inputs.

LPCNet is a speech codec which reconstructs the audio from a sparse representation using machine learning, although the representation it uses is not additively homomorphic. I consider it an open question whether you could get competitive compression performance from exclusively additively homomorphic features; it doesn't seem completely impossible to me, at least for a sufficiently relaxed definition of competitive.
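The key property, that a linear feature of the mix equals the sum of the same feature of each input, can be checked with a toy example. This is not a codec, just an illustration of why a shared (deterministic) random subset of spectral lines survives mixing; the naive pure-Python DFT below is my own illustrative choice.

```python
import cmath
import random

def dft_bins(samples, bins):
    """Naive DFT evaluated only at the requested bin indices (linear in the input)."""
    n = len(samples)
    return [sum(s * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, s in enumerate(samples))
            for k in bins]

random.seed(42)
n = 64
a = [random.uniform(-1, 1) for _ in range(n)]   # sender A's audio frame
b = [random.uniform(-1, 1) for _ in range(n)]   # sender B's audio frame
bins = random.sample(range(n), 8)               # shared sparse subset of spectral lines

mixed = [x + y for x, y in zip(a, b)]
lhs = dft_bins(mixed, bins)                     # features extracted from the mix
rhs = [u + v for u, v in zip(dft_bins(a, bins), dft_bins(b, bins))]
# linearity of the DFT: feature(a + b) == feature(a) + feature(b)
assert all(abs(u - v) < 1e-9 for u, v in zip(lhs, rhs))
```

So senders could each transmit only their values at the shared bins, and summing those values yields exactly the mix's values at the same bins; the hard open question is the reconstruction side.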

The other idea I had: You mentioned 50 person conferences, presumably you are expecting almost all to be muted at any time or the conference will not be very usable. If you make a strong assumption that only a small number will transmit at once, then you could just append their packets without too much bandwidth. I didn't mention this because I figured it was obvious and you either didn't want to assume only a few would talk at once or you wanted to hide who was talking from the MCU.

After you left, I realized the few-at-once hide-who's-talking problem is easily solved using DC-nets. The idea is that there are efficient representations of sets which are additively homomorphic. Each party treats its packet (if it has one) as a set element and encodes the (potentially empty) set into a representation N times larger (say, 8 times). Then it gets encrypted and sent to the mixer, and the encryptions are added. The receivers decrypt, and so long as there were 8 or fewer packets sent at once, they can successfully decode and mix all of them. If more were sent, their set decode will fail or spit out garbage.

Minisketch is a library for this kind of additive set processing, and although the current implementation is limited to extremely small packets (8 bytes), that's just an implementation limit... and, in fact, LPCNet packets are actually 8 bytes in any case (though some overhead would be needed to indicate the sender). For minisketch, the 'addition' it is homomorphic for is just xor, so the encryption can be a plain stream cipher with a different key per sender. Assuming N is small, minisketch's approach would still likely be workably fast for pretty large (e.g. Opus-sized) packets.
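The additive set property can be demonstrated with a toy power-sum sketch over GF(2^8): a set's sketch is its vector of field power sums, and the xor of two sketches is the sketch of the sets' symmetric difference (their union, when disjoint). This is only an illustration of the homomorphism minisketch relies on; decoding a sketch back into the set (Berlekamp-Massey) is omitted, the real library uses larger fields and real packets, and the AES reduction polynomial 0x11b is an arbitrary choice of mine.

```python
from functools import reduce

def gf_mul(a, b):
    """Carry-less multiplication in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11b
        b >>= 1
    return r

def gf_pow(x, e):
    r = 1
    for _ in range(e):
        r = gf_mul(r, x)
    return r

def sketch(elements, capacity):
    """i-th component is the xor-sum of x^i over all set elements."""
    return [reduce(lambda u, v: u ^ v, (gf_pow(x, i) for x in elements), 0)
            for i in range(1, capacity + 1)]

a_packets = {3, 200}   # two senders' 1-byte "packets"
b_packets = {77}       # a third sender's packet (disjoint from the others)
combined = sketch(a_packets | b_packets, 8)
# xor of sketches == sketch of the combined set, so the mixer can combine
# blindly without decoding anything
assert [x ^ y for x, y in zip(sketch(a_packets, 8),
                              sketch(b_packets, 8))] == combined
```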

Assuming 9 bytes were needed per packet (LPCNet + a sender byte), a conference with 256 participants where any 8 could transmit at once could be supported with each user sending and receiving 14400 bits per second from the MCU. Additional overhead to make over-send reliably detectable would increase that to 16200 bits per second. Presumably clients would play a polite beep when they detect over-transmit. MCU computation would just be xoring together all packets received.
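The bit rates above can be reproduced with a little arithmetic. The 40 ms LPCNet frame duration (25 frames per second) is my assumption to make the numbers come out; the packet sizes and capacity come from the discussion.

```python
FRAMES_PER_SEC = 25   # assumption: 40 ms LPCNet frames
PACKET_BYTES = 9      # 8-byte LPCNet payload + 1 sender byte
CAPACITY = 8          # up to 8 simultaneous talkers

sketch_bytes = PACKET_BYTES * CAPACITY                           # 72 bytes/frame
rate = sketch_bytes * 8 * FRAMES_PER_SEC                         # bits per second
# one extra packet-sized slot to make over-send reliably detectable
rate_with_guard = (sketch_bytes + PACKET_BYTES) * 8 * FRAMES_PER_SEC

print(rate, rate_with_guard)  # → 14400 16200
```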

Benchmarking minisketch for 8 potential packets at the 8-byte size (the current library maximum), decoding takes 0.02 ms on a 2.25 GHz Zen 2. Extending it to 9 bytes will make it a fair bit slower (because clmul makes 64-bit fields unusually fast), but it's obviously fast at these sizes.

darkdragon-001 commented 4 years ago

@gmaxwell: Thanks for following up on the conversation with new ideas!

I updated the original issue with a summary from the opus IRC channel. Please look over it if I got something wrong.

Thanks for your more detailed description of compressed sensing! It would be interesting to sketch a concept and run an experiment to see whether it works in practice!

For my use case of 50-participant conferences, I actually mentioned speech detection in order to mute non-talking participants in the following sentence on IRC. Anyway, I would like to extend my use case to a 50-person orchestra with each participant playing a single instrument simultaneously. This adds additional quality demands on the codec, since music is much more challenging than just speech.

Can you please elaborate on the hide-who's-talking problem a bit more, especially the DC-nets: Do all participants need to pairwise exchange a key, or is a single key per sender sufficient? Who generates the key, and who needs to know it? Did I understand the combination correctly, i.e. that DC-nets require XOR-able information, which Minisketch provides, and that DC-nets compute the combination/result also via XOR? Is the result already the plaintext, or is an additional decryption needed? If so, with which key? Did I get you correctly that all participants send a message every time (e.g. an empty set)?

gmaxwell commented 4 years ago

especially the DC-nets: Do all participants need to pairwise exchange a key or is a single key per sender sufficient?

Every sender needs to end up with its own cipher stream (like AES-CTR with a different key per sender).

This could be accomplished by doing DH between all parties and using the resulting keys for each party to announce its symmetric key... or by everyone in the chat sharing a master secret and then tweaking it for each participant... or however.
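The "shared master secret, tweaked per participant" option could look like the following sketch. The use of HMAC-SHA256 and the label format are my own illustrative choices, and distributing `master_secret` in the first place (e.g. via DH) is out of scope here.

```python
import hashlib
import hmac

def sender_key(master_secret: bytes, user_id: int) -> bytes:
    """Derive a per-sender stream-cipher key from a shared master secret.
    The label "dcnet-sender-<id>" is a hypothetical domain separator."""
    return hmac.new(master_secret,
                    b"dcnet-sender-%d" % user_id,
                    hashlib.sha256).digest()

# every participant who knows the master secret can derive every sender's
# key, which is exactly what receivers need to strip all keystreams later
keys = [sender_key(b"example-master-secret", uid) for uid in range(4)]
```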

Did I understand the combination correctly that DC-nets require XORable information which Minisketch provides and DC-nets calculate the combination/result also based on XOR?

DC-nets require that the data be composable by the same "addition" that the cipher is composable by. So, for example, xor is GF(2) addition: you can compose the encryption or decryption of a stream cipher with xor, and minisketch can be composed with xor... so that works.

Alternatively, you could use linear (wrapping) PCM audio, which is composed by wrapping addition of 16-bit words and combine that with a 16-bit-word adding stream cipher. And that would also work.

Did I get you correctly that all participants send a message every time (e.g. an empty set)?

If you want to conceal who's transmitting, that's a necessary requirement.

Anyways, I would like to extend my use case to a 50 people orchestra with each participant playing a single instrument simultaneously.

For that case I think it's going to be hard to outperform linear PCM with a strong noise-shaping pre/post filter... which is at least really simple to implement.

Is the result already the plaintext or is an additional decryption needed? If so with which key?

Every sender has their own keystream, e.g. AES-CTR(key=secret||userid, pos=packetnum). To encrypt, each sender xors its data with its stream. To decrypt, users xor the received data with each sender's AES-CTR stream. Xor doesn't care about the order in which you apply things.
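A toy round-trip of this xor scheme, with SHA-256 in counter mode standing in for AES-CTR (the Python standard library has no AES); the construction is otherwise the same: each sender xors with its own keystream, the mixer only xors ciphertexts together, and receivers strip every keystream in any order.

```python
import hashlib

def keystream(key: bytes, packet_num: int, length: int) -> bytes:
    """Stand-in for AES-CTR: hash(key || packet_num || counter) blocks."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + packet_num.to_bytes(8, "big")
                              + counter.to_bytes(4, "big")).digest()
        counter += 1
    return out[:length]

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

keys = [b"sender-0-key", b"sender-1-key", b"sender-2-key"]
packets = [b"aaaaaaaa", b"\x00" * 8, b"\x00" * 8]  # only sender 0 transmits
pnum = 7                                           # agreed packet number

ciphertexts = [xor(p, keystream(k, pnum, 8)) for p, k in zip(packets, keys)]
mixed = ciphertexts[0]
for c in ciphertexts[1:]:
    mixed = xor(mixed, c)          # all the MCU ever does is xor

plain = mixed
for k in keys:                     # receivers strip every sender's keystream;
    plain = xor(plain, keystream(k, pnum, 8))  # order doesn't matter
assert plain == b"aaaaaaaa"        # xor-sum of all packets
```

The mixer never sees a keystream, so it learns nothing about which sender's ciphertext carried real data.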

Or, for linear PCM, instead of xor you'd use wrapping int16 addition to encrypt and combine, and subtraction to decrypt, because int16 addition is an appropriate addition operator for 16-bit PCM audio and is also perfectly fine for combining with a stream cipher.
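The same round-trip with 16-bit wrapping addition instead of xor, sketched with random stand-in keystreams (a real system would derive them per sender as above): encrypt by adding the keystream mod 2^16, mix by adding, decrypt by subtracting every keystream, and the wrapped sum of the plaintext PCM words falls out directly.

```python
import random

MOD = 1 << 16  # 16-bit words wrap modulo 2**16

def add16(a, b):
    return [(x + y) % MOD for x, y in zip(a, b)]

def sub16(a, b):
    return [(x - y) % MOD for x, y in zip(a, b)]

random.seed(1)
# 3 senders, 4 PCM words each; random stand-ins for audio and keystreams
pcm = [[random.randrange(MOD) for _ in range(4)] for _ in range(3)]
streams = [[random.randrange(MOD) for _ in range(4)] for _ in range(3)]

mixed = [0, 0, 0, 0]
for p, s in zip(pcm, streams):
    mixed = add16(mixed, add16(p, s))  # each sender encrypts; the MCU adds

for s in streams:
    mixed = sub16(mixed, s)            # receivers subtract every keystream

expected = [0, 0, 0, 0]
for p in pcm:
    expected = add16(expected, p)
assert mixed == expected               # wrapped sum of all PCM inputs
```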

darkdragon-001 commented 4 years ago

Does the AES-CTR stream need to be synchronized? If so, what about lost packets?

gmaxwell commented 4 years ago

Everyone agrees on a packet number that increments and never decrements. In this kind of multiparty setup, losing one person's packet loses everyone's, unless you also send metadata in the packets about who was lost.