emp-toolkit / emp-zk

Efficient and Interactive Zero-Knowledge Proofs

Implementation's sensitivity to network latency #7

Closed · weikengchen closed this issue 3 years ago

weikengchen commented 3 years ago

We know QuickSilver is almost constant-round.

The current implementation, however, seems to be quite sensitive to network latency for a statement that I am proving.

The proving time seems much less sensitive to network bandwidth. I also tried switching the TCP congestion control algorithm to HTCP with a 32 MB TCP buffer, but there is basically no change.

After looking at the code, I am still unsure what is happening. Some early findings:

Nevertheless, a potential optimization is to delay all the input/multiplication messages until the reveal step, i.e., call io->send only three times during the online phase. This is complicated, yet it might be generally useful: instead of hoping the network figures out the right way to send the data, we make a few big IO calls ourselves.
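A rough sketch of what I have in mind, written against a hypothetical `BatchedSender` helper (the `send_data`/`flush` names are assumed to follow emp-tool's NetIO style; this is not the actual emp-zk code):

```cpp
// Hypothetical helper, not part of emp-zk: queue small messages locally and
// push them to the network in one big send at reveal time.
#include <cstddef>
#include <cstdint>
#include <vector>

template <typename IO>
class BatchedSender {
public:
    explicit BatchedSender(IO* io) : io_(io) {}

    // Buffer a message instead of sending it immediately.
    void queue(const void* data, std::size_t nbytes) {
        const uint8_t* p = static_cast<const uint8_t*>(data);
        buffer_.insert(buffer_.end(), p, p + nbytes);
    }

    // One big IO call (e.g., at input, multiplication check, and reveal).
    void flush_all() {
        if (buffer_.empty()) return;
        io_->send_data(buffer_.data(), buffer_.size());
        io_->flush();
        buffer_.clear();
    }

private:
    IO* io_;
    std::vector<uint8_t> buffer_;
};
```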

A PR might be incoming.

wangxiao1254 commented 3 years ago

Calling send_data multiple times should not be impacted by RTT much, because all the calls are in the same direction anyway. Right now, I think we need fewer than 5 round trips per 10^7 gates. What's your statement size? Is it boolean or arithmetic?

weikengchen commented 3 years ago

Yes, exactly---that is why I am unsure.

The statement is arithmetic; it comes from a polynomial proof system.

Hard to estimate the number of gates, but it used 7 million OT triples from VOLE.

weikengchen commented 3 years ago

For reference, the communication measured from io->counter is 500 MB. But varying the bandwidth seems to have little effect.

wangxiao1254 commented 3 years ago

Oooh, change https://github.com/emp-toolkit/emp-zk/blob/master/emp-zk/emp-zk-arith/ostriple.h#L12 to 1 and see if it works. The value 8 means we do the checking 8 times more frequently (to reduce memory).

For boolean, we use Fiat-Shamir (FS), so the frequency of checking does not matter much; but for arithmetic, we cannot use FS (not enough concrete security).
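For concreteness, the suggestion is to change the constant on that line (FP_CHECK_DIV_SZ, per the discussion below); the exact definition syntax shown here is an assumption, not a quote of the file:

```cpp
// emp-zk/emp-zk-arith/ostriple.h, line 12 (definition form assumed):
// #define FP_CHECK_DIV_SZ 8   // default: check 8 times per 10^7-triple batch (less memory)
#define FP_CHECK_DIV_SZ 1      // check once per batch: more memory, fewer check rounds
```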

wangxiao1254 commented 3 years ago

BTW, the boundary between the offline and online phases is not very clear in our case. To have a clean offline phase, we would need to run the offline phase for all gates and store the results, which is a huge memory cost. What we do instead is prepare the offline material for 10^7 gates, run the online phase for those 10^7 gates, then switch back to the offline phase for the next 10^7, and so on. So #rounds ≈ (#gates / 10^7) * small constant.
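If it helps, here is an illustrative sketch of that interleaving (run_offline_phase and run_online_phase are hypothetical placeholders, not the actual emp-zk API):

```cpp
// Illustrative only: alternate offline and online work in chunks of 10^7 gates,
// so memory stays bounded and the number of rounds is roughly
// (num_gates / CHUNK) * small constant.
#include <algorithm>
#include <cstdint>

constexpr uint64_t CHUNK = 10'000'000;  // 10^7 gates per offline/online pass

// Placeholders for the real protocol steps.
void run_offline_phase(uint64_t n) { (void)n; /* generate VOLE correlations for n gates */ }
void run_online_phase(uint64_t n)  { (void)n; /* evaluate and check n gates */ }

void prove(uint64_t num_gates) {
    for (uint64_t done = 0; done < num_gates; done += CHUNK) {
        uint64_t n = std::min(CHUNK, num_gates - done);
        run_offline_phase(n);
        run_online_phase(n);
    }
}
```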

weikengchen commented 3 years ago

After the change in ostriple.h, the running time is still similar.

I will come back to this one later. If useful, I can clean up my code a little bit and share a repo.


And yes. Luckily in my situation, the one big bag of OT triples (10^7) is sufficient for the statement (~6 * 10^6).

~Probably also interesting: even on the same machine, proving a statement of size 6 * 10^6 is about 2x-3x slower than proving a tiny statement, though both of them use one bag of OT triples (same offline phase). This may suggest that the major overhead is not in the offline time, but in the online time. Does that match the expectation from the protocol?~

weikengchen commented 3 years ago

A correction: the number of VOLE correlations being used is 77 million, so there are indeed 8 calls to the extension. The previous reply contains an incorrect number.

wangxiao1254 commented 3 years ago

Let me summarize: there are two issues: 1) the per-multiplication speed decreases significantly as the latency increases; 2) the per-multiplication speed decreases significantly as the batch size becomes small, even without counting the offline phase. I suppose both are about the arithmetic setting?

weikengchen commented 3 years ago

I will do a more controlled experiment.

Do you mean the per-multiplication speed "increases" or "decreases" significantly when the latency increases? And does "batch size" here refer to FP_CHECK_DIV_SZ in ostriple.h specifically?

wangxiao1254 commented 3 years ago

Updated! That's right. batch_size = 10^7 / FP_CHECK_DIV_SZ.
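As a quick back-of-the-envelope (illustrative only, numbers taken from this thread):

```cpp
// Back-of-the-envelope for the check batch size (illustrative only).
#include <cstdio>

int main() {
    const long total_triples = 10'000'000;  // 10^7 triples per extension
    const int div_default = 8;              // FP_CHECK_DIV_SZ as shipped
    const int div_suggested = 1;            // the change suggested above

    // With 8: eight smaller checks of 1,250,000 triples each (less memory).
    std::printf("batch size with FP_CHECK_DIV_SZ = 8: %ld\n", total_triples / div_default);
    // With 1: a single check over all 10,000,000 triples (fewer check rounds).
    std::printf("batch size with FP_CHECK_DIV_SZ = 1: %ld\n", total_triples / div_suggested);
    return 0;
}
```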