data61 / MP-SPDZ

Versatile framework for multi-party computation

About random number generation in secure training #1298

Closed sankha555 closed 8 months ago

sankha555 commented 8 months ago

Hi @mkskeller ,

I have a quick question about random number generation with the get_random() method (used for initializing model weights and shuffling training data in neural network training):

When running secure training with 2 parties (I am using the semi2k-party.x protocol), does the random number initialization generate fresh samples by itself every time? I am not using any preprocessing here at all.

The reason I am asking is that when I use preprocessing and the number of random bits generated beforehand is too small (around 1000), the random numbers generated are all the same, which reduces training accuracy. My final work will be done in the non-preprocessing model, so I want to be sure that random number generation is correct in that case, i.e., that every call to get_random() yields fresh samples.

Thanks!

mkskeller commented 8 months ago

I assume you use sfix.get_random() in the context of machine learning. The default operation is indeed to read from the preprocessing files as the random bits used are independent of the inputs and thus part of preprocessing. ed8bdffcf0160d8f224e980b2332a0cc86a88fea adds the option to use public randomness which is generated at run time: https://mp-spdz.readthedocs.io/en/latest/Compiler.html#Compiler.types.sfix.get_random

sankha555 commented 8 months ago

I see, I'll try that.

At this point, I don't have any random bits stored in the files (I cleaned the player-data folder) and am running secure training. However, I can still see the protocol generating random values. Where are those values from? I have not passed the public_randomness parameter yet.

Could it be possible that the protocol generates them from preprocessing bits generated from scratch during the protocol run?


mkskeller commented 8 months ago

I meant to say that when using preprocessing from files with -F, you can indeed see repeated randomness as it may be reused. This doesn't affect operation without -F or -f. Does this not match your experience?
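The reuse described above can be illustrated with a toy model. This is only a sketch under loud assumptions: `CyclicBitPool` and `random_fixed_point` are hypothetical names, the wrap-around read is an assumed simplification, and none of this is MP-SPDZ's actual file handling or its `sfix.get_random()` implementation. It only shows why a pool of random bits that is too small yields identical values once it is reread.

```python
import random


class CyclicBitPool:
    """Hypothetical stand-in for a finite store of preprocessed random bits.

    Assumption: once the store is exhausted, reading restarts from the
    beginning, so the same bits are handed out again.
    """

    def __init__(self, n_bits, seed=0):
        rng = random.Random(seed)
        self.bits = [rng.getrandbits(1) for _ in range(n_bits)]
        self.pos = 0

    def take(self, k):
        """Return the next k bits, wrapping around at the end of the pool."""
        out = []
        for _ in range(k):
            out.append(self.bits[self.pos])
            self.pos = (self.pos + 1) % len(self.bits)
        return out


def random_fixed_point(pool, n_bits=16, lower=-0.1, upper=0.1):
    """Map n_bits pool bits to a value in [lower, upper] (illustrative only)."""
    bits = pool.take(n_bits)
    value = sum(b << i for i, b in enumerate(bits))
    return lower + (upper - lower) * value / (2 ** n_bits - 1)


if __name__ == "__main__":
    tiny = CyclicBitPool(n_bits=16)        # exactly one value's worth of bits
    print([round(random_fixed_point(tiny), 6) for _ in range(5)])

    ample = CyclicBitPool(n_bits=16 * 5, seed=1)   # enough bits for five values
    print([round(random_fixed_point(ample), 6) for _ in range(5)])
```

With the tiny pool every draw consumes the same 16 bits, so all five values coincide; with the larger pool each draw sees different bits.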

sankha555 commented 8 months ago

I am currently running the protocol without -F. So I am assuming the randomness is not being read from the file.

But I also observe the randomness to be normal in this case. Consider the comparison below:

When running with -F, the random vector generated is -0.063332, -0.063332, -0.063332, -0.063332, -0.063332 (all the same value).

When running without -F, the random vector generated is -0.048302, 0.07854, 0.099932, -0.00432, 0.012158 (different random values every time, as should be the normal case). This is also the desirable case for me. I just want to be sure that this randomness is generated from scratch every time the program runs and that the cost of generating it is included in the data and time costs reported at the end of the protocol.

My question is: since I am running the protocol without -F, there are no preprocessed data files in the Player-Data directory, and I am not using the public_randomness parameter, where is the randomness in the second example above coming from? Is it generated on the fly, and are its data costs included in the statistics reported at the end?

Thank you, really appreciate the time you've been putting in! :))


mkskeller commented 8 months ago

The public randomness would be generated from a shared PRNG and not accounted for. I also think it's a bit odd that the randomness is this repetitive when using -F. What code and compilation is this appearing with? And how have you generated the preprocessing?

sankha555 commented 8 months ago

This repetitive randomness appears when training a two-layer neural network using semi2k-party.x (the random vector is the initial weights). I used the command ./Fake-Offline.x -lgp 2 1000 to generate this randomness. I assume the randomness is repetitive because of the low number of bits.

But one question still remains unanswered. Let me state it in two parts:

i) The code shows that the public randomness switch is False by default. I have not explicitly switched on public randomness myself. I am also NOT running the protocol with the -F flag. In this case, the random numbers generated are fine (they are different every time). I do not understand how this randomness gets generated, since the public randomness (or PRNG) is not being used, nor are any preprocessed bits being read from files.

ii) Before this commit was pushed, if someone ran neural network training without the -F flag, how was the randomness generated? I am assuming that if the -F flag is not present, no files are read for the preprocessing bits.

Things might sound a bit repetitive now, apologies for that. But I really need to understand how the internals are working because I need an accurate profiling of the protocol in terms of time and data costs incurred in all steps.

Thank you!

mkskeller commented 8 months ago

Are you sure about ./Fake-Offline.x -lgp 2 1000? This would generate preprocessing for 1000 parties on 2-bit primes, which probably wouldn't work for machine learning.

The proper randomness generation (no -F and no public randomness) depends on the protocol and further options. One possibility is to use edaBits (https://eprint.iacr.org/2020/338) but the default is to use random bit generation. One possibility here is to compute the XOR between bits input by the parties, which guarantees that the resulting bit is uniformly random if at least one of the inputs is.
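The XOR argument can be sketched in plain Python. This is a cleartext illustration only, assuming two parties; the function names are hypothetical and it is not how MP-SPDZ implements the step, where the bits would be secret-shared inputs and the XOR would be evaluated on shares. For bits embedded in a ring, XOR can be written arithmetically as a + b - 2ab, which the sketch also checks.

```python
import secrets


def xor_bits(party_bits):
    """XOR one contributed bit per party into a single joint bit."""
    result = 0
    for bit in party_bits:
        result ^= bit & 1
    return result


def arithmetic_xor(a, b):
    """XOR of two bits expressed with ring operations: a + b - 2ab."""
    return a + b - 2 * a * b


def sample_joint_bits(n_samples, fixed_bit=1):
    """Worst case for one party: it always contributes `fixed_bit`.

    The other party contributes a fresh uniform bit, so the XOR output
    should still be uniform.
    """
    outputs = []
    for _ in range(n_samples):
        uniform_bit = secrets.randbits(1)
        outputs.append(xor_bits([fixed_bit, uniform_bit]))
    return outputs


if __name__ == "__main__":
    samples = sample_joint_bits(10_000)
    ones = sum(samples)
    print(f"ones: {ones}, zeros: {len(samples) - ones}")
```

Even though one contribution is completely predictable here, the output counts should come out close to an even split, because XOR with a fixed bit is a bijection on the uniform bit.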

sankha555 commented 8 months ago

Yes, that's how I calculated the fake offline bits. I suppose increasing the value from 2 to something like 16 will be better?

> One possibility here is to compute the XOR between bits input by the parties, which guarantees that the resulting bit is uniformly random if at least one of the inputs is.

I see, this makes sense. So does the semi2k-party.x protocol use this mode for random number generation by default? If this is the case, it fits perfectly for my usecase.

mkskeller commented 8 months ago

> Yes, that's how I calculated the fake offline bits. I suppose increasing the value from 2 to something like 16 will be better?

I can't imagine that this works at all. Are you really running 1000 parties?

> One possibility here is to compute the XOR between bits input by the parties, which guarantees that the resulting bit is uniformly random if at least one of the inputs is.

> I see, this makes sense. So does the semi2k-party.x protocol use this mode for random number generation by default? If this is the case, it fits perfectly for my usecase.

Yes.

sankha555 commented 8 months ago

> I can't imagine that this works at all. Are you really running 1000 parties?

No, I am just running 2. But doesn't this run it for a subset of the total parties too? Sorry if I misunderstood the working here.

> Yes.

Thank you!

mkskeller commented 8 months ago

No. You have to specify the exact number of parties for it to work. If it still works for another number of parties, you might have some preprocessing files lying around from an earlier call. Generally, I'd recommend removing the entire Player-Data directory whenever you see unusual behaviour with preprocessing files, as you already did earlier.

sankha555 commented 8 months ago

I see, will take care of that. Thanks a lot for the extended help throughout! :)