Xtra-Computing / FedTree

A tree-based federated learning system (MLSys 2023)
https://fedtree.readthedocs.io/en/latest/index.html
Apache License 2.0

A question about running vertical FL in standalone simulation #43

Open blziz opened 1 year ago

blziz commented 1 year ago

Hi, I am confused about the missing_gh variable in the function compute_histogram_in_a_level. When using privacy_tech=he, part of missing_gh is in plaintext and the other part is in ciphertext. Is this correct? I printed the plaintext values as follows:

missing_gh_data[pid] = -340.000000/757.000000;
nodes_data[nid].sum_gh_pair = -340.000000/757.000000; 
node_gh = 0.000000/0.000000;

Is this a security risk?

blziz commented 1 year ago

missing_gh_data[pid].encrypted = false;
missing_gh_data[pid].g_enc = 0, missing_gh_data[pid].h_enc = 0;

QinbinLi commented 1 year ago

Hi @blziz ,

In vertical FL, one party (i.e., the aggregator) holds the labels and can compute the raw gradients locally, so it does not need to compute missing_gh from encrypted gradients. The party with the labels does not send missing_gh to the other parties, so this is secure.

blziz commented 1 year ago

Thank you! I confirmed that this situation occurs in parties without labels in vertical FL. The parameter settings are as follows:

data=./dataset/test_dataset.txt
test_data=./dataset/test_dataset.txt
model_path=fedtree.model
partition_mode=vertical
n_parties=1
mode=vertical
privacy_tech=he
n_trees=40
depth=6
learning_rate=0.2
partition=1

and in homo_partition(),

for (int i = 0; i < n_parties; i++) {
    if (is_horizontal) {    ...    }
    if (!is_horizontal) {
        subsets[i].y = dataset.y;
        if(i == 0)
            subsets[i].has_label = false;
        else
            subsets[i].has_label = true;
    }
    ...
}

QinbinLi commented 1 year ago

Hi @blziz ,

Thanks a lot for the information! There are indeed possible security risks. The unencrypted missing_gh arises because the whole tree model is shared among all parties in vertical FL, and the unencrypted missing_gh leaks no more information than the model itself. We are currently working on a version that does not share the whole model, which is more secure. We also noticed the following issues.

  1. In homo_partition(), the label assignment was incorrect. In the simulation, party 0 (i == 0) is the host party and should hold the labels; all other parties are guest parties and should hold only features. We have fixed this.

  2. You need to set n_parties >= 2 to simulate a meaningful federated learning scenario. In vertical FL, at least one of the parties must hold the labels. In our simulation, party 0 holds the labels and the other parties do not.
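
For illustration, the two points above could be sketched roughly as follows. This is not the actual FedTree fix. `SimpleDataSet` and `assign_vertical_labels` are hypothetical names used only for this example, and the choice to leave `y` empty for guest parties is an assumption (the quoted code copies `y` to every subset, and the thread does not say whether the fix changes that).

```cpp
#include <cassert>
#include <stdexcept>
#include <vector>

// Hypothetical, simplified stand-in for a per-party dataset (not FedTree's type).
struct SimpleDataSet {
    std::vector<float> y;    // labels; left empty for guest parties (assumption)
    bool has_label = false;  // true only for the host party
};

// Sketch of label assignment for a vertical-FL simulation:
// party 0 (host) holds the labels, every other party (guest) holds features only.
std::vector<SimpleDataSet> assign_vertical_labels(const std::vector<float>& labels,
                                                  int n_parties) {
    // A vertical simulation needs a host plus at least one guest.
    if (n_parties < 2) {
        throw std::invalid_argument("vertical FL simulation needs n_parties >= 2");
    }
    std::vector<SimpleDataSet> subsets(n_parties);
    for (int i = 0; i < n_parties; i++) {
        subsets[i].has_label = (i == 0);
        if (subsets[i].has_label) {
            subsets[i].y = labels;
        }
    }
    assert(subsets[0].has_label);  // sanity check: the host always has labels
    return subsets;
}
```

With this sketch, calling `assign_vertical_labels(labels, 1)` throws, which mirrors the requirement that `n_parties` be at least 2.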