csong27 / membership-inference

Code for Membership Inference Attack against Machine Learning Models (in Oakland 2017)

About Algorithm 1 Data Synthesis Using the Target Model #9

Closed icmpnorequest closed 4 years ago

icmpnorequest commented 4 years ago

Hi Dr. Song,

Thank you for providing the source code for the paper. I have been reading the paper and reproducing its experiments. However, I found that in this code and in other implementations (such as ml-leaks, cyphercat, and mia), the training data for the shadow models is either drawn from records of a specific dataset (like CIFAR-10) that are disjoint from the target model's training set, or produced by replacing k features. This seems to differ somewhat from the original algorithm in the paper.

I implemented Algorithm 1 (data synthesis using the target model) myself in PyTorch. I generate a random tensor of size (1, 3, 32, 32) for the CIFAR-10 dataset and follow the two phases, search and sample, as in the paper's algorithm. The code is below:

import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

# gen_class_tensor, nn_predict_proba, nn_predict and rand_tensor are my own
# helper functions, defined elsewhere (not shown here).


def data_synthesize(net, trainset_size, fix_class, initial_record, k_max,
                    in_channels, img_size, batch_size, num_workers, device):
    """
    Synthesize one record that the target model assigns to fix_class.

    Returns a tuple (X_tensor, y_c): the record and the target model's
    probability for fix_class on that record.
    """
    # Initialize X_tensor with an initial_record of size (1, in_channels, img_size, img_size)
    X_tensor = initial_record
    # Label tensor filled with fix_class, matching the number of records in X_tensor
    y_tensor = gen_class_tensor(trainset_size, fix_class)

    X_new_tensor = X_tensor  # best record accepted so far
    y_c_current = 0          # target model's probability of the fixed class for the best record
    j = 0                    # consecutive rejections counter
    k = k_max                # search radius
    max_iter = 100           # max number of iterations
    conf_min = 0.1           # min probability cutoff to consider a record a member of the class
    rej_max = 5              # max number of consecutive rejections
    k_min = 1                # min radius of feature perturbation

    for _ in range(max_iter):

        dataset = TensorDataset(X_tensor, y_tensor)
        dataloader = DataLoader(dataset=dataset, batch_size=batch_size, num_workers=num_workers, shuffle=True)

        # Target model's probability of fix_class for the current record
        y_c = nn_predict_proba(net, dataloader, device, fix_class)

        # Phase 1: Search
        if y_c >= y_c_current:
            # Phase 2: Sample
            if y_c > conf_min and fix_class == torch.argmax(nn_predict(net, dataloader, device), dim=1):
                return X_tensor, y_c

            X_new_tensor = X_tensor  # accept the proposed record
            y_c_current = y_c        # renew variables
            j = 0
        else:
            j += 1
            if j > rej_max:  # too many consecutive rejections: halve the search radius
                k = max(k_min, int(np.ceil(k / 2)))
                j = 0
        # Propose a new record by perturbing k features of the best record so far
        X_tensor = rand_tensor(X_new_tensor, k, in_channels, img_size, trainset_size)

    return X_tensor, y_c

However, the prediction probability it produces is very low, around 0.1. Could you please give me some guidance on the data synthesis algorithm (Algorithm 1), or upload the code for it? Thanks in advance for your patience!

Best wishes,

Yantong

shiwen1997 commented 4 years ago

Hi bro, I don't understand these two lines of the algorithm in the paper:

10: if rand() < yc then # sample
11: return x # synthetic data

Could you explain them? Thanks!

icmpnorequest commented 4 years ago

> Hi bro, I don't understand these two lines of the algorithm in the paper:
>
> 10: if rand() < yc then # sample
> 11: return x # synthetic data
>
> Could you explain them? Thanks!

You could refer to the sampling phase (Phase 2) in the paper. yc is the target model's classification probability for class c on the current record. A random number is drawn, and if it is smaller than yc, the record is returned as synthetic data, so a record with a higher yc is more likely to be selected. That's my understanding. I hope it helps.
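The sampling step discussed above can be illustrated with a small sketch. It assumes that `rand()` in the paper's pseudocode draws uniformly from [0, 1); the function name `accept_with_probability` is hypothetical and does not come from this repository.

```python
import random


def accept_with_probability(y_c, rng=random):
    """Return True with probability y_c (my reading of lines 10-11 of Algorithm 1)."""
    # rng.random() is uniform on [0, 1), so the comparison succeeds
    # in a fraction y_c of the draws.
    return rng.random() < y_c


if __name__ == "__main__":
    rng = random.Random(0)
    trials = 10_000
    for y_c in (0.1, 0.5, 0.9):
        accepted = sum(accept_with_probability(y_c, rng) for _ in range(trials))
        print(f"y_c={y_c}: accepted {accepted / trials:.2f} of candidate records")
```

Under this reading, a candidate that the target model classifies with confidence 0.9 is returned about nine times out of ten, while a candidate at 0.1 is mostly sent back to the search phase.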

mjuarezm commented 4 years ago

@icmpnorequest I also had to write the procedure that synthesizes the dataset used to train the shadow models myself. I am disappointed that it is not published in this repository, as I think it is fundamental for the black-box setting. I think Shokri et al. use a search algorithm based on hill-climbing that maximizes the confidence score. @csong27: could you please upload the code that you used to generate the synthetic datasets for the shadow models, or otherwise explain why it is not included in this repository? Thank you.
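For readers following the thread, here is a minimal, self-contained sketch of the hill-climbing search as I read it from Algorithm 1 of the paper. It is not the authors' implementation, which this repository does not include. The toy target model (`toy_target`), the binary feature space, the helper `perturb`, and all parameter values (such as `conf_min=0.8`) are assumptions made only for illustration.

```python
import math
import random


def toy_target(x, num_classes=3):
    """Hypothetical stand-in for the target model: softmax over fixed linear scores."""
    scores = [sum(xi * ((i + c) % num_classes - 1) for i, xi in enumerate(x))
              for c in range(num_classes)]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def perturb(x, k, rng):
    """Re-randomize k randomly chosen binary features of x."""
    x = list(x)
    for i in rng.sample(range(len(x)), k):
        x[i] = rng.randint(0, 1)
    return x


def synthesize(target, c, n_features, rng, k_max=8, k_min=1,
               conf_min=0.8, rej_max=5, iter_max=1000):
    """Search for a record that target assigns to class c with high confidence."""
    x = [rng.randint(0, 1) for _ in range(n_features)]  # random initial record
    x_best, y_best = x, 0.0
    j, k = 0, k_max  # consecutive rejections, search radius
    for _ in range(iter_max):
        y = target(x)
        if y[c] >= y_best:  # search phase: the proposal is at least as good
            predicted = max(range(len(y)), key=y.__getitem__)
            if y[c] > conf_min and predicted == c:
                if rng.random() < y[c]:  # sample phase: accept with probability y_c
                    return x, y[c]
            x_best, y_best, j = x, y[c], 0
        else:
            j += 1
            if j > rej_max:  # too many rejections in a row: halve the radius
                k = max(k_min, math.ceil(k / 2))
                j = 0
        x = perturb(x_best, k, rng)  # propose a new record near the best one
    return None  # synthesis failed within iter_max iterations


if __name__ == "__main__":
    result = synthesize(toy_target, c=0, n_features=16, rng=random.Random(0))
    print(result)
```

In this sketch `conf_min` is set well above the uniform-guess probability of 1/3 for three classes, so a returned record is one the toy model is confident about. A cutoff close to the uniform-guess probability would let the search stop at almost any record whose top class happens to be `c`.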

csong27 commented 4 years ago

Hi, please contact Reza Shokri (reza@comp.nus.edu.sg) for the implementation of the data synthesis.

icmpnorequest commented 4 years ago

@csong27 Thank you so much!