aymuos15 opened this issue 3 months ago
Ideally ResNet-18. Most vision DP benchmarks use ResNet-9/18.
Update!
Alright, so I'm gonna push an update with some new tech and an example this evening, but as is, BloGS isn't able to provide very strong privacy guarantees for CNNs. This is because BloGS relies on the ratio of block size to parameter dimension in its privacy guarantees. CNNs often have a cascading architecture where parameter dimensions get bigger the deeper we get into the network, starting off at 32 x 32 or 64 x 64 and getting up to 512 x 512 or more for larger networks. This causes the privacy loss for the smaller parameter groups to be large even with a block size of 1 (e.g., a loss of 10 epsilon for a block size of 1 with parameter size 32 or 64).
Given this, the privacy loss is inflated for smaller parameter groups, and the inflation compounds when we compose the privacy loss across parameter groups. I need to go back to the theory and come up with new math to handle this.
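To make that compounding concrete, here's a toy sketch (illustrative numbers only, not the actual BloGS accountant) of what happens when per-group epsilon scales like block size over parameter dimension and you compose with a basic sum:

```python
# Toy model of per-group privacy loss (NOT the real BloGS accountant):
# assume epsilon for a group scales with block_size / dimension, with the
# scale picked so that dim=32, block=1 costs ~10 epsilon, as above.
def toy_group_epsilon(dim, block_size=1, scale=320.0):
    return scale * block_size / dim

cnn_dims = [32, 64, 64, 128, 128, 256, 256, 512, 512]  # cascading CNN widths
vit_dims = [768] * 9                                    # uniform ViT widths

# Basic composition just sums the per-group losses, so the small early
# CNN layers dominate the total.
print(sum(toy_group_epsilon(d) for d in cnn_dims))  # ~28.8
print(sum(toy_group_epsilon(d) for d in vit_dims))  # ~3.8
```

Same number of groups, but the uniform dimensions keep every term small -- that's roughly why Transformers fare so much better here.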
So, TLDR:
BloGS is great with Transformers, not great with CNNs. Resolving this is gonna take me a month or so, probably. Maybe more. Or maybe I won't be able to resolve it, haha. We'll see! Fingers crossed.
Alright -- check it out @aymuos15! It's pushed. Example in the README.md file: https://github.com/dzagardo/forgetnet/
You can see the issue below in the pic. Most of the block sizes are 1, but the spent epsilon is around 1,000. CNNs are tricky. Working on a fix for this! Hoping I can finagle a unified theory for CNNs and Transformers, both small parameter regimes and large parameter regimes -- but we'll see. It is not easy, haha. Optimizing for one regime seems to sacrifice the other.
Thank you very very much for such a detailed explanation (and of course the example as well!)
This seems to be quite interesting.
Once I'm a bit more free, I will try to get this running for a vision transformer. Do you think that would be a bit more realistic?
Regarding the ratio issue, is it not possible to do some form of a pseudo-ratio, or would that inherently break the privacy?
TLDR - I think we might be in business. Check out the example below. Some clear room for improvement, but definitely heading in the right direction!
BIG NOTE: I'm in the process of re-testing the membership inference and data extraction vulnerability. It's looking solid (knock on wood). Utility boost with a sliiiiight increase in vulnerability, but still comparable to DP-SGD. Testing across the 5 models originally done in the paper, at epsilon = 1, 10, 100, 1000, and 10000. Should have results for the in-between values (2.5, 5.0, etc.) later tomorrow, and from there I'll hope to verify that its privacy-protection qualities are statistically indistinguishable from traditional DP-SGD, with (hopefully) a statistically significant performance boost w.r.t. perplexity :)
And that's a really great idea @aymuos15, using Vision Transformers! So I made some pretty major updates to the implementation. I'd been taking a parameter-based approach, which has its merits -- stricter privacy guarantees -- but it's often tough on utility, even with traditional DP-SGD. I went ahead and relaxed the implementation to a module/layer-based approach, and we're doing a lot better on the privacy/utility tradeoff (see layerwise clipping here: https://arxiv.org/abs/2307.11939)! The module-based approach is allowing us to get down to epsilon = 15 with ResNet in the example, with a target of 1.0 epsilon. That's two entire orders of magnitude better than the previous 1137 epsilon, which is pretty radical. Not perfect, but we're getting there. Composing the privacy loss over layers instead of parameters gives us a much better privacy accounting scenario, since we don't have to compose as many epsilons for small parameter groupings.
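Roughly, the accounting difference looks like this (a sketch with a made-up per-group epsilon, not the real accountant -- just to show how module-based grouping means fewer, larger groups to compose over):

```python
import torchvision.models as models

model = models.resnet18(num_classes=10)

# Parameter-based: one privacy charge per parameter tensor.
param_groups = [p.numel() for p in model.parameters()]

# Module-based: one charge per module that directly owns parameters,
# pooling that module's tensors (e.g. a BatchNorm's weight + bias) together.
module_groups = [
    sum(p.numel() for p in m.parameters(recurse=False))
    for m in model.modules()
    if any(True for _ in m.parameters(recurse=False))
]

# Made-up per-group epsilon that shrinks as the group grows (illustrative only).
def toy_eps(n):
    return 10.0 / n ** 0.5

print(len(param_groups), sum(toy_eps(n) for n in param_groups))    # more, smaller groups
print(len(module_groups), sum(toy_eps(n) for n in module_groups))  # fewer, bigger groups
```

Fewer groups means fewer epsilons in the composition, and pooling the tiny tensors (biases, norms) into their layers stops them from dominating the total.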
Also, I tried some vision transformers: I was able to get down to 2.5 eps at a target of 1.0 eps with vit_small_patch16_224, and down to 1.00 eps at a target of 1.0 eps with vit_base_patch16_224. The bigger, the better. But 2.5 is pretty solid for the small version! Stronger privacy guarantees than 15 eps with ResNet. Pre layer-wise approach, I was getting down to 500 eps or so with ResNet using the parameter-based method.
The example in the README should run outta the box and will get you to 15 eps. Using the small model (vit_small_patch16_224) will get you to 2.5 eps with 4.5GB GPU usage. If you swap out the model from the quickstart using the code below, you should be able to get to 1.0 eps, though it does require running on an L4 GPU in Google Colab (total of 20GB RAM with the example):
Make sure to `!pip install timm forgetnet==0.1.12` first.
```python
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader
from forgetnet import BloGSPrivacyEngine
import timm

def mnist_vit():
    # Load a ViT backbone without a classification head (num_classes=0)
    model = timm.create_model('vit_base_patch16_224', pretrained=False, num_classes=0)
    # Attach a new head for MNIST's 10 classes; embed_dim is the number of
    # output features from the transformer blocks
    model.head = nn.Linear(model.embed_dim, 10)
    return model

# Device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Transformations: resize to 224x224, replicate grayscale to 3 channels,
# convert to tensor, normalize with standard ImageNet statistics
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.Grayscale(num_output_channels=3),
    transforms.ToTensor(),
    transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225))
])

# Hyperparameters
batch_size = 64
learning_rate = 0.01
epochs = 10
target_epsilon = 1.0
delta = 1e-5
clip_value = 1.0

# Load MNIST dataset
train_dataset = datasets.MNIST('./data', train=True, download=True, transform=transform)
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
total_iterations = (len(train_dataset) // batch_size) * epochs

# Initialize model and optimizer
model = mnist_vit().to(device)
optimizer = optim.SGD(model.parameters(), lr=learning_rate)

# Wrap the optimizer with the PrivacyEngine
privacy_engine = BloGSPrivacyEngine(
    optimizer=optimizer,
    model=model,
    target_epsilon=target_epsilon,
    delta=delta,
    clip_value=clip_value,
    steps=total_iterations,
    batch_size=batch_size
)

# Training loop
model.train()
for epoch in range(epochs):
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)
        privacy_engine.zero_grad()
        output = model(data)
        loss = nn.functional.cross_entropy(output, target)
        loss.backward()
        # step() returns the privacy spent so far (renamed to avoid
        # shadowing the delta hyperparameter above)
        epsilon_spent, delta_spent = privacy_engine.step()
        if batch_idx % 100 == 0:
            print(f'Epoch {epoch}, Batch {batch_idx}, Loss: {loss.item():.4f}, Epsilon: {epsilon_spent:.4f}')

# Get total privacy spent after training
total_epsilon_spent = privacy_engine.get_privacy_spent()
print(f"Total privacy spent: ε = {total_epsilon_spent:.4f}")
```
Awesome, great news @aymuos15 -- on average, Layer-wise DP-BloGS performs significantly better, by about 16% w.r.t. perplexity. There is no significant difference in membership inference vulnerability or data leakage between Layer-wise DP-BloGS and DP-SGD :) Tested across 5 models trained 30 times each at different values of epsilon.
Hi @dzagardo
Once again, thank you very much for all the work and such detailed explanations. I think there are a few terms and insights which I am not able to fully comprehend yet. Let me quickly read up on those topics and I will get back to this shortly. Hope that's okay.
> Using the small model (vit_small_patch16_224) will get you to 2.5 eps with 4.5GB GPU usage

I am extremely excited about this :)
All good, keep me posted -- lemme know if that gets you sorted @aymuos15!
Lemme take a look! What model are you working with?