h-ahmad closed this issue 2 years ago.
@h-ahmad, please provide the SWCI, SWOP, SL, and user (ML) container logs.
SWCI log:
SWOP log:
SL log:
The last image looks like an ML log, not an SL log.
Do you have the SL logs?
It looks like the ML sent data to the SL, but it appears that no response was returned from the SL to the ML.
Therefore, the SL logs may be the key to resolving the issue.
I have tried to capture the logs of all 7 containers taking part in the process. The screenshots were taken with `docker logs <container_id> --follow` at the moment the error occurred. The screenshots are in the same order as the numbered list of containers in the first image, as follows.
SN log:
SWOP log:
SWCI log:
The logs alone do not show me the cause of the problem.
Also, I have not been able to reproduce it.
Would it be possible for you to give me details of what you have changed from the example?
What I have changed in the existing code is:
the loss function, changed from negative log-likelihood (`F.nll_loss`) to cross-entropy loss. The rest of the code is the same as in the example, as given below; the lines I changed are marked with `# changed`.

```python
import os
from swarmlearning.pyt import SwarmCallback
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchmetrics.classification import Accuracy, Precision, Recall, Specificity, F1Score, AUROC, AUC, ConfusionMatrix
from custom1.data_loader import Cifar10Loader
import torchvision.transforms as transforms

transform = transforms.Compose([transforms.ToTensor(),
                                transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])
directory = os.getenv('DATA_DIR', '/platform/scratch')
csv_path = os.path.join(directory, 'labels1.csv')
data_path = os.path.join(directory, 'train1')
trainset = Cifar10Loader(csv_path, data_path, transform)
test_csv = os.path.join(directory, 'test.csv')
test_data = os.path.join(directory, 'test')
testset = Cifar10Loader(test_csv, test_data, transform)
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
default_max_epochs = 2
default_min_peers = 2
trainPrint = True
swSyncInterval = 128
loss_fn = torch.nn.CrossEntropyLoss(weight=None, size_average=None, ignore_index=-100,
                                    reduce=None, reduction='mean')

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = torch.flatten(x, 1)  # flatten all dimensions except batch
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

def doTrainBatch(model, device, trainLoader, optimizer, epoch, swarmCallback):
    model.train()
    for batchIdx, (data, target) in enumerate(trainLoader, 0):
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        # loss = F.nll_loss(output, target[0][0])
        loss = loss_fn(output, target[0][0])  # changed: cross-entropy instead of NLL
        loss.backward()
        optimizer.step()
        if trainPrint and batchIdx % 100 == 0:
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                epoch, batchIdx * len(data), len(trainLoader.dataset),
                100. * batchIdx / len(trainLoader), loss.item()))
        if swarmCallback is not None:
            swarmCallback.on_batch_end()

def test(model, device, testLoader):
    model.eval()
    testLoss = 0
    correct = 0
    # changed: torchmetrics metrics added
    precision = Precision(num_classes=10, average='weighted').to(device)
    recall = Recall(num_classes=10, average='weighted').to(device)
    specificity = Specificity(num_classes=10, average='weighted').to(device)
    f1_score = F1Score(num_classes=10, average='weighted').to(device)
    # auroc = AUROC(num_classes=10, average='weighted', compute_on_step=False).to(device)
    with torch.no_grad():
        for i, (data, target) in enumerate(testLoader, 0):
            data, target = data.to(device), target.to(device)
            output = model(data)
            # output = output.argmax(1)
            target = target[0][0]
            # testLoss += F.nll_loss(output, target, reduction='sum').item()  # sum up batch loss
            testLoss += loss_fn(output, target)
            pred = output.argmax(dim=1, keepdim=True)  # get the index of the max log-probability
            correct += pred.eq(target.view_as(pred)).sum().item()
            # acc.update(pred, target)
            precision.update(pred, target)
            recall.update(pred, target)
            specificity.update(pred, target)
            f1_score.update(pred, target)
            # auroc.update(pred, target)
    testLoss /= len(testLoader.dataset)
    print('\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
        testLoss, correct, len(testLoader.dataset),
        100. * correct / len(testLoader.dataset)))
    print("Accuracy: ", correct / len(testLoader.dataset))
    print("Precision: ", precision.compute())
    print("Recall: ", recall.compute())
    print("Specificity: ", specificity.compute())
    print("F1-score: ", f1_score.compute())
    # print("AUROC: ", auroc.compute())

def main():
    scratchDir = os.getenv('SCRATCH_DIR', '/platform/scratch')
    max_epochs = int(os.getenv('MAX_EPOCHS', str(default_max_epochs)))
    min_peers = int(os.getenv('MIN_PEERS', str(default_min_peers)))
    batchSz = 1
    model = Net().to(device)
    model_name = 'mnist_pyt'
    opt = optim.Adam(model.parameters())
    trainLoader = torch.utils.data.DataLoader(trainset, batch_size=batchSz)
    testLoader = torch.utils.data.DataLoader(testset, batch_size=batchSz)
    swarmCallback = SwarmCallback(syncFrequency=swSyncInterval,
                                  minPeers=min_peers,
                                  useAdaptiveSync=False,
                                  adsValData=testset,
                                  adsValBatchSize=batchSz,
                                  model=model)
    swarmCallback.on_train_begin()
    for epoch in range(1, max_epochs + 1):
        doTrainBatch(model, device, trainLoader, opt, epoch, swarmCallback)
        test(model, device, testLoader)
        swarmCallback.on_epoch_end(epoch)
    swarmCallback.on_train_end()
    saved_model_path = os.path.join(scratchDir, model_name, 'saved_model.pt')
    os.makedirs(scratchDir, exist_ok=True)
    os.makedirs(os.path.join(scratchDir, model_name), exist_ok=True)
    torch.save(model, saved_model_path)
    print('Saved the trained model!')

if __name__ == '__main__':
    main()
```
Below is the custom loader class.

```python
import torch
from torch.utils.data import Dataset, DataLoader
import pandas as pd
import os
from skimage import io, transform
import numpy as np

class Cifar10Loader(Dataset):
    def __init__(self, csv_path, dataset_path, transform=None):
        self.image_names = pd.read_csv(csv_path)
        self.data_path = dataset_path
        self.transform = transform

    def __len__(self):
        return len(self.image_names)

    def __getitem__(self, index):
        if torch.is_tensor(index):
            index = index.tolist()
        img_name = os.path.join(self.data_path, self.image_names.iloc[index, 0])
        img = io.imread(img_name)
        label = self.image_names.iloc[index, 1:]
        label = np.array([label])
        label = label.astype('float')
        label = torch.from_numpy(label)
        label = label.type(torch.LongTensor)
        if self.transform:
            img = self.transform(img)
        return img, label
```
Are you sure this code works fine without swarm learning?
Are you running it with a batch size of 1? I don't think it would learn well with a batch size of 1.
`batchSz = 1`
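As a quick local check, before any swarm containers are involved, it may help to inspect what the loader yields at a batch size greater than 1. A minimal sketch, assuming the `trainset` built in your script and a single label column in the CSV:

```python
# Local sanity check, assuming `trainset` from the training script above.
# With the current __getitem__, each label comes back with shape (1, 1)
# (np.array([label]) wraps it), so the default collate yields targets of
# shape (B, 1, 1). target[0][0] then always has length 1, regardless of B,
# which is why CrossEntropyLoss fails as soon as B != 1: the output batch
# is B but the target batch stays 1.
import torch

loader = torch.utils.data.DataLoader(trainset, batch_size=4)
images, targets = next(iter(loader))
print(images.shape)   # e.g. torch.Size([4, 3, 32, 32])
print(targets.shape)  # torch.Size([4, 1, 1]) -- not the torch.Size([4]) the loss expects
```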
Yes, it works on the local machine without swarm learning. However, it gives an error when the batch size is not 1. It also works with 2 epochs; the problem occurs when the number of epochs is increased. With a smaller network it runs up to almost 70 epochs, but with the network I have written here it gives the error while running epoch 3. I am surprised that it works for smaller epoch counts, perhaps only for a limited time.
The following situations suggest that local learning is not yet working well, and I recommend that these issues be resolved before using swarm learning (a sketch of a likely fix follows the list):
- it gives an error when the batch size is not 1;
- it works with 2 epochs, but the problem occurs when the number of epochs is increased;
- for the network you have written, it gives an error while running epoch 3.
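Judging from the code above, a likely culprit for the batch-size error is the label shape: `__getitem__` wraps each label in `np.array([label])`, which is why the training loop has to index `target[0][0]` and only works when the batch size is 1. A minimal sketch of a fix, assuming the second CSV column holds the integer class id:

```python
# Hypothetical replacement for Cifar10Loader.__getitem__: return a scalar
# class id per sample, so a batch of targets has shape (B,) and
# loss_fn(output, target) works for any batch size.
def __getitem__(self, index):
    if torch.is_tensor(index):
        index = index.tolist()
    img_name = os.path.join(self.data_path, self.image_names.iloc[index, 0])
    img = io.imread(img_name)
    label = int(self.image_names.iloc[index, 1])  # scalar instead of np.array([label])
    if self.transform:
        img = self.transform(img)
    return img, torch.tensor(label, dtype=torch.long)
```

With that change, the `target[0][0]` indexing in `doTrainBatch` and `test` can become plain `target`, and `batchSz` can be raised above 1.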
Okay. Thanks for your time and effort. I will inform you after I rectify the issues you mentioned. Thanks.
I have run the example with the default settings and the MNIST model (mnist_pyt) on the MNIST dataset. The only change was 'MAX_EPOCH=2' to 'MAX_EPOCH=100' in the 'swci/taskdef/swarm_mnist_task.yaml' file. It still gave the error, this time at epoch 95. The error (as in the first screenshot) is:

`swarmlearning.client.swarm.SwarmError: Sync Swarm call to SL container failed - wanted: 22 bytes, got 0 bytes: pipe closed?`
@h-ahmad, could you please run the same exercise after removing the last two RESETs in SWCI? I mean the 'SWCI 29 - RESET TASKRUNNER' and 'SWCI 33 - RESET CONTRACT' commands.
Thanks to both of you. You helped and resolved my issue. I am closing the issue.