h-ahmad commented 2 years ago

Issue description

issue description: Sync Swarm call to SL container failed
occurrence - consistent or rare: After some epochs, e.g 3 epochs.
error messages: Sync Swarm call to SL container failed - wanted: 22 bytes, got 0 bytes: pipe closed?
commands used for starting containers: swci
docker logs [APLS, SPIRE, SN, SL, SWCI]: SL log attached
OS and ML Platform
details of host OS: Ubuntu 20.04
details of ML platform used: Ubuntu 20.04 running example given here with changed network model and dataset.
details of Swarm learning Cluster (Number of machines, SL nodes, SN nodes): Default

Quick Checklist: Respond [Yes/No]

APLS server web GUI shows available Licenses? Yes
If Multiple systems are used, can each system access every other system? Yes
Is Password-less SSH configuration setup for all the systems? Yes
If GPU or other protected resources are used, does the account have sufficient privileges to access and use them? Yes
Is the user id a member of the docker group? Yes

Additional notes

Are you running the documented example without any modification? No. I am running a different model with different data. Training for 2 epochs completes successfully, but with a larger number of epochs it fails with the error shown in the attachment.
NOTE: Create an archive with supporting artifacts and attach to issue, whenever applicable.

iArpanPatel commented 2 years ago

@h-ahmad, please provide SWCI, SWOP, SL and user(ML) containers logs.

h-ahmad commented 2 years ago

SWCI log: swci

SWOP log: swop

SL log:

IMOKURI commented 2 years ago

The last image looks like an ML log, not an SL log.

Do you have the SL logs?

It looks like the ML container sent data to the SL container, but no response was returned from SL to ML.

Therefore, the SL logs may be the key to resolving the issue.
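As an illustration of how a message like "wanted: 22 bytes, got 0 bytes: pipe closed?" can arise, here is a minimal sketch of a length-checked read. It is not the Swarm Learning client code; `read_exact` and `demo` are hypothetical names, and a plain OS pipe stands in for whatever channel the ML and SL containers actually use. The point is only that a reader expecting a fixed-size reply sees zero bytes when the other side closes without answering.

```python
import os


def read_exact(fd: int, wanted: int) -> bytes:
    """Read exactly `wanted` bytes from fd, or raise if the peer closes early."""
    chunks = []
    got = 0
    while got < wanted:
        chunk = os.read(fd, wanted - got)
        if not chunk:  # b"" means EOF: the writer closed its end
            raise ConnectionError(
                f"wanted: {wanted} bytes, got {got} bytes: pipe closed?")
        chunks.append(chunk)
        got += len(chunk)
    return b"".join(chunks)


def demo() -> str:
    """Simulate a peer that goes away without sending its reply."""
    read_fd, write_fd = os.pipe()
    os.close(write_fd)  # the "SL side" closes without writing anything
    try:
        read_exact(read_fd, 22)
    except ConnectionError as exc:
        return str(exc)
    finally:
        os.close(read_fd)
    return "no error"


print(demo())
```

Under this reading, the error on the ML side is a symptom, and the reason the other side stopped responding has to be found in its own logs.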

h-ahmad commented 2 years ago

> The last image looks like an ML log, not an SL log.
>
> Do you have the SL logs?
>
> It looks like the ML container sent data to the SL container, but no response was returned from SL to ML.
>
> Therefore, the SL logs may be the key to resolving the issue.

I have captured the logs of all 7 containers taking part in the process. The screenshots were taken at the end of the output of the command ($docker logs #container_id --follow) when the error occurred. The screenshots follow the order of the numbered container list shown in the first image. [screenshot: main]

SN log:
swop log:
swci log:
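Since screenshots show only the tail of each log, a small script that saves the full text logs of every container may make them easier to attach. The following is a sketch under stated assumptions: the helper names and the `swarm_logs` output directory are hypothetical, and it relies only on the standard `docker ps --all --quiet` and `docker logs --timestamps` commands.

```python
import subprocess
from pathlib import Path
from typing import List


def logs_command(container_id: str) -> List[str]:
    """Build the argv for dumping one container's full log with timestamps."""
    return ["docker", "logs", "--timestamps", container_id]


def list_container_ids() -> List[str]:
    """Return the IDs of all containers, including stopped ones."""
    result = subprocess.run(
        ["docker", "ps", "--all", "--quiet"],
        capture_output=True, text=True, check=True)
    return result.stdout.split()


def save_all_logs(out_dir: str = "swarm_logs") -> None:
    """Write each container's stdout and stderr to <out_dir>/<id>.log."""
    target = Path(out_dir)
    target.mkdir(exist_ok=True)
    for container_id in list_container_ids():
        result = subprocess.run(
            logs_command(container_id), capture_output=True, text=True)
        # docker logs replays the container's stdout and stderr separately.
        (target / f"{container_id}.log").write_text(
            result.stdout + result.stderr)


# Usage (requires a running Docker daemon): save_all_logs()
```

The resulting directory can then be archived and attached to the issue, as the template's NOTE suggests.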

IMOKURI commented 2 years ago

To me, the logs alone do not seem to show the cause of the problem.

Also, I have not been able to reproduce this problem.

Could you give me details of what you have changed from the example?

h-ahmad commented 2 years ago

What I have changed in the existing code is:

The dataset changed from MNIST to CIFAR-10 which is placed in the local repository used with the custom loader.
The 'MAX_EPOCH=2' to 'MAX_EPOCH=10' in the 'swci/taskdef/swarm_mnist_task.yaml' file.
Evaluation metrics from torchmetrics.classification added.
Network module is changed
Loss function changed from negative log-likelihood to cross-entropy loss.

The rest of the code is the same as in the example. The full script is given below.

    import os
    from swarmlearning.pyt import SwarmCallback
    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    import torch.optim as optim
    from torchmetrics.classification import Accuracy, Precision, Recall, Specificity, F1Score, AUROC, AUC, ConfusionMatrix
    from custom1.data_loader import Cifar10Loader
    import torchvision.transforms as transforms

    transform = transforms.Compose([transforms.ToTensor(),transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])
    directory = os.getenv('DATA_DIR', '/platform/scratch')
    csv_path = os.path.join(directory, 'labels1.csv')
    data_path = os.path.join(directory, 'train1')
    trainset = Cifar10Loader(csv_path, data_path, transform)
    test_csv = os.path.join(directory, 'test.csv')
    test_data = os.path.join(directory, 'test')
    testset = Cifar10Loader(test_csv, test_data, transform)
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    default_max_epochs = 2
    default_min_peers = 2
    trainPrint = True
    swSyncInterval = 128
    loss_fn = torch.nn.CrossEntropyLoss(weight=None, size_average=None, ignore_index=-100, reduce=None, reduction='mean')

    class Net(nn.Module):
        def __init__(self):
            super().__init__()
            self.conv1 = nn.Conv2d(3, 6, 5)
            self.pool = nn.MaxPool2d(2, 2)
            self.conv2 = nn.Conv2d(6, 16, 5)
            self.fc1 = nn.Linear(16 * 5 * 5, 120)
            self.fc2 = nn.Linear(120, 84)
            self.fc3 = nn.Linear(84, 10)

        def forward(self, x):
            x = self.pool(F.relu(self.conv1(x)))
            x = self.pool(F.relu(self.conv2(x)))
            x = torch.flatten(x, 1) # flatten all dimensions except batch
            x = F.relu(self.fc1(x))
            x = F.relu(self.fc2(x))
            x = self.fc3(x)
            return x

    def doTrainBatch(model,device,trainLoader,optimizer,epoch,swarmCallback):
        model.train()
        for batchIdx, (data, target) in enumerate(trainLoader, 0):
            data, target = data.to(device), target.to(device)
            optimizer.zero_grad()
            output = model(data)
            # output = output.argmax(1)
            #loss = F.nll_loss(output, target[0][0])
            loss = loss_fn(output, target[0][0])
            loss.backward()
            optimizer.step()
            if trainPrint and batchIdx % 100 == 0:
                print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                      epoch, batchIdx * len(data), len(trainLoader.dataset),
                      100. * batchIdx / len(trainLoader), loss.item()))
            if swarmCallback is not None:
                swarmCallback.on_batch_end()

    def test(model, device, testLoader):
        model.eval()
        testLoss = 0
        correct = 0
        # acc = Accuracy(num_classes=13, average='weighted').to(device)
        precision = Precision(num_classes=10, average='weighted').to(device)
        recall = Recall(num_classes=10, average='weighted').to(device)
        specificity = Specificity(num_classes=10, average='weighted').to(device)
        f1_score = F1Score(num_classes=10, average='weighted').to(device)
        #auroc = AUROC(num_classes=10, average='weighted', compute_on_step=False).to(device)
        with torch.no_grad():
            for i, (data, target) in enumerate(testLoader, 0):
                data, target = data.to(device), target.to(device)
                output = model(data)
                #output = output.argmax(1)
                target = target[0][0]
                #testLoss += F.nll_loss(output, target, reduction='sum').item()  # sum up batch loss
                testLoss += loss_fn(output, target)
                pred = output.argmax(dim=1, keepdim=True)  # get the index of the max log-probability
                correct += pred.eq(target.view_as(pred)).sum().item()
                #acc.update(pred, target)
                precision.update(pred, target)
                recall.update(pred, target)
                specificity.update(pred, target)
                f1_score.update(pred, target)
                #auroc.update(pred, target)

        testLoss /= len(testLoader.dataset)

        print('\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
            testLoss, correct, len(testLoader.dataset),
            100. * correct / len(testLoader.dataset)))
        print("Accuracy: ", correct / len(testLoader.dataset))
        print("Precision: ", precision.compute())
        print("Precision: ", precision.compute())
        print("Recall: ", recall.compute())
        print("Specifity: ", specificity.compute())
        print("F1-score: ", f1_score.compute())
        #print("AUROC: ", auroc.compute())

    def main():
        scratchDir = os.getenv('SCRATCH_DIR', '/platform/scratch')
        max_epochs = int(os.getenv('MAX_EPOCHS', str(default_max_epochs)))
        min_peers = int(os.getenv('MIN_PEERS', str(default_min_peers)))
        batchSz = 1
        model = Net().to(device)
        model_name = 'mnist_pyt'
        opt = optim.Adam(model.parameters())
        trainLoader = torch.utils.data.DataLoader(trainset,batch_size=batchSz)
        testLoader = torch.utils.data.DataLoader(testset,batch_size=batchSz)

        swarmCallback = None
        swarmCallback = SwarmCallback(syncFrequency=swSyncInterval,
                                      minPeers=min_peers,
                                      useAdaptiveSync=False,
                                      adsValData=testset,
                                      adsValBatchSize=batchSz,
                                      model=model)
        swarmCallback.on_train_begin()

        for epoch in range(1, max_epochs + 1):
            doTrainBatch(model,device,trainLoader,opt,epoch,swarmCallback)
            test(model,device,testLoader)
            swarmCallback.on_epoch_end(epoch)

        swarmCallback.on_train_end()

        saved_model_path = os.path.join(scratchDir, model_name, 'saved_model.pt')
        os.makedirs(scratchDir, exist_ok=True)
        os.makedirs(os.path.join(scratchDir, model_name), exist_ok=True)
        torch.save(model, saved_model_path)
        print('Saved the trained model!')

    if __name__ == '__main__':
        main()

h-ahmad commented 2 years ago

Below is the custom loader class.

    import torch
    from torch.utils.data import Dataset, DataLoader
    import pandas as pd
    import os
    from skimage import io, transform
    import numpy as np

    class Cifar10Loader(Dataset):
        def __init__(self, csv_path, dataset_path, transform=None):
            self.image_names = pd.read_csv(csv_path)
            self.data_path = dataset_path
            self.transform = transform

        def __len__(self):
            return len(self.image_names)

        def __getitem__(self, index):
            if torch.is_tensor(index):
                index = index.tolist()
            img_name = os.path.join(self.data_path, self.image_names.iloc[index, 0])
            img = io.imread(img_name)
            label = self.image_names.iloc[index, 1:]
            label = np.array([label])
            label = label.astype('float')
            label = torch.from_numpy(label)
            label = label.type(torch.LongTensor)
            if self.transform:
                img = self.transform(img)
            return img, label

IMOKURI commented 2 years ago

Are you sure this code works fine without swarm learning?

Are you running it with a batch size of 1? I don't think it would learn well with a batch size of 1.

batchSz = 1

h-ahmad commented 2 years ago

Yes. It works on a local machine without swarm learning. However, it gives an error when the batch size is not 1. Moreover, it works with 2 epochs; the problem occurs when the number of epochs is increased. If the network is small, it runs for almost 70 epochs. But for the network I have written, it gives an error while running epoch 3. I am surprised that it works for a small number of epochs, and perhaps only for a limited time.

IMOKURI commented 2 years ago

The following situations do not seem to indicate that local learning is working well, and I recommend that these issues be resolved before using swarm learning.

> it gives error when batch size is not 1.
>
> it is working with epoch as 2. The problem occurs when increased number of epochs.
>
> for this network, I have written, it gives error while running epoch 3.
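One possible explanation for the batch-size error, offered as an assumption rather than a confirmed diagnosis: if the CSV has a single label column, the loader returns each label with shape (1, 1), so a batch of labels has shape (batch, 1, 1), and `target[0][0]` keeps only the first sample's label. The sketch below uses plain nested lists in place of tensors so it runs without PyTorch; the function names are hypothetical.

```python
def first_label_only(batch_labels):
    """Mimics `target[0][0]`: returns only the first sample's label row."""
    return batch_labels[0][0]


def one_label_per_sample(batch_labels):
    """Flattens a (batch, 1, 1) nested list to (batch,)."""
    return [sample[0][0] for sample in batch_labels]


# Four samples, each label shaped (1, 1) as the loader is assumed to return it.
batch = [[[3]], [[5]], [[1]], [[7]]]

print(len(first_label_only(batch)))      # 1 label for a batch of 4 predictions
print(len(one_label_per_sample(batch)))  # 4 labels, one per prediction
```

With a batch size of 1 the two functions agree, which would be consistent with the script running only in that setting. In PyTorch, the equivalent of the second function would be something like `target.view(-1)`, again assuming one label column.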

h-ahmad commented 2 years ago

> The following situations do not seem to indicate that local learning is working well, and I recommend that these issues be resolved before using swarm learning.
>
> it gives error when batch size is not 1.
>
> it is working with epoch as 2. The problem occurs when increased number of epochs.
>
> for this network, I have written, it gives error while running epoch 3.

Okay. Thanks for your time and effort. I will inform you after I rectify the issues you mentioned. Thanks.

h-ahmad commented 2 years ago

> The following situations do not seem to indicate that local learning is working well, and I recommend that these issues be resolved before using swarm learning.
>
> it gives error when batch size is not 1.
>
> it is working with epoch as 2. The problem occurs when increased number of epochs.
>
> for this network, I have written, it gives error while running epoch 3.

I have run the example with the default settings, using the MNIST model (mnist_pyt) and the MNIST dataset. The only change was 'MAX_EPOCH=2' to 'MAX_EPOCH=100' in the 'swci/taskdef/swarm_mnist_task.yaml' file. It still failed, at epoch 95, with the same error as in the first screenshot:

swarmlearning.client.swarm.SwarmError: Sync Swarm call to SL container failed - wanted: 22 bytes, got 0 bytes: pipe closed?

RadhakrishnaJ commented 2 years ago

@h-ahmad, could you please run the same exercise after removing the last two RESET commands in SWCI? I mean the RESET TASKRUNNER command (SWCI 29) and the RESET CONTRACT command (SWCI 33).

h-ahmad commented 2 years ago

> @h-ahmad, could you please run the same exercise after removing the last two RESET commands in SWCI? I mean the RESET TASKRUNNER command (SWCI 29) and the RESET CONTRACT command (SWCI 33).

Thanks to both of you. You helped and resolved my issue. I am closing the issue.

HewlettPackard / swarm-learning

Sync Swarm call to SL container failed #107
