KatherLab / swarm-learning-hpe

Experimental repo for Odelia project based on HPE platform. This repo contains multiple models for histopathology and radiology training.
MIT License

Callback on batchend runtime error #15

Closed Ultimate-Storm closed 1 year ago

Ultimate-Storm commented 1 year ago

only_on_batchend.log

Problem encountered when calling swarmCallback.on_batch_end() in:

    def training_step(self, batch, batch_idx):
        x, y = batch['source'], batch['target']
        y_hat = self(x)
        loss = F.cross_entropy(y_hat, y)
        swarmCallback.on_batch_end()  # merges weights in place, before Lightning runs backward
        return loss

The model is a ResNet built on pytorch_lightning. The error most likely occurs when the swarm merge writes the averaged weights back into the parameter tensors in place, which PyTorch's autograd does not allow between the forward and backward pass. Might be related to issue #14.
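To illustrate why the in-place write-back fails, here is a minimal pure-Python sketch (not real torch code) of autograd's version-counter check: each tensor carries a version that in-place ops bump, the forward pass snapshots that version for tensors saved for backward, and backward refuses to run if the version has changed in the meantime.

```python
class Tensor:
    """Toy stand-in for a torch tensor with autograd's version counter."""
    def __init__(self):
        self._version = 0

    def add_(self, *_):
        # In-place ops bump the version counter, like torch's add_().
        self._version += 1


class SavedTensor:
    """Mimics how autograd snapshots a tensor's version during forward."""
    def __init__(self, t):
        self.tensor = t
        self.saved_version = t._version

    def unpack(self):
        # Backward re-checks the version and raises on a mismatch,
        # producing the "is at version N; expected version M" error.
        if self.tensor._version != self.saved_version:
            raise RuntimeError(
                "one of the variables needed for gradient computation has "
                f"been modified by an inplace operation: is at version "
                f"{self.tensor._version}; expected version "
                f"{self.saved_version} instead."
            )
        return self.tensor


w = Tensor()
saved = SavedTensor(w)   # forward pass saves w for the backward pass
w.add_(1)                # swarm merge overwrites the weights in place
try:
    saved.unpack()       # backward now detects the version mismatch
except RuntimeError as e:
    print("backward fails:", e)
```

Calling `swarmCallback.on_batch_end()` inside `training_step()` mutates the weights after forward but before backward, which is exactly the mismatch this check guards against.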

Ultimate-Storm commented 1 year ago

Question: I am trying to adapt your code to swarm learning, which requires invoking a swarm callback at the end of each batch. During this call, the weights trained on the distributed nodes are averaged and shared, and I assume the merged tensors are written back with an in-place operation. When I added the callback to your model's training_step(), I got this error:

    RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [512, 2]], which is output 0 of AsStridedBackward0, is at version 23; expected version 22 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

Do you have any idea how to make PyTorch allow modifying these variables?

Ultimate-Storm commented 1 year ago

Fixed by invoking the callback elsewhere in the training loop.
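The issue does not say exactly where the callback was moved, but a plausible placement is after the optimizer step, once backward has consumed the saved tensors. A minimal pure-Python sketch of the loop ordering (the function names and `on_train_batch_end` hook placement here are illustrative; the real call is `swarmCallback.on_batch_end()`):

```python
events = []

def training_step(batch):
    # Forward pass: autograd saves activations/weights for backward here.
    events.append("forward")
    return "loss"

def backward(loss):
    # Backward consumes the tensors saved during forward; they must not
    # have been modified in place since the forward pass.
    events.append("backward")

def optimizer_step():
    events.append("optimizer_step")

def on_train_batch_end():
    # Safe point for the swarm merge: backward is done, so overwriting
    # the weights in place no longer invalidates any saved tensor.
    events.append("swarm_merge")

for batch in range(1):
    loss = training_step(batch)
    backward(loss)
    optimizer_step()
    on_train_batch_end()

print(events)
```

Placing the merge in a hook that runs after the optimizer step (such as Lightning's `on_train_batch_end`) avoids the version-counter error, because no autograd graph is alive at that point.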