Wilsker / tdsm_encoder

Transformer encoder time-dependent score model for generative modelling

Out of Memory issue #18

Open tihsu99 opened 11 months ago

tihsu99 commented 11 months ago

The training section is shown below; it is roughly the same as the original code, but with an accumulated loss. Here batch_size refers to the target (effective) batch size, not the batch size actually used by the data loader. A GPU out-of-memory error occurs when running it.

        for i, (shower_data, incident_energies) in enumerate(shower_loader_train, 0):
            # Move model to device and set dtype as same as data (note torch.double works on both CPU and GPU)
            model.to(device, shower_data.dtype)
            model.train()
            shower_data = shower_data.to(device)
            incident_energies = incident_energies.to(device)

            if len(shower_data) < 1:
                print('Very few hits in shower: ', len(shower_data))
                continue
            # Zero any gradients from previous steps
            optimiser.zero_grad()
            # Loss average for each batch
            loss = score_model.loss_fn(model, shower_data, incident_energies, marginal_prob_std_fn, padding_value, device=device)
            # Accumulate batch loss per epoch
            cumulative_epoch_loss += float(loss)

            print(len(shower_data))
            batch_loss += loss
            batch_accumulate += len(shower_data)
            print(i, batch_accumulate, torch.cuda.memory_allocated(device))
            if batch_accumulate >= batch_size:
                # collect dL/dx for any parameters (x) which have requires_grad = True via: x.grad += dL/dx
                batch_loss.backward()
                batch_loss = 0
                batch_accumulate = 0
                # Update value of x += -lr * x.grad
                optimiser.step()
                torch.cuda.empty_cache()
                torch.no_grad()
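
For context on where the memory goes: batch_loss += loss keeps the computation graph of every accumulated micro-batch alive until batch_loss.backward() runs, so GPU memory grows with the number of micro-batches per target batch. A common alternative is to call backward() on each micro-batch loss immediately (freeing its graph) and only step the optimiser once the target batch size is reached. The following is a minimal sketch, not the repo's actual training loop: the score_model.loss_fn call is copied from the snippet above, and scaling the loss by len(shower_data) / batch_size is an assumption that loss_fn returns a per-micro-batch mean.

    import torch

    def train_accumulated(model, shower_loader_train, optimiser, score_model,
                          marginal_prob_std_fn, padding_value, batch_size, device):
        # Gradient accumulation that frees each micro-batch's graph right away.
        model.to(device)
        model.train()
        cumulative_epoch_loss = 0.
        batch_accumulate = 0
        optimiser.zero_grad()
        for shower_data, incident_energies in shower_loader_train:
            shower_data = shower_data.to(device)
            incident_energies = incident_energies.to(device)
            if len(shower_data) < 1:
                continue
            loss = score_model.loss_fn(model, shower_data, incident_energies,
                                       marginal_prob_std_fn, padding_value, device=device)
            # Scale so the summed gradients approximate the mean over the target batch
            # (assumes loss_fn returns a per-micro-batch mean).
            (loss * len(shower_data) / batch_size).backward()
            cumulative_epoch_loss += loss.item()  # item() detaches; no graph is retained
            batch_accumulate += len(shower_data)
            if batch_accumulate >= batch_size:
                optimiser.step()       # apply accumulated gradients
                optimiser.zero_grad()  # clear grads for the next accumulation window
                batch_accumulate = 0
        return cumulative_epoch_loss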
tihsu99 commented 11 months ago

The error is reproduced with notebook/5_training.ipynb.