Wilsker / tdsm_encoder

Transformer encoder time-dependent score model for generative modelling

Out of Memory issue #18

Open tihsu99 opened 11 months ago

tihsu99 commented 11 months ago

The training section is shown below; it is roughly the same as the original code, but with an accumulated loss. Here batch_size refers to the target (effective) batch size, not the batch size actually used by the data loader. A GPU out-of-memory error occurs when running it.

        for i, (shower_data, incident_energies) in enumerate(shower_loader_train, 0):
            # Move model to device and set dtype as same as data (note torch.double works on both CPU and GPU)
            model.to(device, shower_data.dtype)
            model.train()
            shower_data = shower_data.to(device)
            incident_energies = incident_energies.to(device)

            if len(shower_data) < 1:
                print('Very few hits in shower: ', len(shower_data))
                continue
            # Zero any gradients from previous steps
            optimiser.zero_grad()
            # Loss average for each batch
            loss = score_model.loss_fn(model, shower_data, incident_energies, marginal_prob_std_fn, padding_value, device=device)
            # Accumulate batch loss per epoch
            cumulative_epoch_loss += float(loss)

            print(len(shower_data))
            batch_loss += loss
            batch_accumulate += len(shower_data)
            print(i, batch_accumulate, torch.cuda.memory_allocated(device))
            if batch_accumulate >= batch_size:
                # collect dL/dx for any parameters (x) which have requires_grad = True via: x.grad += dL/dx
                batch_loss.backward()
                batch_loss = 0
                batch_accumulate = 0
                # Update value of x += -lr * x.grad
                optimiser.step()
                torch.cuda.empty_cache()
                torch.no_grad()
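
For context on where the memory goes: batch_loss += loss keeps the computation graph of every accumulated micro-batch alive until batch_loss.backward() runs, so GPU memory grows with the number of micro-batches per target batch. A common alternative is to call backward() on each micro-batch loss immediately (freeing its graph) and only step the optimiser once the target batch size is reached. The following is a minimal sketch, not the repo's actual training loop: the score_model.loss_fn call is copied from the snippet above, and scaling the loss by len(shower_data) / batch_size is an assumption that loss_fn returns a per-micro-batch mean.

    import torch

    def train_accumulated(model, shower_loader_train, optimiser, score_model,
                          marginal_prob_std_fn, padding_value, batch_size, device):
        # Gradient accumulation that frees each micro-batch's graph right away.
        model.to(device)
        model.train()
        cumulative_epoch_loss = 0.
        batch_accumulate = 0
        optimiser.zero_grad()
        for shower_data, incident_energies in shower_loader_train:
            shower_data = shower_data.to(device)
            incident_energies = incident_energies.to(device)
            if len(shower_data) < 1:
                continue
            loss = score_model.loss_fn(model, shower_data, incident_energies,
                                       marginal_prob_std_fn, padding_value, device=device)
            # Scale so the summed gradients approximate the mean over the target batch
            # (assumes loss_fn returns a per-micro-batch mean).
            (loss * len(shower_data) / batch_size).backward()
            cumulative_epoch_loss += loss.item()  # item() detaches; no graph is retained
            batch_accumulate += len(shower_data)
            if batch_accumulate >= batch_size:
                optimiser.step()       # apply accumulated gradients
                optimiser.zero_grad()  # clear grads for the next accumulation window
                batch_accumulate = 0
        return cumulative_epoch_loss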
tihsu99 commented 11 months ago

The error is reproduced with notebook/5_training.ipynb.