artyom-beilis / pytorch_dlprim

DLPrimitives/OpenCL out of tree backend for pytorch
http://blog.dlprimitives.org/
MIT License

Occasional Memory Access Fault (GPU) #28

Closed: arch-user-france1 closed this issue 7 months ago

arch-user-france1 commented 1 year ago

Hello

I seem to have stumbled over a bug in the program: at first everything ran fine, but while I was fiddling with some parameters and restarting the training multiple times, the code suddenly crashed.

Here's what happened:

INFO:root:Using the privateuseone:0 backend.
Accessing device #0:gfx1100 on AMD Accelerated Parallel Processing
INFO:root:TRAINING STARTING
Epoch 0
Epoch 17] - 99.9%  STEP: [HISTORY]
Memory access fault by GPU node-1 (Agent handle: 0x5d8d620) on address 0x7f810481d000. Reason: Page not present or supervisor privilege.
[1]    18956 IOT instruction (core dumped)  /home/france1/ZFS/AI/.conda/bin/python /home/france1/ZFS/AI/GAN/main.py

It crashed after the first epoch, probably in the middle or at the start of it.

The code was training a DCGAN network; see the architecture:

    import torch.nn as nn

    # DCGAN generator: maps a (coding_sz, 1, 1) latent code to a 64x64 single-channel image.
    class Generator(nn.Module):
        def __init__(self, coding_sz) -> None:
            super().__init__()

            self.net = nn.Sequential(
                nn.ConvTranspose2d(coding_sz, 1024, 4, 1, 0),
                nn.BatchNorm2d(1024),
                nn.ReLU(),
                nn.ConvTranspose2d(1024, 512, 4, 2, 1),

                nn.BatchNorm2d(512),
                nn.ReLU(),
                nn.ConvTranspose2d(512, 256, 4, 2, 1),

                nn.BatchNorm2d(256),
                nn.ReLU(),
                nn.ConvTranspose2d(256, 128, 4, 2, 1),

                nn.BatchNorm2d(128),
                nn.ReLU(),
                nn.ConvTranspose2d(128, 1, 4, 2, 1),

                nn.Tanh()
            )

        def forward(self, input):
            return self.net(input)

    # DCGAN discriminator: classifies a 64x64 single-channel image as real (1) or fake (0).
    class Discriminator(nn.Module):
        def __init__(self, coding_sz) -> None:
            super().__init__()

            self.net = nn.Sequential(
                nn.Conv2d(1, 128, 4, 2, 1),
                nn.LeakyReLU(0.2),
                nn.Conv2d(128, 256, 4, 2, 1),
                nn.BatchNorm2d(256),
                nn.LeakyReLU(0.2),
                nn.Conv2d(256, 512, 4, 2, 1),
                nn.BatchNorm2d(512),
                nn.LeakyReLU(0.2),
                nn.Conv2d(512, 1024, 4, 2, 1),
                nn.BatchNorm2d(1024),
                nn.LeakyReLU(0.2),
                nn.Conv2d(1024, 1, 4, 1, 0),
                nn.Sigmoid()
            )

        def forward(self, input):
            return self.net(input)

    netG = Generator(CODING_SIZE).to(device)
    netD = Discriminator(CODING_SIZE).to(device)

    # Standard DCGAN initialization: N(0, 0.02) for conv weights, N(1, 0.02) for batch-norm scales.
    def weights_init(m):
        classname = m.__class__.__name__
        if classname.find('Conv') != -1:
            nn.init.normal_(m.weight, 0.0, 0.02)
        elif classname.find('BatchNorm') != -1:
            nn.init.normal_(m.weight, 1.0, 0.02)
            nn.init.constant_(m.bias.data, 0)

    netG.apply(weights_init)
    netD.apply(weights_init)
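
For context, here is a minimal sketch of how this generator/discriminator pair is typically driven from the `privateuseone:0` device that appears in the log. The library path, `CODING_SIZE`, the optimizer settings, and the training step are assumptions for illustration, not code from the report:

    import torch
    import torch.nn as nn

    # Assumption: the dlprimitives backend is loaded as an out-of-tree extension,
    # which is why the device shows up as "privateuseone:0" in the log above.
    torch.ops.load_library("/path/to/libpt_ocl.so")  # hypothetical, installation-specific path
    device = torch.device("privateuseone:0")

    CODING_SIZE = 100  # hypothetical latent size
    netG = Generator(CODING_SIZE).to(device)
    netD = Discriminator(CODING_SIZE).to(device)

    criterion = nn.BCELoss()
    optG = torch.optim.Adam(netG.parameters(), lr=2e-4, betas=(0.5, 0.999))
    optD = torch.optim.Adam(netD.parameters(), lr=2e-4, betas=(0.5, 0.999))

    def train_step(real):
        # real: a (B, 1, 64, 64) batch already moved to `device`
        b = real.size(0)
        ones = torch.ones(b).to(device)
        zeros = torch.zeros(b).to(device)
        noise = torch.randn(b, CODING_SIZE, 1, 1).to(device)
        fake = netG(noise)

        # Discriminator update: real -> 1, fake -> 0
        optD.zero_grad()
        loss_d = criterion(netD(real).view(-1), ones) + \
                 criterion(netD(fake.detach()).view(-1), zeros)
        loss_d.backward()
        optD.step()

        # Generator update: push the discriminator towards 1 on generated images
        optG.zero_grad()
        loss_g = criterion(netD(fake).view(-1), ones)
        loss_g.backward()
        optG.step()
        return loss_d.item(), loss_g.item()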

I guess this will be hard to debug, especially since I have not provided much further information. If you would like me to try something to track down the bug, I would be happy to help.

It could just as well be a problem with the underlying driver, because I have seen the screen output crash abruptly and other things going wrong. I doubt the card arrived broken, because it runs fine on Windows.

Happy working and have a great time

artyom-beilis commented 1 year ago

Have you tried reducing memory consumption? Reduce the batch size.

IIRC these issues sometimes happen when no memory is left; the AMD driver generally allows over-committing memory, so you may think you are using GPU memory while the driver is actually paging it to the CPU.

I would start from that.
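
To make that concrete, a smaller batch size is just a `DataLoader` change; the dataset object and numbers below are hypothetical:

    from torch.utils.data import DataLoader

    # Dropping from e.g. 128 to 32 lowers peak GPU memory per step, which helps rule out
    # the driver over-committing VRAM and paging GPU memory to host RAM.
    loader = DataLoader(train_dataset, batch_size=32, shuffle=True, drop_last=True)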

arch-user-france1 commented 1 year ago

Hmm, no, I am sure it was not memory. I've got 40 GB and not more than 4 GB were used. Also consider that it came up pretty randomly, and now I've been training with a large batch size of 128 without any such issue.

But while it was training it just suddenly got stuck: the Python process would not finish (ugh, while writing this from another device the display output got messed up again and everything crashed), and even after `killall python` the GPU remained at 100%.


[screenshot: corrupted display output]

This is what just happened when I tried logging into the crashed system.

arch-user-france1 commented 7 months ago

This turned out to be an AMD driver issue.