Training stuck on the first epoch

Tramac / Fast-SCNN-pytorch

A PyTorch Implementation of Fast-SCNN: Fast Semantic Segmentation Network

Apache License 2.0


Training stuck on the first epoch #44

Open mertmerci opened 3 years ago

mertmerci commented 3 years ago

I'm trying to run the training on two GTX 1080Ti GPUs, but it gets stuck on the first epoch. Is this due to some post-processing operations? How can I solve it? [Screenshot 2020-12-24 at 10 54 12]

Tramac commented 3 years ago

It looks like the forward pass has not run yet. You can add logs here.
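
A minimal sketch of the kind of logging that can narrow down where a run hangs, using only the standard library. The logger name `train_debug`, the helper `log_stage`, and the stage names are hypothetical and not part of this repository; in train.py the stand-in line would be the real call (for example the model forward pass).

```python
import logging
import time
from contextlib import contextmanager

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
log = logging.getLogger("train_debug")  # hypothetical logger name


@contextmanager
def log_stage(name):
    """Log when a stage starts and how long it took to finish."""
    log.info("start %s", name)
    start = time.perf_counter()
    try:
        yield
    finally:
        log.info("done %s in %.3fs", name, time.perf_counter() - start)


completed = []
for stage in ("forward", "loss", "backward"):
    with log_stage(stage):
        # Stand-in for the real work of this stage.
        completed.append(stage)
```

If a stage hangs, its "start" message is the last line in the log and no matching "done" message follows.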

mertmerci commented 3 years ago

I tried to kill the process at first, but I could not kill it. I closed the terminal, added some log messages to the lines you mentioned, and reran the code. Now it does not even output the 'Starting Epoch...' line. [Screenshot 2020-12-24 at 11 31 44]

Tramac commented 3 years ago

Do you have WeChat?

mertmerci commented 3 years ago

I use Telegram and Whatsapp but I can download it if you want right now.

Tramac commented 3 years ago

Please run python cityscapes.py directly to verify whether the dataloader works normally.

mertmerci commented 3 years ago

I ran cityscapes.py and got the following error, even though I have the datasets downloaded. [Screenshot 2020-12-24 at 13 47 29]

Tramac commented 3 years ago

Make sure ./datasets/citys is the path to your Cityscapes dataset.
Debug the function _get_city_pairs step by step.

I think the folder structure (datasets) does not match the data-reading code.
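
A rough way to check the folder structure before debugging the loader itself. This sketch assumes the standard Cityscapes layout (leftImg8bit/<split>/<city>/*_leftImg8bit.png with masks in gtFine/<split>/<city>/*_gtFine_labelIds.png); the helper `find_city_pairs` is hypothetical and is not the repository's _get_city_pairs, which may pair files differently.

```python
import os


def find_city_pairs(root, split="train"):
    """Return (image, mask) path pairs and a list of masks that are missing."""
    img_dir = os.path.join(root, "leftImg8bit", split)
    mask_dir = os.path.join(root, "gtFine", split)
    pairs, missing = [], []
    for dirpath, _, filenames in os.walk(img_dir):
        city = os.path.basename(dirpath)  # assumes one sub-folder per city
        for name in sorted(filenames):
            if not name.endswith("_leftImg8bit.png"):
                continue
            # Assumed naming convention: swap the image suffix for the label suffix.
            mask_name = name.replace("leftImg8bit", "gtFine_labelIds")
            img_path = os.path.join(dirpath, name)
            mask_path = os.path.join(mask_dir, city, mask_name)
            if os.path.isfile(mask_path):
                pairs.append((img_path, mask_path))
            else:
                missing.append(mask_path)
    return pairs, missing


if __name__ == "__main__":
    pairs, missing = find_city_pairs("./datasets/citys", "train")
    print(f"{len(pairs)} pairs found, {len(missing)} masks missing")
    for path in missing[:5]:
        print("missing:", path)
```

Zero pairs usually means the root path is wrong or the images are not inside per-city folders; a non-empty missing list points at the exact mask paths the pairing expected.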

mertmerci commented 3 years ago

The paths to the datasets are ./datasets/citys/leftImg8bit and ./datasets/citys/gtFine, and each has its own test, train and validation folders inside. How should they be arranged?

mertmerci commented 3 years ago

Should I move every image out of the city folders in ./datasets/citys/leftImg8bit/train?

Tramac commented 3 years ago

It is not necessary, but you need to modify this code to match your data folder.

mertmerci commented 3 years ago

cityscapes.py now works without any problems. I made a small mistake in my previous attempt, so here is the correct output. [Screenshot 2020-12-24 at 15 39 13]

Tramac commented 3 years ago

Please verify that the dataloader is running normally.

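# Debug run: only iterate the dataloader and print batch shapes.
# The LR schedule and the training step below are disabled on purpose.
for i, (images, targets) in enumerate(self.train_loader):
    # cur_lr = self.lr_scheduler(cur_iters)
    # for param_group in self.optimizer.param_groups:
    #     param_group['lr'] = cur_lr

    images = images.to(self.args.device)
    targets = targets.to(self.args.device)
    print(images.shape)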

    """
    outputs = self.model(images)
    loss = self.criterion(outputs, targets)

    self.optimizer.zero_grad()
    loss.backward()
    self.optimizer.step()

    cur_iters += 1
    if cur_iters % 10 == 0:
        print('Epoch: [%2d/%2d] Iter [%4d/%4d] || Time: %4.4f sec || lr: %.8f || Loss: %.4f' % (
                  epoch, args.epochs, i + 1, len(self.train_loader),
                  time.time() - start_time, cur_lr, loss.item()))
    """

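The same check can be sketched as a standalone helper that pulls a few batches and reports how long each one took to arrive, which helps tell a slow or stuck loader from a stuck forward pass. The name `check_loader` and the stand-in `fake_loader` are hypothetical; in practice a real loader yielding (images, targets) batches would be passed in.

```python
import time


def check_loader(loader, max_batches=3):
    """Pull a few batches and report the wait time for each one."""
    timings = []
    start = time.perf_counter()
    for i, (images, targets) in enumerate(loader):
        elapsed = time.perf_counter() - start
        # Tensors expose .shape; plain Python stand-ins report None here.
        shape = getattr(images, "shape", None)
        print(f"batch {i}: images shape={shape}, waited {elapsed:.3f}s", flush=True)
        timings.append(elapsed)
        if i + 1 >= max_batches:
            break
        start = time.perf_counter()
    return timings


# Stand-in for a real loader: five (images, targets) pairs of plain lists.
fake_loader = [([0] * 4, [1] * 4) for _ in range(5)]
timings = check_loader(fake_loader)
```

If the first batch never prints, the hang is in the data pipeline rather than in the model.
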
mertmerci commented 3 years ago

When I make the mentioned changes, I still get the result I attached here. The thing is, I face these problems only when I try to run train.py on the GPU. On the CPU, it runs normally.

I tried to kill the process at first, but I could not kill it. I closed the terminal, added some log messages to the lines you mentioned, and reran the code. Now it does not even output the 'Starting Epoch...' line.

Tramac commented 3 years ago

What if you only use a single GPU?

mertmerci commented 3 years ago

The training started when I used only one GPU.
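
One common way to restrict a run to a single GPU without touching the training code is the CUDA_VISIBLE_DEVICES environment variable. The sketch below assumes device index 0 and a hypothetical helper name `restrict_to_single_gpu`; the thread does not say how the single-GPU run was actually configured.

```python
import os


def restrict_to_single_gpu(index=0):
    """Expose only one GPU to CUDA. Must run before CUDA is initialized."""
    os.environ["CUDA_VISIBLE_DEVICES"] = str(index)
    return os.environ["CUDA_VISIBLE_DEVICES"]


visible = restrict_to_single_gpu(0)
print("CUDA_VISIBLE_DEVICES =", visible)
# Import torch only after this point so that it sees a single device.
```

The same effect can be had from the shell by setting the variable on the command line that launches train.py.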

mertmerci commented 3 years ago

However, the training is really slow right now. I'm using a GTX 1080 X, and the training still takes around 4 days.