mertmerci opened this issue 3 years ago
It looks like the forward calculation has not been performed yet; you can add logs here.
I tried to kill the process at first, but I couldn't. I closed the terminal, added some log messages to the lines you mentioned, and reran the code. Now it does not even output the 'Starting Epoch...' line.
Do you have WeChat?
I use Telegram and WhatsApp, but I can download it right now if you want.
Please run python cityscapes.py directly to verify whether the dataloader works normally.
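If cityscapes.py does not already print something when run directly, a minimal check along these lines would show whether the image/mask pairs are found (the CitySegmentation class name and its root/split arguments are assumptions here; adjust them to whatever cityscapes.py actually defines):

from cityscapes import CitySegmentation  # assumed class name; adjust to cityscapes.py

# Build the dataset and read a few samples to confirm images and masks load.
dataset = CitySegmentation(root='./datasets/citys', split='train')
print('found %d image/mask pairs' % len(dataset))

for i in range(min(3, len(dataset))):
    image, target = dataset[i]
    print(i, getattr(image, 'shape', type(image)), getattr(target, 'shape', type(target)))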
I ran cityscapes.py and got the following error even though I have the dataset downloaded.
./datasets/citys is the path to your Cityscapes dataset. Please step through _get_city_pairs step by step; I think the folder structure (datasets) does not match the data reading code.
The path to the datasets is ./datasets/citys/leftImg8bit and ./datasets/citys/gtFine, and they have their respective test, train and validation folders inside. How should they be arranged? Should I move every image out of the city folders in ./datasets/citys/leftImg8bit/train?
That is not necessary, but you need to modify the code to match your data folder layout.
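For reference, a common way to write _get_city_pairs for the standard Cityscapes layout is sketched below (a generic example, not the repo's exact function; its arguments and return values may differ). Because it walks the per-city subfolders with os.walk, the images do not need to be moved out of them:

import os

def _get_city_pairs(folder, split='train'):
    """Pair every image in leftImg8bit/<split>/<city>/ with its gtFine mask."""
    img_paths, mask_paths = [], []
    img_folder = os.path.join(folder, 'leftImg8bit', split)
    mask_folder = os.path.join(folder, 'gtFine', split)
    for root, _, files in os.walk(img_folder):
        for filename in files:
            if not filename.endswith('_leftImg8bit.png'):
                continue
            img_path = os.path.join(root, filename)
            # e.g. frankfurt_000000_000294_leftImg8bit.png -> frankfurt_000000_000294_gtFine_labelIds.png
            mask_name = filename.replace('_leftImg8bit.png', '_gtFine_labelIds.png')
            mask_path = os.path.join(mask_folder, os.path.basename(root), mask_name)
            if os.path.isfile(mask_path):
                img_paths.append(img_path)
                mask_paths.append(mask_path)
            else:
                print('cannot find the mask for', img_path)
    return img_paths, mask_paths

# Example: img_paths, mask_paths = _get_city_pairs('./datasets/citys', 'train')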
cityscapes.py is working without any problems. I made a small mistake in my previous try; here it is.
Please verify that the dataloader is running normally.
for i, (images, targets) in enumerate(self.train_loader):
    # cur_lr = self.lr_scheduler(cur_iters)
    # for param_group in self.optimizer.param_groups:
    #     param_group['lr'] = cur_lr

    images = images.to(self.args.device)
    targets = targets.to(self.args.device)
    print(images.shape)
    """
    outputs = self.model(images)
    loss = self.criterion(outputs, targets)

    self.optimizer.zero_grad()
    loss.backward()
    self.optimizer.step()

    cur_iters += 1
    if cur_iters % 10 == 0:
        print('Epoch: [%2d/%2d] Iter [%4d/%4d] || Time: %4.4f sec || lr: %.8f || Loss: %.4f' % (
            epoch, args.epochs, i + 1, len(self.train_loader),
            time.time() - start_time, cur_lr, loss.item()))
    """
When I make the mentioned changes, I still get the result I attached here. The thing is, I only face these problems when I try to run train.py on the GPU; on the CPU, it runs normally.
What if you only use a single GPU?
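One low-effort way to try this without touching the code (CUDA_VISIBLE_DEVICES is a standard CUDA environment variable, not something specific to this repo) is to hide all but one GPU before PyTorch initializes CUDA:

import os

# Must be set before torch initializes CUDA; equivalent to launching the
# script with CUDA_VISIBLE_DEVICES=0 on the command line.
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

import torch
print(torch.cuda.device_count())  # should now report 1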
The training started when I use only one GPU. However, the training is really slow right now: I'm using a GTX 1080 Ti, and it still takes around 4 days.
I'm trying to run the training on two GTX 1080 Ti cards, but it gets stuck on the first epoch. Is this due to some post-processing operations? How can I solve it?
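A quick way to tell a dataloader hang apart from a multi-GPU forward hang (a generic PyTorch check; the small stand-in network below is hypothetical and should be replaced with the model train.py actually builds) is to push random data through nn.DataParallel on both cards:

import torch
import torch.nn as nn

# Hypothetical stand-in network; swap in the model that train.py builds.
net = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.Conv2d(16, 19, 1))

# One forward pass on random data across both GPUs. If this also hangs,
# the problem is the multi-GPU setup rather than the Cityscapes dataloader.
model = nn.DataParallel(net.cuda(), device_ids=[0, 1])
dummy = torch.randn(4, 3, 256, 512).cuda()
with torch.no_grad():
    out = model(dummy)
print(out.shape)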