f90 / FactorGAN

FactorGAN - Training GANs with missing data
MIT License

Facing issue while starting the Training #1

Closed kalai2033 closed 4 years ago

kalai2033 commented 4 years ago

Hi, I wanted to use your model for my research. I tried to train the model as per the README file, but the training does not seem to start at all.

Namespace(L2=0.0, batchSize=25, beta1=0.5, cuda=True, dataset='cityscapes', disc_iter=2, epoch_iter=5000, epochs=40, eval=False, experiment_name='25_samples_factorGAN', factorGAN=1, generator_channels=32, lipschitz_p=1, lipschitz_q=1, loadSize=128, lr=0.0001, num_joint_samples=25, nz=50, objective='JSD', out_path='out', seed=1337, use_real_dep_disc=1, workers=1)
Random Seed: 1337
dataset [AlignedDataset] was created
START TRAINING!
Writing logs to out/Image2Image_cityscapes/25_samples_factorGAN/logs
2020-04-09 12:26:30.619713: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
................................

Nothing more appears in the console after that. Please help me out.

f90 commented 4 years ago

Hey! I suspect this is a problem with your CUDA installation, not something specific to my code. Are you able to run other PyTorch code successfully in your environment?

kalai2033 commented 4 years ago

Yes, I have been able to run other code. When I ran this on my local system, I got the error below:

2020-01-25 11:20:08.541504: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory
2020-01-25 11:20:08.541639: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory
2020-01-25 11:20:08.541689: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly

So I ran the code in a Google Colab environment instead. I am sharing my Colab notebook link:

https://colab.research.google.com/drive/1GdEO9zgHLPnXTZ6a00lXyj0B2pHfyOVa

Please check and let me know.

f90 commented 4 years ago

It's a bit confusing that the error occurs in TensorFlow, since this is a PyTorch project; the only TensorFlow code would be what TensorBoard uses. Did you try running the code in a pip virtualenv where you install only the packages listed in requirements.txt, to make sure TensorFlow does not interfere? Please pull the latest version of the code before you do that; I just changed requirements.txt to no longer include TensorFlow, along with a few other things.
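The clean-environment setup suggested above can be sketched roughly as follows (a minimal sketch; `factorgan-env` is a hypothetical environment name, and the `pip install` step assumes you run it from the repo root where requirements.txt lives):

```shell
# Create an isolated virtualenv so a system-wide TensorFlow install
# cannot interfere with the project's dependencies
python3 -m venv factorgan-env
. factorgan-env/bin/activate

# From the FactorGAN repo root, install only the pinned dependencies
[ -f requirements.txt ] && pip install -r requirements.txt
```

After activating the environment, launch the training script with the same arguments as before.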

Also see these posts where very similar issues are reported; maybe this helps: https://github.com/tensorflow/tensorflow/issues/38100 https://github.com/tensorflow/tensor2tensor/issues/1643

f90 commented 4 years ago

It might also be that the code is already running normally and just doesn't print anything during training! Check the output logs via TensorBoard. Also, pull the latest version of my code; I added a training progress bar, so you should now see text output at each training step. And definitely use virtualenv to create a clean environment, then install only the packages listed in requirements.txt into it.
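To inspect the logs mentioned above, TensorBoard can be pointed at the log directory that the training run prints at startup (the path below is taken from the console output earlier in this thread; yours will differ with a different dataset or experiment name):

```shell
# Serve the training curves on http://localhost:6006
tensorboard --logdir out/Image2Image_cityscapes/25_samples_factorGAN/logs
```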

kalai2033 commented 4 years ago

@f90 Thanks :)... It works now. I can see the epoch progress bar. I have a few more doubts:

1) Can I use a rectangular image without resizing? My input image size is 600*400.
2) I don't see any checkpoints created. Three epochs have completed so far.
3) How do I test the final model?

I'm using the following command to train:

!python Image2Image.py --cuda --batchSize=10 --loadSize 256 --dataset "diff" --num_joint_samples 300 --factorGAN 1 --experiment_name "diff"

f90 commented 4 years ago

Glad that it works now! Since you are raising a bunch of new points, I am going to create separate issues for them so we can handle each one individually. Closing this issue now; please post in the others from here on.