Encountering a problem while training with real HDR data

PK15946 commented 6 years ago

Hi again @gabrieleilertsen, I am working on a PyTorch implementation of your hdrcnn. I followed the instructions in your paper: load a pretrained vgg_places365 model, then train the network on the Places365 dataset. The result is fine, as shown below.

Input: [image: input]

Output: [image: predict]

However, when I load the pretrained model and train with real HDR data, the result unexpectedly degrades. Here is the result after 1 million steps: [image: predict]

Did you see this behaviour when training with real HDR data?

gabrieleilertsen commented 6 years ago

Is the test image you show JPEG compressed? And do you include compression in the training data? The artifacts look similar to those encountered when running inference on a JPEG compressed image using a model that is trained without compressed images...

PK15946 commented 6 years ago

The -jpeg_quality for generating the training and pretraining data is set to 100, and the test image is in .tif format. Could this be a problem? I also tested the same image using your model with your ORIGINAL parameters. Here is the result, which looks much better: [image: 000007_out]

gabrieleilertsen commented 6 years ago

Ok, then it seems to be some other problem. Have you tried to first run the tensorflow training script to see that you get the expected result? If so, there should be some difference in your pytorch implementation that causes the problem.

If you get the same problem with the tensorflow training, there is possibly some problem with the HDR training data. How many HDR images are you using?

PK15946 commented 6 years ago

The result above was trained with 688 HDR images (about 13000 after preprocessing). I suspect that this is too few, so I am collecting more data according to your list. But I can't find the HDR book dataset. Since it contains a considerable number of images, could you please tell me where I can find that data?

gabrieleilertsen commented 6 years ago

Yes, sounds like the limited number of HDR images could be the problem.

The images from the 2005 HDR book are not available online. A great online source of HDR images is the HDR videos by Fröhlich et al., which yield many images if you choose, e.g., every 10th frame.
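To make the "every 10th frame" suggestion concrete, here is a minimal Python sketch. The function name `subsample_frames` and the frame file names are hypothetical and used only for illustration; the thread does not say how the video frames are actually named or stored.

```python
def subsample_frames(frame_paths, step=10):
    """Return every `step`-th path, after sorting so frames are in order."""
    # Sorting assumes zero-padded frame numbers, so that name order
    # matches temporal order.
    return sorted(frame_paths)[::step]

# Hypothetical frame names standing in for one extracted HDR video sequence.
frames = [f"frame_{i:06d}.exr" for i in range(35)]
selected = subsample_frames(frames, step=10)
print(selected)
```

Neighbouring video frames are nearly identical, so a larger step trades the number of images for more variety between them.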

PK15946 commented 6 years ago

Hello, @gabrieleilertsen

I downloaded hdrcnn-master.zip, unzipped it, and followed your scripts exactly (except for the compilation, for which I used "gcc virtualcamera.cpp -o virtualcamera -lm -lstdc++ -lopencv_core -lopencv_imgproc -lopencv_imgcodecs -std=c++11"), but the network does not seem to train correctly: the training loss and validation loss are both NaN.

So I checked the training data, but it looks fine to me, as shown below.

[Images: im_000003_000001, im_000004_000001, im_000001_000001, im_000002_000001]

gabrieleilertsen commented 6 years ago

Very difficult to say what the problem can be without more information. My initial thought would be that there are pixel values <= 0 in the training data, or pixels that are NaN. The data preparation should take care of negative values (see rows 304-305 in the code). However, maybe something happened after that point, so to be sure you could check for this in the training images.
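A check of this kind could look roughly like the NumPy sketch below. It assumes the image has already been loaded into an array; reading the .bin files is left out because their layout is not described in this thread. The helper name `count_bad_pixels` is hypothetical.

```python
import numpy as np

def count_bad_pixels(img):
    """Count non-finite (NaN/inf) and non-positive (<= 0) values in an image."""
    img = np.asarray(img, dtype=np.float32)
    finite = np.isfinite(img)
    non_finite = int(np.count_nonzero(~finite))
    # Only compare finite values, so NaN/inf are not counted a second time.
    non_positive = int(np.count_nonzero(img[finite] <= 0))
    return {"non_finite": non_finite, "non_positive": non_positive}

# Small synthetic example with one NaN, one inf, one zero and one negative value.
sample = np.array([[0.5, 0.0], [np.nan, -1.0], [np.inf, 2.0]], dtype=np.float32)
report = count_bad_pixels(sample)
print(report)
```

Note that the test is `<= 0`, not `< 0`: exact zeros are also flagged.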

PK15946 commented 6 years ago

Thanks for kindly helping me with this, @gabrieleilertsen. I checked my dataset, and there are no pixel values that are < 0 or NaN in the .bin and .jpg files. But the training loss is still NaN. I mean no offense, but could you try the code as downloaded from GitHub?

gabrieleilertsen commented 6 years ago

The training script was tested before I uploaded it. There must be some problem either with the training data or with a difference in the installation environment. Maybe you could start by testing on a very small number of HDR images...

PK15946 commented 6 years ago

I used 2 images, one for training and the other for validation, but I still get a NaN loss. Could you upload 5 training pairs (.bin and .jpg) so that I can pin down the problem?

gabrieleilertsen commented 6 years ago

Sorry for the late reply. I hope you were able to find the problem.

I was going to look at this a few days ago, but I had updated my CUDA version, after which the OpenCV CUDA support stopped working. When recompiling OpenCV I ran into problems. Apparently, there are some problems with compiling OpenCV against CUDA 9.0, and I did not have time to struggle with this. Consequently, I cannot run the virtual camera application to generate training data at the moment. I can get back to this when I find the time, if you still have problems getting the training to run.

PK15946 commented 6 years ago

Thanks in advance! I am not familiar with TensorFlow, but I have been learning to debug it these days. tfdbg told me that the first place an inf value occurs is 'encoder/h5/conv_1/Conv2d:0'. I am not sure whether this is a training data problem, because if it were, I would expect the inf/nan values to appear in the first layer. I don't want this problem to impede your current research, so I will keep trying, and your help is welcome whenever you are free.

[Images: selection_005, selection_006]

gabrieleilertsen commented 6 years ago

Was able to test it now, and I cannot see any problems. I uploaded some images here, so you can test if you encounter the same problem with this data. There are 5 original exr images, which are processed to generate 120 binary and jpg training pairs.

You can try both using original images, and process them with the virtual camera application:

python3 hdrcnn_train.py --raw_dir=test_data/original --vgg_path=vgg16_places365_weights.npy --preprocess=1

Or you can run with the provided processed images:

python3 hdrcnn_train.py --data_dir=test_data/training_data --vgg_path=vgg16_places365_weights.npy --preprocess=0

You may have to reduce the batch size depending on available memory. The VGG initialization weights are available here.

I hope this can help you find your problem. If you can run the above trainings without problems, there is something wrong with your training data. If you cannot, the problem is due to differences in our training environments, and it would be good to find out what they are.

PK15946 commented 6 years ago

That data really helped me a lot! The NaN problem turned out to be in the VGG weight initialization file I had downloaded earlier. I never expected it to be corrupted, because the file loaded normally, without any warning or error.
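Because a corrupted weights file can load without any warning, it may be worth scanning the loaded weights for NaN or inf values before training. The sketch below is an illustration only: it assumes the weights end up as a nested dict or list of NumPy arrays, which this thread does not confirm for vgg16_places365_weights.npy, and the helper name `find_non_finite` and the layer names are hypothetical.

```python
import numpy as np

def find_non_finite(obj, prefix=""):
    """Recursively list the names of float arrays that contain NaN or inf."""
    bad = []
    if isinstance(obj, dict):
        for key, value in obj.items():
            bad += find_non_finite(value, f"{prefix}/{key}")
    elif isinstance(obj, (list, tuple)):
        for i, value in enumerate(obj):
            bad += find_non_finite(value, f"{prefix}[{i}]")
    else:
        arr = np.asarray(obj)
        # Only floating-point arrays can hold NaN/inf; other types are skipped.
        if arr.dtype.kind == "f" and not np.isfinite(arr).all():
            bad.append(prefix)
    return bad

# Hypothetical stand-in for loaded weights, with one deliberately corrupted array.
weights = {
    "conv1_1": {"weights": np.ones((3, 3), dtype=np.float32),
                "biases": np.zeros(3, dtype=np.float32)},
    "conv5_1": {"weights": np.array([1.0, np.nan], dtype=np.float32),
                "biases": np.zeros(2, dtype=np.float32)},
}
bad_arrays = find_non_finite(weights)
print(bad_arrays)
```

For a real file, the input would come from something like `np.load(path, allow_pickle=True)`. If the file stores a pickled dict, the result is a 0-d object array, and `.item()` is needed to get the dict back before scanning it. Only use `allow_pickle=True` on files from a source you trust.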

ghost commented 5 years ago

@PK15946 Hello, could you send me a copy of the vgg16_places365_weights.npy file? The link provided by the author does not work for downloading. zhuixun10@foxmail.com, thank you!

wbhu commented 4 years ago

Hi @PK15946,

I would also like to implement this work in PyTorch. Could you please share your code? Thanks very much.

weiwang90 commented 4 years ago

Hi, could you share the vgg16_places365_weights.npy file with me? The linked website is down. My email is 1300027@wust.edu.cn
