BinWang-shu / M2FCN

Network initialization and testing errors #3

Open aditipanda opened 5 years ago

aditipanda commented 5 years ago

@BinWang-shu There are three prototxt files: train_val, train_val_R, and train_val_R2. As per the paper, there are 3 stages in the network. So you first train a single-stage network, then use it to initialize the 2-stage network, and so on.

Can we use the single-stage network's output to initialize the 3-stage network without training the 2-stage network? How many training iterations are required for all three stages?

I have already trained the network with train_val.prototxt for 30000 iterations. One thing I noticed, which felt odd, is that train_val.prototxt and the others (train_val_R, train_val_R2) have batch_size set to 1. Is that a mistake?

aditipanda commented 5 years ago

While trying to test the above-mentioned single-stage network with deploy.prototxt (since it is the testing code for the single-stage network), I got this error:

Traceback (most recent call last):
  File "result.py", line 70, in <module>
    img = img[:,:,::-1]
IndexError: too many indices for array

I have not come across this slicing syntax (img[:,:,::-1]) before. Could you kindly help?

BinWang-shu commented 5 years ago

You can use the single-stage network's output to initialize the 3-stage network; however, I am not sure it will give the best result. We used 30000 iterations for all three stages, but you should adjust this to your actual situation. We set batch_size to 1 for training; it is not a mistake. You also need to ensure that the images have three channels during testing; the error in 'img = img[:,:,::-1]' can happen when the input image is a grayscale map.
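
A minimal NumPy illustration of why that slicing fails on a grayscale image (the shapes here are only for illustration):

```python
import numpy as np

# img[:, :, ::-1] reverses the last (channel) axis, e.g. BGR -> RGB,
# so it needs a 3-D (H, W, C) array.
color = np.zeros((512, 512, 3), dtype=np.uint8)
rgb = color[:, :, ::-1]    # fine: shape stays (512, 512, 3)

gray = np.zeros((512, 512), dtype=np.uint8)
# gray[:, :, ::-1]         # raises IndexError: too many indices for array
```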

aditipanda commented 5 years ago

Okay. Thank you, @BinWang-shu . So it would be better to use the 30000-iteration snapshot of the single-stage network to initialize the 2-stage network, and then the 30000-iteration snapshot of the 2-stage network to initialize the 3-stage network?

For testing, which data do you use? The ISBI data set has three volumes: train-volume.tif, label-volume.tif, and test-volume.tif. All three give grayscale images when extracted. Am I supposed to convert the test images to three channels before testing?

Also, for training, do you use an augmented data set? For best results, we'd need an augmented data set, right?

BinWang-shu commented 5 years ago

You need to convert the test images to three channels before testing, for example by copying the single channel three times. You also need to augment the dataset.
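
A minimal sketch of that conversion, assuming PIL and NumPy are available (the file names are placeholders):

```python
import numpy as np
from PIL import Image

# Replicate a single-channel ISBI test slice into three identical channels.
gray = np.array(Image.open('test_slice.png').convert('L'))   # (H, W)
rgb = np.stack([gray, gray, gray], axis=-1)                   # (H, W, 3)
Image.fromarray(rgb).save('test_slice_rgb.png')
```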

aditipanda commented 5 years ago

Okay. And this is the way I should train: use the 30000-iteration snapshot of the single-stage network to initialize the 2-stage network, and then use the 30000-iteration snapshot of the 2-stage network to initialize the 3-stage network?

BinWang-shu commented 5 years ago

Yes, this is the way we did it.
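
A rough pycaffe sketch of that schedule; the solver and snapshot file names below are assumptions, not the repository's actual names:

```python
import caffe

caffe.set_mode_gpu()

# Stage 1: train the single-stage network (train_val.prototxt) from scratch.
solver1 = caffe.SGDSolver('solver.prototxt')
solver1.solve()                                        # snapshots e.g. stage1_iter_30000.caffemodel

# Stage 2: initialize the 2-stage network (train_val_R.prototxt) from the stage-1 snapshot.
solver2 = caffe.SGDSolver('solver_R.prototxt')
solver2.net.copy_from('stage1_iter_30000.caffemodel')
solver2.solve()

# Stage 3: initialize the 3-stage network (train_val_R2.prototxt) from the stage-2 snapshot.
solver3 = caffe.SGDSolver('solver_R2.prototxt')
solver3.net.copy_from('stage2_iter_30000.caffemodel')
solver3.solve()
```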

aditipanda commented 5 years ago

Okay. Thank you.

aditipanda commented 5 years ago

I am just curious. What was your training time for the single-stage network, assuming you were using the augmented data?

On my system, the training time with only 30 images was the same as it is now with 2783 (augmented) images. I think something is wrong.

I checked the execution output on the terminal, and it says "A total of 2783 images", which means it has read all images. Then why isn't the training time increasing?

I used the "watch -n 1 nvidia-smi" command to see how much GPU memory is being used by this program. It is using only 2254 MB, which is the same amount as that used when only 30 images were used for training. Is this because the train_batch_size is 1? But even then it would have to process 2783 batches.

BinWang-shu commented 5 years ago

Every iteration inputs one image, so the training time is not related to the number of training images. It depends on the size of the images, the number of iterations, and the computing power.

aditipanda commented 5 years ago

So how can I make sure it is processing all the augmented images?

BinWang-shu commented 5 years ago

Ensure that the number of iterations is bigger than the number of images.
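
In other words, with batch_size 1, dataset coverage depends only on the iteration count. A back-of-the-envelope check using the numbers from this thread:

```python
# With batch_size = 1, every iteration draws exactly one image.
max_iter   = 30000
batch_size = 1
num_images = 2783                                   # augmented training set
epochs = max_iter * batch_size / float(num_images)  # ~10.8 passes over the data
```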

aditipanda commented 5 years ago

I made the test images three-channel as per your suggestion, and those indexing errors are gone now, but when I run result.py I get the following error:

F1013 16:54:25.805397 20479 syncedmem.cpp:19] Check failed: error == cudaSuccess (29 vs. 0)  driver shutting down
*** Check failure stack trace: ***
Aborted (core dumped)

I read about this error, and many GitHub threads said it is a memory issue, so I tried reducing the number of images in the test folder to 1. That didn't work.

I also reduced the size of individual images, but that didn't work either.

Did you face any such error during testing? I am using an NVIDIA GeForce GTX 980 Ti.

Total memory: 6144 MB; total dedicated memory: 6073 MB.

BinWang-shu commented 5 years ago

This is unrelated to the number of testing images. Maybe you need to further reduce the size of the images.

aditipanda commented 5 years ago

What size image did you use? I tried with as low as 64-by-64, but it still won't work :(

The original images are 512-by-512. Reducing the image size will affect performance because of the loss of resolution, don't you think?

What are the specifications of the GPU you used for testing this code?

Also, I read in a GitHub comment (https://github.com/BVLC/caffe/issues/5416#issuecomment-287506638) that one should try running "make runtest -j$(nproc)" at the Caffe root and see if it works. Is this necessary to make sure that testing works?

Anyway, I tried doing that. In this code, the caffe root is the M2FCN-master directory, right? It failed with the following stack trace:

[----------] Global test environment tear-down
[==========] 1583 tests from 234 test cases ran. (169699 ms total)
[  PASSED  ] 1575 tests.
[  FAILED  ] 8 tests, listed below:
[  FAILED  ] SigmoidCrossEntropyLossLayerTest/0.TestGradient, where TypeParam = caffe::CPUDevice
[  FAILED  ] SigmoidCrossEntropyLossLayerTest/0.TestSigmoidCrossEntropyLoss, where TypeParam = caffe::CPUDevice
[  FAILED  ] SigmoidCrossEntropyLossLayerTest/1.TestSigmoidCrossEntropyLoss, where TypeParam = caffe::CPUDevice
[  FAILED  ] SigmoidCrossEntropyLossLayerTest/1.TestGradient, where TypeParam = caffe::CPUDevice
[  FAILED  ] SigmoidCrossEntropyLossLayerTest/2.TestGradient, where TypeParam = caffe::GPUDevice
[  FAILED  ] SigmoidCrossEntropyLossLayerTest/2.TestSigmoidCrossEntropyLoss, where TypeParam = caffe::GPUDevice
[  FAILED  ] SigmoidCrossEntropyLossLayerTest/3.TestSigmoidCrossEntropyLoss, where TypeParam = caffe::GPUDevice
[  FAILED  ] SigmoidCrossEntropyLossLayerTest/3.TestGradient, where TypeParam = caffe::GPUDevice

 8 FAILED TESTS
  YOU HAVE 2 DISABLED TESTS

E1014 16:37:23.711421  4028 common.cpp:104] Cannot create Cublas handle. Cublas won't be available.
E1014 16:37:23.711520  4028 common.cpp:111] Cannot create Curand generator. Curand won't be available.
F1014 16:37:23.711552  4028 syncedmem.cpp:19] Check failed: error == cudaSuccess (29 vs. 0)  driver shutting down
*** Check failure stack trace: ***
    @     0x7f463a7bc5cd  google::LogMessage::Fail()
    @     0x7f463a7be433  google::LogMessage::SendToLog()
    @     0x7f463a7bc15b  google::LogMessage::Flush()
    @     0x7f463a7bee1e  google::LogMessageFatal::~LogMessageFatal()
    @     0x7f463965ee40  caffe::SyncedMemory::~SyncedMemory()
    @           0x465c82  boost::detail::sp_counted_impl_p<>::dispose()
    @           0x4654ca  boost::detail::sp_counted_base::release()
    @     0x7f463892d36a  __cxa_finalize
    @     0x7f463950a3f3  (unknown)
Makefile:514: recipe for target 'runtest' failed
make: *** [runtest] Aborted (core dumped)

I have also installed Caffe at the root of my Ubuntu installation, and this command works just fine there. Do these things affect the testing code? I am new to Caffe and am lost here. Please help.

BinWang-shu commented 5 years ago

I used a 1080 Ti and trained the model with 512x512 images. I am not sure what the problem is.

aditipanda commented 5 years ago

Okay.

aditipanda commented 5 years ago

@BinWang-shu

I could not solve the earlier issue, but since it was occurring at the end of the testing process, I could ignore it.

I read your paper and found that you use the watershed algorithm to improve the results of your networks. Do you use this code: https://github.com/seung-lab/Watershed.jl ?

If yes, do you convert the PNG outputs to HDF5 format? The enhanced segmentation masks produced by the proposed network are the ones given as inputs, right?
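
For reference, this is the kind of conversion I have in mind, assuming h5py and PIL are available; the output path, dataset name ('main'), and the layout expected by the watershed code are guesses on my part:

```python
import glob
import numpy as np
import h5py
from PIL import Image

# Stack the per-slice PNG probability maps into one volume and write it to HDF5.
files = sorted(glob.glob('results/*.png'))
vol = np.stack([np.asarray(Image.open(f).convert('L'), dtype=np.float32) / 255.0
                for f in files], axis=0)            # (Z, H, W) volume in [0, 1]

with h5py.File('predictions.h5', 'w') as f:
    f.create_dataset('main', data=vol, compression='gzip')
```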