CRBS / cdeep3m

Please go to https://github.com/CRBS/cdeep3m2 for the most recent version

problems executing the example scripts #46

Closed damsport11 closed 5 years ago

damsport11 commented 5 years ago

Hi, I installed the software on our Ubuntu 16.04 workstation with CUDA 8.0 and tried to run the example scripts, but both fail with errors. Any suggestions? Thanks in advance! Kevin

k.knoops@nano:~$ runprediction.sh ~/sbem/mitochrondria/xy5.9nm40nmz/30000iterations_train_out ~/cdeep3m-1.4.0/mito_testsample/testset/ ~/predictout30k
Starting Image Augmentation
Check image size of: /home/local/UNIMAAS/k.knoops/cdeep3m-1.4.0/mito_testsample/testset/
Reading file: /home/local/UNIMAAS/k.knoops/cdeep3m-1.4.0/mito_testsample/testset/images.081.png
z_blocks =

   1   5

Start up worker to generate packages to process
Start up worker to run prediction on packages
Start up worker to run post processing on packages

To see progress run the following command in another window:

tail -f /home/local/UNIMAAS/k.knoops/predictout30k/logs/*.log
error: 'fileformats' undefined near line 13 column 30
error: called from
    filter_files at line 13 column 23
    /home/local/UNIMAAS/k.knoops/cdeep3m-1.4.0/EnsemblePredictions.m at line 35 column 12
error: evaluating argument list element number 1
error: called from
    filter_files at line 13 column 23
    /home/local/UNIMAAS/k.knoops/cdeep3m-1.4.0/EnsemblePredictions.m at line 35 column 12
ERROR, a non-zero exit code (1) was received from: EnsemblePredictions.m /home/local/UNIMAAS/k.knoops/predictout30k/1fm /home/local/UNIMAAS/k.knoops/predictout30k/3fm /home/local/UNIMAAS/k.knoops/predictout30k/5fm /home/local/UNIMAAS/k.knoops/predictout30k/ensembled
k.knoops@nano:~$

k.knoops@nano:~/cdeep3m-1.4.0$ ./runtraining.sh /home/local/UNIMAAS/k.knoops/mito_testaugtrain ~/output
Verifying input training data is valid ... success
Copying over model files and creating run scripts ... success

A new directory has been created: /home/local/UNIMAAS/k.knoops/output
In this directory are 3 directories 1fm,3fm,5fm which correspond to 3 caffe models that need to be trained

Detected 2 GPU(s). Will run in parallel.
ERROR: caffe had a non zero exit code: 127
ERROR: caffe had a non zero exit code: 127
ERROR: caffe had a non zero exit code: 127
ERROR: caffe had a non zero exit code: 127
ERROR: caffe had a non zero exit code: 127
ERROR: caffe had a non zero exit code: 127
ERROR: caffe had a non zero exit code: 127
ERROR: caffe had a non zero exit code: 127
ERROR: caffe had a non zero exit code: 127
ERROR: caffe had a non zero exit code: 127
ERROR: caffe had a non zero exit code: 127
ERROR: caffe had a non zero exit code: 127
ERROR: caffe had a non zero exit code: 127
ERROR: caffe had a non zero exit code: 127
ERROR: caffe had a non zero exit code: 127
Non zero exit code from caffe for train of model. Exiting.
ERROR, a non-zero exit code (1) was received from: trainworker.sh --numiterations 30000
k.knoops@nano:~/cdeep3m-1.4.0$

haberlmatt commented 5 years ago

Which GPUs are you using? The script detected 2 GPUs in your system.

For the prediction, the error I can see is in the ensemble step: there are no predictions from the individual models to combine. Did you get any predictions for the individual models?

Maybe you can try running a prediction for just one model first, e.g. 1fm, and see if that works.

haberlmatt commented 5 years ago

Also, can you check whether there are more specific error messages in the other log files and send them? They might indicate whether you are running out of GPU memory.
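One way to look for more specific messages is to search the whole log directory for error-like lines. The following is only a minimal sketch and not part of cdeep3m: scan_logs is a hypothetical helper, and the search patterns are assumptions based on the messages quoted in this thread.

```shell
# Sketch: print every line under a log directory that looks like a failure.
# scan_logs is a hypothetical helper; the patterns are guesses taken from
# messages quoted in this thread, not an official list.
scan_logs() {
    local logdir="$1"
    # -r recurse, -i ignore case, -n show line numbers, -E extended regex
    grep -rinE 'error|out of memory|KILL\.REQUEST' "$logdir"
}

# Example (log directory taken from the output above):
#   scan_logs ~/predictout30k/logs
```

Each match is printed as file:line:text, which makes it easier to see which worker failed first.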

damsport11 commented 5 years ago

Thanks for the help. Indeed, it is missing the sbem-data, which is not in the GitHub repository. Is there any place where I can retrieve it?

For the second problem: yesterday I was running it on a workstation with a Quadro K2000 (GPU0; 2 GB) and a Tesla K20c (GPU1; 5 GB). I would like to use only the K20c for my runs, so I changed gpu="all" to gpu="1" in "trainworker.sh". Nonetheless, I still see the Quadro running during my trial runs. I am now shifting to another workstation with a Quadro K2000 (GPU0; 2 GB) and a GeForce GTX Titan (12 GB) to see whether the GPU memory problem is resolved. I'll update when I have shifted.

Thanks again for the help. Cheers, Kèvin

coleslaw481 commented 5 years ago

Hi, we are working to get the trained models onto the Cell Image Library (http://www.cellimagelibrary.org/cdeep3m), but in the meantime you can get the trained model from a link on S3:

https://s3-us-west-2.amazonaws.com/cdeep3m-trainedmodels/sbem/mitochrondria/xy5.9nm40nmz/sbem_mitochrondria_xy5.9nm40nmz_30000iter_trainedmodel.tar.gz

A larger training dataset can be downloaded here: https://s3-us-west-2.amazonaws.com/cdeep3m-trainedmodels/sbem/mitochrondria/xy5.9nm40nmz/sbem_mitochrondria_xy5.9nm40nmz_30000iter_trainedmodel.tar.gz

With regard to running on a specific GPU, trainworker.sh has a --gpu flag that lets one specify the GPU to use as a number, i.e. 0, 1, etc. I will make a ticket to add --gpu to runtraining.sh and runprediction.sh so it's easier to deal with this, but in the meantime adding --gpu 0 to the argument list for trainworker.sh on line 152 of runtraining.sh should do the trick.

chris
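To decide which number to pass to --gpu, it can help to list the GPUs first. The sketch below is an illustration, not part of cdeep3m: largest_gpu_index is a hypothetical helper that reads the CSV output of nvidia-smi and prints the index of the card with the most memory. Note the assumption that the index reported by nvidia-smi matches the numbering the training tool uses; the two orderings are not guaranteed to agree, so it is worth confirming with a short test run.

```shell
# Sketch: pick the index of the GPU with the most memory from CSV lines of
# the form "index, name, memory_mib", as printed by
#   nvidia-smi --query-gpu=index,name,memory.total --format=csv,noheader,nounits
# largest_gpu_index is a hypothetical helper name.
largest_gpu_index() {
    # Reads CSV lines on stdin; prints the index of the largest-memory GPU.
    awk -F', ' '$3 + 0 > max { max = $3 + 0; idx = $1 }
                END { if (idx != "") print idx }'
}

# Example on a machine with NVIDIA drivers installed:
#   gpu=$(nvidia-smi --query-gpu=index,name,memory.total \
#           --format=csv,noheader,nounits | largest_gpu_index)
#   echo "candidate argument: --gpu $gpu"
```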

damsport11 commented 5 years ago

Thanks Chris! That works beautifully and now without errors ;-) One last question before the segmentation can start: when I add the --1fmonly option, it complains. Is that a bug, or should I specify it differently?

k.knoops@atto:~$ runtraining.sh --1fmonly --numiterations 100 ~/mito_testaugtrain ~/train_out
octave: unable to open X11 DISPLAY
octave: disabling GUI features
Verifying input training data is valid ... success
Copying over model files and creating run scripts ... success

A new directory has been created: /home/local/UNIMAAS/k.knoops/train_out
In this directory are 3 directories 1fm,3fm,5fm which correspond to 3 caffe models that need to be trained
$0: unrecognized option '--1fmonly'

Detected 2 GPU(s). Using only GPU 0

Thanks again for the help, I really appreciate it! Kevin

coleslaw481 commented 5 years ago

Hi, great to hear. Oops, there is a bug in runtraining.sh; I'll make a ticket to fix this. To fix your copy, change line 94 of runtraining.sh to this:

--1fmonly ) one_fmonly="--models 1fm " ; shift ;;

The change is replacing "--1fmonly " with "--models 1fm ". Be sure to keep the space after 1fm and before the double quotes above.
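To illustrate why the trailing space matters, here is a minimal sketch of a case-based option loop. Only the --1fmonly branch is taken from the line quoted above; parse_args, the other branches, and the way worker_args is assembled are hypothetical and are not the actual contents of runtraining.sh.

```shell
# Sketch of a case-based option loop. Only the --1fmonly branch comes from
# the fix quoted above; everything else here is hypothetical.
parse_args() {
    one_fmonly=""
    numiterations=""
    while [ $# -gt 0 ]; do
        case "$1" in
            --1fmonly ) one_fmonly="--models 1fm " ; shift ;;
            --numiterations )
                [ $# -ge 2 ] || return 1      # option needs a value
                numiterations="$2" ; shift 2 ;;
            * ) break ;;                       # first positional argument
        esac
    done
    # The value is pasted directly in front of the next option, so without
    # the trailing space the two options would run together.
    worker_args="${one_fmonly}--numiterations ${numiterations}"
}

# Example:
#   parse_args --1fmonly --numiterations 100 ~/mito_testaugtrain ~/train_out
#   echo "$worker_args"
```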

coleslaw481 commented 5 years ago

Hi,

I just released version 1.5.0, which fixes the issues you ran into. I'm going to close this ticket, but feel free to re-open it or create a new ticket if you run into any issues.

chris

damsport11 commented 5 years ago

Hi Chris, the training is working fine now; the prediction, however, is not. For both tutorials I get the same error:

k.knoops@peta:~$ runprediction.sh ~/Scratch/sbem/mitochrondria/xy5.9nm40nmz/30000iterations_train_out /opt/scisoft/cdeep3m-1.5.0/mito_testsample/testset/ ~/predictout30k
Starting Image Augmentation
Check image size of: /opt/scisoft/cdeep3m-1.5.0/mito_testsample/testset/
Reading file: /opt/scisoft/cdeep3m-1.5.0/mito_testsample/testset/images.081.png
z_blocks =

   1   5

Start up worker to generate packages to process
Start up worker to run prediction on packages
Start up worker to run post processing on packages

To see progress run the following command in another window:

tail -f /home/local/UNIMAAS/k.knoops/predictout30k/logs/*.log
/home/local/UNIMAAS/k.knoops/predictout30k/1fm not a directory
Please use: EnsemblePredictions ./inputdir1 ./inputdir2 ./inputdir3 ./outputdir
ERROR file found. Something went wrong
ERROR, a non-zero exit code (1) received from PreprocessPackage.m 001 01 1fm 1 8

Transcript of the log files:

k.knoops@peta:~$ tail -f /home/local/UNIMAAS/k.knoops/predictout/logs/*.log
==> /home/local/UNIMAAS/k.knoops/predictout/logs/postprocess.log <==
Running Postprocess

Trained Model Dir: /home/local/UNIMAAS/k.knoops/train_out_petaAll/
Image Dir: /opt/scisoft/cdeep3m/mito_testsample/testset/
Models: 1fm,3fm,5fm
Speed: 1

For model 1fm postprocessing Pkg001_Z01 1 of 1
Waiting for /home/local/UNIMAAS/k.knoops/predictout/1fm/Pkg001_Z01 to finish processing
KILL.REQUEST file found. Exiting

==> /home/local/UNIMAAS/k.knoops/predictout/logs/prediction.log <==

Running Prediction

Trained Model Dir: /home/local/UNIMAAS/k.knoops/train_out_petaAll/
Image Dir: /opt/scisoft/cdeep3m/mito_testsample/testset/
Models: 1fm,3fm,5fm
Speed: 1

For model 1fm preprocessing Pkg001_Z01 1 of 1
KILL.REQUEST file found. Exiting

==> /home/local/UNIMAAS/k.knoops/predictout/logs/preprocess.log <==

Running PreprocessPackage

Trained Model Dir: /home/local/UNIMAAS/k.knoops/train_out_petaAll/
Image Dir: /opt/scisoft/cdeep3m/mito_testsample/testset/
Models: 1fm,3fm,5fm
Speed: 1

Preprocessing Pkg001_Z01 in model 1fm
ERROR, a non-zero exit code (1) received from PreprocessPackage.m 001 01 1fm 1

Thanks again ;-)

coleslaw481 commented 5 years ago

Hi, in the current implementation PreprocessPackage.m writes files to the input images directory. In your case this is /opt/scisoft/cdeep3m-1.5.0/mito_testsample/testset/, and the error suggests that this directory is not writable. To fix this, either allow write access to the directory, or copy the images to a directory with write access and invoke runprediction.sh against the copied directory. I'm going to make a ticket so these files get written to the output directory.

chris
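The second option described above (copy the images to a writable location) can be sketched as follows. This is a hedged illustration only: writable_image_dir is a hypothetical helper, and the scratch location in the example is arbitrary.

```shell
# Sketch: return a writable image directory, copying the images to a scratch
# location if the original is read-only. writable_image_dir and the scratch
# path are assumptions for illustration, not part of cdeep3m.
writable_image_dir() {
    local src="$1" scratch="$2"
    if [ -w "$src" ]; then
        printf '%s\n' "$src"              # already writable: use as-is
    else
        mkdir -p "$scratch" || return 1
        cp -r "$src"/. "$scratch"/ || return 1
        printf '%s\n' "$scratch"          # use the writable copy instead
    fi
}

# Example (image path from the comment above; the other paths are arbitrary):
#   imgdir=$(writable_image_dir /opt/scisoft/cdeep3m-1.5.0/mito_testsample/testset ~/testset_copy)
#   runprediction.sh ~/train_out "$imgdir" ~/predictout
```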

damsport11 commented 5 years ago

Good point; however, that does not seem to be causing the error I receive. I have placed everything in my home folder and given it full rights (777). The output is below:

k.knoops@peta:~$ runprediction.sh ~/train_out_peta/ ~/mito_testsample/testset/ ~/predictout
Starting Image Augmentation
Check image size of: /home/local/UNIMAAS/k.knoops/mito_testsample/testset/
Reading file: /home/local/UNIMAAS/k.knoops/mito_testsample/testset/images.081.png
z_blocks =

   1   5

Start up worker to generate packages to process
Start up worker to run prediction on packages
Start up worker to run post processing on packages

To see progress run the following command in another window:

tail -f /home/local/UNIMAAS/k.knoops/predictout/logs/*.log
/home/local/UNIMAAS/k.knoops/predictout/1fm not a directory
Please use: EnsemblePredictions ./inputdir1 ./inputdir2 ./inputdir3 ./outputdir
ERROR file found. Something went wrong
ERROR, a non-zero exit code (1) received from PreprocessPackage.m 001 01 1fm 1 8
k.knoops@peta:~$ cd predictout/
k.knoops@peta:~/predictout$ ls
augimages  de_augmentation_info.mat  ERROR  KILL.REQUEST  logs  package_processing_info.txt  predict.config  readme.txt
k.knoops@peta:~/predictout$ ll
total 40
drwxr-xr-x  4 k.knoops domain users 4096 Sep 18 10:06 ./
drwxr-xr-x 60 k.knoops domain users 4096 Sep 18 10:06 ../
drwxr-xr-x  3 k.knoops domain users 4096 Sep 18 10:06 augimages/
-rw-r--r--  1 k.knoops domain users  378 Sep 18 10:06 de_augmentation_info.mat
-rw-r--r--  1 k.knoops domain users   81 Sep 18 10:06 ERROR
-rw-r--r--  1 k.knoops domain users   79 Sep 18 10:06 KILL.REQUEST
drwxr-xr-x  2 k.knoops domain users 4096 Sep 18 10:06 logs/
-rw-r--r--  1 k.knoops domain users   45 Sep 18 10:06 package_processing_info.txt
-rw-r--r--  1 k.knoops domain users  164 Sep 18 10:06 predict.config
-rw-r--r--  1 k.knoops domain users  852 Sep 18 10:06 readme.txt
k.knoops@peta:~/predictout$ ll ~/mito_testsample/testset/
total 4724
drwxrwxrwx 3 k.knoops domain users   4096 Sep 18 09:44 ./
drwxrwxrwx 5 k.knoops domain users   4096 Sep 18 09:41 ../
-rwxrwxrwx 1 k.knoops domain users 965875 Sep 18 09:41 images.081.png
-rwxrwxrwx 1 k.knoops domain users 964476 Sep 18 09:41 images.082.png
-rwxrwxrwx 1 k.knoops domain users 962790 Sep 18 09:41 images.083.png
-rwxrwxrwx 1 k.knoops domain users 958437 Sep 18 09:41 images.084.png
-rwxrwxrwx 1 k.knoops domain users 963317 Sep 18 09:41 images.085.png
drwxrwxrwx 2 k.knoops domain users   4096 Sep 18 10:06 temp/
k.knoops@peta:~/predictout$ ll ~/train_out_peta/
total 40
drwxrwxrwx  5 k.knoops domain users 4096 Sep 17 12:46 ./
drwxr-xr-x 60 k.knoops domain users 4096 Sep 18 10:06 ../
drwxrwxrwx  4 k.knoops domain users 4096 Sep 17 12:46 1fm/
drwxrwxrwx  4 k.knoops domain users 4096 Sep 17 13:14 3fm/
drwxrwxrwx  4 k.knoops domain users 4096 Sep 17 13:44 5fm/
-rwxrwxrwx  1 k.knoops domain users  270 Sep 17 12:46 parallel.jobs
-rwxrwxrwx  1 k.knoops domain users  573 Sep 17 12:46 readme.txt
-rwxrwxrwx  1 k.knoops domain users 1191 Sep 17 12:46 train_file.txt
-rwxrwxrwx  1 k.knoops domain users 1191 Sep 17 12:46 valid_file.txt
-rwxrwxrwx  1 k.knoops domain users    6 Sep 17 12:46 VERSION*
k.knoops@peta:~/predictout$

damsport11 commented 5 years ago

Sorry for the strikethrough... please ignore ;-)

haberlmatt commented 5 years ago

Hi, can you post the content of ~/predictout/logs/preprocess.log?

damsport11 commented 5 years ago

k.knoops@atto:~$ cat ~/predictout/logs/preprocess.log

Running PreprocessPackage

Trained Model Dir: /home/local/UNIMAAS/k.knoops/train_out_peta/
Image Dir: /home/local/UNIMAAS/k.knoops/mito_testsample/testset/
Models: 1fm,3fm,5fm
Speed: 1

Preprocessing Pkg001_Z01 in model 1fm
ERROR, a non-zero exit code (1) received from PreprocessPackage.m 001 01 1fm 1

Running PreprocessPackage

Trained Model Dir: /home/local/UNIMAAS/k.knoops/train_out_peta/
Image Dir: /home/local/UNIMAAS/k.knoops/mito_testsample/testset/
Models: 1fm,3fm,5fm
Speed: 1

Preprocessing Pkg001_Z01 in model 1fm
ERROR, a non-zero exit code (1) received from PreprocessPackage.m 001 01 1fm 1

coleslaw481 commented 5 years ago

Hi, one more log file request: could you send us the preproc.1fm.*log file? It should be under the ~/predictout/augimages/ directory.

thanks,

chris

haberlmatt commented 5 years ago

My guess is that one of the Python packages is missing on your system; we'll update the documentation regarding this (we will know more once you send us the log file Chris mentioned). Most likely one of these three is missing: cv2 (OpenCV), joblib, or requests. See the packages used in: https://github.com/CRBS/cdeep3m/blob/master/scripts/functions/crop_png.py
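A quick way to test this guess is to try importing each module and report the ones that fail. The sketch below is not part of cdeep3m: missing_modules is a hypothetical helper, and the interpreter name passed to it is an assumption that may need to be python2 or python3 depending on which interpreter the installation uses.

```shell
# Sketch: report which Python modules fail to import. The module names come
# from the comment above; missing_modules is a hypothetical helper and the
# interpreter name is an assumption about the local installation.
missing_modules() {
    local py="$1"; shift
    local m
    for m in "$@"; do
        # Try the import quietly; print the module name only if it fails.
        "$py" -c "import $m" >/dev/null 2>&1 || printf '%s\n' "$m"
    done
}

# Example:
#   missing_modules python cv2 joblib requests
```

An empty result means all listed modules imported successfully with that interpreter.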

damsport11 commented 5 years ago

Indeed, it is the cv2 module for Python...

k.knoops@nano:~/predictout/augimages$ cat preproc.1fm.Pkg001_Z01.log
Starting Image Augmentation
z_stack =

   1   5

Image importer loading ... /home/local/UNIMAAS/k.knoops/mito_testsample/testset/
fid = 10
fid = 10
Traceback (most recent call last):
  File "/opt/scisoft/cdeep3m/scripts/functions/crop_png.py", line 13, in <module>
    import cv2
ImportError: No module named cv2
ans = 1
error: imread: unable to find file /home/local/UNIMAAS/k.knoops/mito_testsample/testset/temp/images.081.png
error: called from
    imageIO at line 71 column 7
    imread at line 106 column 30
    imageimporter_large at line 107 column 22
    /opt/scisoft/cdeep3m/PreprocessPackage.m at line 62 column 8
Command exited with non-zero status 1
real 3.54
user 0.29
sys 0.08

damsport11 commented 5 years ago

Hi Chris, the Python package is installed and cdeep3m now runs beautifully on my own data. Very curious about the results! Thanks again for the help and for sharing the software ;-) Kèvin

damsport11 commented 5 years ago

So the prediction has been running overnight, and the 1fm and 3fm models run very smoothly; I can clearly see the improvement of the 3fm over the 1fm. The 5fm, however, always exits with "out of memory" (see below for the full error). This workstation has 4 GeForce GTX 1080 cards with 11 GB each, so I guess there is another problem... I also encountered this problem during training, but used the "--iter_size 6" option to avoid running out of memory. Is there a similar option for prediction?

---------- segmenting for /mnt/ssd/k.knoops/cdeep/predict_mito1_small_aug1/augimages/5fm/Pkg001_Z01/image_stacks_v1.h5 ----------
F0925 10:14:24.810606 953 syncedmem.cpp:57] Check failed: error == cudaSuccess (2 vs. 0) out of memory
Check failure stack trace:
    @ 0x7feda48d35cd google::LogMessage::Fail()
    @ 0x7feda48d5433 google::LogMessage::SendToLog()
    @ 0x7feda48d315b google::LogMessage::Flush()
    @ 0x7feda48d5e1e google::LogMessageFatal::~LogMessageFatal()
    @ 0x7feda4cbb6d0 caffe::SyncedMemory::to_gpu()
    @ 0x7feda4cba369 caffe::SyncedMemory::mutable_gpu_data()
    @ 0x7feda4e059c2 caffe::Blob<>::mutable_gpu_data()
    @ 0x7feda4d95028 caffe::BaseConvolutionLayer<>::forward_gpu_gemm()
    @ 0x7feda4ea7f1c caffe::ConvolutionLayer<>::Forward_gpu()
    @ 0x7feda4e4e2b2 caffe::Net<>::ForwardFromTo()
    @ 0x7feda4e4e406 caffe::Net<>::ForwardPrefilled()
    @ 0x406739 Segmentor::Segment()
    @ 0x404c03 main
    @ 0x7feda3299830 __libc_start_main
    @ 0x404de9 _start
    @ (nil) (unknown)
Command terminated by signal 6
real 62.38
user 42.80
sys 13.64

haberlmatt commented 5 years ago

Hi, yes, we typically use GPUs with at least 12 GB of memory; with CloudFormation we use K80s or V100s. You could try lowering the number of images per block: in /scripts/functions/break_large_img.m, replace the number 100 (in lines 9 and 10) with 50 or 10 and see if that gets you around the problem.
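The edit described above can also be scripted. The following is a minimal sketch under stated assumptions: shrink_block_size is a hypothetical helper, CDEEP3M_DIR in the example is a placeholder for the install directory, and the line numbers are those given in the comment, which may differ between cdeep3m versions. It is worth inspecting the two lines first (for example with sed -n '9,10p') before changing them.

```shell
# Sketch: replace 100 with a smaller value on lines 9 and 10 of a file,
# keeping a .bak copy of the original. shrink_block_size is a hypothetical
# helper; the line numbers follow the comment above and may not match
# every version of the file.
shrink_block_size() {
    local file="$1" new_size="$2"
    # Accept digits only, so nothing unexpected ends up in the sed script.
    case "$new_size" in
        ''|*[!0-9]*) echo "size must be a number" >&2; return 1 ;;
    esac
    # Note: this also rewrites 100 inside longer numbers on those lines.
    sed -i.bak -e "9,10s/100/${new_size}/g" "$file"
}

# Example (CDEEP3M_DIR is a placeholder for the install directory):
#   shrink_block_size "$CDEEP3M_DIR/scripts/functions/break_large_img.m" 50
```

The .bak copy makes it easy to restore the original value if the smaller block size turns out to be unnecessary.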

damsport11 commented 5 years ago

Works beautifully ;-) Thanks again

haberlmatt commented 5 years ago

Great! I'll close the ticket -M