CRBS / cdeep3m

Please go to https://github.com/CRBS/cdeep3m2 for most recent version

Error when running runprediction.sh on a local build. Any help is appreciated! #71

Closed camcondylis closed 5 years ago

camcondylis commented 5 years ago

runprediction.sh --augspeed 10 --gpu 0 /projectnb/jchenlab/CDeep3M_Segmentation/CRACK/Hackathon_models/647/model_cc030/model /projectnb/jchenlab/CDeep3M_Segmentation/CRACK/Hackathon_models/647/prediction_input/cc030/round1/ /projectnb/jchenlab/CDeep3M_Segmentation/CRACK/Hackathon_models/647/prediction_output/cc030/round1/
Starting Image Augmentation
Check image size of: /projectnb/jchenlab/CDeep3M_Segmentation/CRACK/Hackathon_models/647/prediction_input/cc030/round1/
Reading file: /projectnb/jchenlab/CDeep3M_Segmentation/CRACK/Hackathon_models/647/prediction_input/cc030/round1/C3 - round10000.tif
z_blocks =

1   52

Start up worker to generate packages to process
Start up worker to run prediction on packages
Start up worker to run post processing on packages

To see progress run the following command in another window:

tail -f /projectnb/jchenlab/CDeep3M_Segmentation/CRACK/Hackathon_models/647/prediction_output/cc030/round1//logs/*.log
error: 'fileformats' undefined near line 13 column 30
error: called from
    filter_files at line 13 column 23
    /home/cdeep3m/EnsemblePredictions.m at line 35 column 12
error: evaluating argument list element number 1
error: called from
    filter_files at line 13 column 23
    /home/cdeep3m/EnsemblePredictions.m at line 35 column 12
ERROR, a non-zero exit code (1) was received from: EnsemblePredictions.m /projectnb/jchenlab/CDeep3M_Segmentation/CRACK/Hackathon_models/647/prediction_output/cc030/round1//1fm /projectnb/jchenlab/CDeep3M_Segmentation/CRACK/Hackathon_models/647/prediction_output/cc030/round1//3fm /projectnb/jchenlab/CDeep3M_Segmentation/CRACK/Hackathon_models/647/prediction_output/cc030/round1//5fm /projectnb/jchenlab/CDeep3M_Segmentation/CRACK/Hackathon_models/647/prediction_output/cc030/round1//ensembled
cjcondy@scc-c11:~$

In prediction.log:

Running Prediction

Trained Model Dir: /projectnb/jchenlab/CDeep3M_Segmentation/CRACK/Hackathon_models/647/model_cc030/model
Image Dir: /projectnb/jchenlab/CDeep3M_Segmentation/CRACK/Hackathon_models/647/prediction_input/cc030/round1/
Models: 1fm,3fm,5fm
Speed: 10
GPU: 0

For model 1fm preprocessing Pkg001_Z01 1 of 1
Running prediction on 1fm Pkg001_Z01
Detected 2 GPU(s). Using only GPU 0
ERROR non-zero exit code (1) from running predict_seg_new.bin
Command exited with non-zero status 6
real 0.21
user 0.08
sys 0.12
ERROR, a non-zero exit code (6) was received from: caffepredict.sh

MatthewBM commented 5 years ago

Hi @camcondylis,

What are the two GPUs in that machine?

camcondylis commented 5 years ago

@MatthewBM They are V100 GPUs. I've also tried running without specifying a GPU (removing [--gpu 0]), but I get the same error.

MatthewBM commented 5 years ago

Inside the output directory there should be a file 1fm/Pkg001_Z01/out.log. Can you run the cat command on that file and paste the output here? Or open the file in a text editor and send us the contents.

camcondylis commented 5 years ago

Contents of 1fm/Pkg001_Z01/out.log:

Creating directory /projectnb/jchenlab/CDeep3M_Segmentation/CRACK/Hackathon_models/647/prediction_output/cc030/round1//1fm/Pkg001_Z01/v1
predict_seg_new.bin: error while loading shared libraries: libcudart.so.9.0: cannot open shared object file: No such file or directory
Command exited with non-zero status 127
real 0.00
user 0.00
sys 0.00
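The "cannot open shared object file" message comes from the dynamic loader, so one way to see every library the binary fails to resolve is to filter the output of `ldd`. The following is a minimal sketch, assuming `predict_seg_new.bin` is reachable through `PATH` (the log above invokes it by bare name, but that is an assumption); `unresolved_libs` is a hypothetical helper name, not part of CDeep3M.

```shell
# Print the names of shared libraries that `ldd` reports as "not found".
# Reads `ldd` output on stdin; the helper name is hypothetical.
unresolved_libs() {
    awk '/=> not found/ { print $1 }'
}

# Assumption: predict_seg_new.bin is on PATH. If not, set BIN to its full path.
BIN="$(command -v predict_seg_new.bin || true)"
if [ -n "$BIN" ]; then
    ldd "$BIN" | unresolved_libs
else
    echo "predict_seg_new.bin not found on PATH" >&2
fi
```

If the only name printed is libcudart.so.9.0, the problem is limited to the CUDA 9.0 runtime library not being visible to the loader.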

MatthewBM commented 5 years ago

Did you install CUDA before trying this? That requirement is in the installation documents. We recommend CUDA 9, but versions above 7.5 are supported.

You can check the CUDA version with this command: nvcc --version

If you do have it installed, try this before running CDeep3M:

export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH

Otherwise you'll need to install the CUDA toolkit, which provides libcudart.so.9.0.
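To confirm whether the two exports above actually make the missing library visible, the directories on `LD_LIBRARY_PATH` can be checked one by one. This is only a sketch: `find_lib_in_path` is a hypothetical helper, and it looks only at the colon-separated list it is given, not at the system loader cache.

```shell
# Print the first DIR/LIBNAME that exists, given a colon-separated DIR list.
# Usage: find_lib_in_path LIBNAME PATHLIST   (helper name is hypothetical)
find_lib_in_path() {
    lib="$1"
    rest="$2:"
    while [ -n "$rest" ]; do
        dir="${rest%%:*}"      # first entry of the remaining list
        rest="${rest#*:}"      # drop that entry and its colon
        if [ -n "$dir" ] && [ -e "$dir/$lib" ]; then
            printf '%s\n' "$dir/$lib"
            return 0
        fi
    done
    return 1
}

find_lib_in_path libcudart.so.9.0 "${LD_LIBRARY_PATH:-}" \
    || echo "libcudart.so.9.0 is not in any LD_LIBRARY_PATH directory"
```

If the library is reported as missing even after the exports, /usr/local/cuda/lib64 probably does not hold CUDA 9.0 on that machine.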

camcondylis commented 5 years ago

Starting Image Augmentation
Check image size of: /projectnb/jchenlab/CDeep3M_Segmentation/CRACK/Hackathon_models/488/prediction_input/cc029/round5/
Reading file: /projectnb/jchenlab/CDeep3M_Segmentation/CRACK/Hackathon_models/488/prediction_input/cc029/round5/C2-round 5.0000.tif
z_blocks =

1   85

Start up worker to generate packages to process
Start up worker to run prediction on packages
Start up worker to run post processing on packages

To see progress run the following command in another window:

tail -f /projectnb/jchenlab/CDeep3M_Segmentation/CRACK/Hackathon_models/488/prediction_output/test_jn039_model_cc029/round5test//logs/*.log
/projectnb/jchenlab/CDeep3M_Segmentation/CRACK/Hackathon_models/488/prediction_output/test_jn039_model_cc029/round5test//1fm not a directory
Please use: EnsemblePredictions ./inputdir1 ./inputdir2 ./inputdir3 ./outputdir
ERROR file found. Something went wrong
ERROR, a non-zero exit code (1) received from PreprocessPackage.m 001 01 1fm 10 8

Does this error also suggest a CUDA issue? CUDA is installed, and I have used CDeep3M before. I am running this on a shared compute cluster that just underwent an operating system upgrade, and these errors have been occurring since then. To my knowledge CUDA is still installed. I added the two lines you suggested, without much luck.

MatthewBM commented 5 years ago

The cuda/bin and cuda/lib64 paths have to be added to the $PATH and $LD_LIBRARY_PATH variables, respectively. The shared compute cluster's system admins should know where these paths are and should be able to fix this problem, since it seems the cluster update caused the issue.

Let us know.
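While waiting on the admins, a search of the filesystem can show where (or whether) the CUDA 9.0 runtime library still exists after the upgrade. The sketch below assumes CUDA lives somewhere under /usr/local, which may well be wrong on a shared cluster; `find_lib_dir` is a hypothetical helper name.

```shell
# Print the directory holding the first file named LIBNAME under ROOT.
# Usage: find_lib_dir ROOT LIBNAME   (helper name is hypothetical)
find_lib_dir() {
    root="$1"
    lib="$2"
    match="$(find "$root" -name "$lib" 2>/dev/null | head -n 1)"
    [ -n "$match" ] || return 1
    dirname "$match"
}

# Assumption: the search root /usr/local; replace it with the cluster's CUDA prefix.
if libdir="$(find_lib_dir /usr/local libcudart.so.9.0)"; then
    export LD_LIBRARY_PATH="$libdir${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
    echo "Added $libdir to LD_LIBRARY_PATH"
else
    echo "libcudart.so.9.0 not found under /usr/local" >&2
fi
```

The export only affects the current shell session, so it has to be run in the same session as runprediction.sh or added to a login script.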

camcondylis commented 5 years ago

Thanks for your help. I'll keep you updated.