CRBS / cdeep3m

Please go to https://github.com/CRBS/cdeep3m2 for most recent version

caffepredict doesn't parallelize properly #66

Closed MatthewBM closed 5 years ago

MatthewBM commented 5 years ago

Hi @coleslaw481,

I'm hoping you could take a minute to look at this issue, specifically this line from caffepredict.sh:

https://github.com/CRBS/cdeep3m/blob/a80cc773de556652560f1e0c7d9584455634548a/caffepredict.sh#L148

The parallel_job file looks good; the augments are assigned to the GPUs in sequential order:

[mmadany@comet-ln2 Pkg011_Z02]$ cat ../Pkg001_Z01/parallel.jobs
/oasis/scratch/comet/mmadany/temp_project/Keun/CDeep3M-BatchJob-7Hnfq9gUomgr7EIqcLL56ZUHtuyj7TXG/Partition-3-cdeep3m-output-monitor-rd2/3fm/Pkg001_Z01/log
/oasis/scratch/comet/mmadany/temp_project/Keun/msog_perf_sixset8kadd_trnet/3fm/trainedmodel/..
/oasis/scratch/comet/mmadany/temp_project/Keun/msog_perf_sixset8kadd_trnet/3fm/trainedmodel/3fm_classifer_iter_40000.caffemodel
/oasis/scratch/comet/mmadany/temp_project/Keun/CDeep3M-BatchJob-7Hnfq9gUomgr7EIqcLL56ZUHtuyj7TXG/Partition-3-cdeep3m-output-monitor-rd2/augimages/3fm/Pkg001_Z01/image_stacks_v1.h5
/oasis/scratch/comet/mmadany/temp_project/Keun/CDeep3M-BatchJob-7Hnfq9gUomgr7EIqcLL56ZUHtuyj7TXG/Partition-3-cdeep3m-output-monitor-rd2/3fm/Pkg001_Z01/v1
0
/oasis/scratch/comet/mmadany/temp_project/Keun/CDeep3M-BatchJob-7Hnfq9gUomgr7EIqcLL56ZUHtuyj7TXG/Partition-3-cdeep3m-output-monitor-rd2/3fm/Pkg001_Z01/log
/oasis/scratch/comet/mmadany/temp_project/Keun/msog_perf_sixset8kadd_trnet/3fm/trainedmodel/..
/oasis/scratch/comet/mmadany/temp_project/Keun/msog_perf_sixset8kadd_trnet/3fm/trainedmodel/3fm_classifer_iter_40000.caffemodel
/oasis/scratch/comet/mmadany/temp_project/Keun/CDeep3M-BatchJob-7Hnfq9gUomgr7EIqcLL56ZUHtuyj7TXG/Partition-3-cdeep3m-output-monitor-rd2/augimages/3fm/Pkg001_Z01/image_stacks_v2.h5
/oasis/scratch/comet/mmadany/temp_project/Keun/CDeep3M-BatchJob-7Hnfq9gUomgr7EIqcLL56ZUHtuyj7TXG/Partition-3-cdeep3m-output-monitor-rd2/3fm/Pkg001_Z01/v2
1
/oasis/scratch/comet/mmadany/temp_project/Keun/CDeep3M-BatchJob-7Hnfq9gUomgr7EIqcLL56ZUHtuyj7TXG/Partition-3-cdeep3m-output-monitor-rd2/3fm/Pkg001_Z01/log
/oasis/scratch/comet/mmadany/temp_project/Keun/msog_perf_sixset8kadd_trnet/3fm/trainedmodel/..
/oasis/scratch/comet/mmadany/temp_project/Keun/msog_perf_sixset8kadd_trnet/3fm/trainedmodel/3fm_classifer_iter_40000.caffemodel
/oasis/scratch/comet/mmadany/temp_project/Keun/CDeep3M-BatchJob-7Hnfq9gUomgr7EIqcLL56ZUHtuyj7TXG/Partition-3-cdeep3m-output-monitor-rd2/augimages/3fm/Pkg001_Z01/image_stacks_v3.h5
/oasis/scratch/comet/mmadany/temp_project/Keun/CDeep3M-BatchJob-7Hnfq9gUomgr7EIqcLL56ZUHtuyj7TXG/Partition-3-cdeep3m-output-monitor-rd2/3fm/Pkg001_Z01/v3
2
/oasis/scratch/comet/mmadany/temp_project/Keun/CDeep3M-BatchJob-7Hnfq9gUomgr7EIqcLL56ZUHtuyj7TXG/Partition-3-cdeep3m-output-monitor-rd2/3fm/Pkg001_Z01/log
/oasis/scratch/comet/mmadany/temp_project/Keun/msog_perf_sixset8kadd_trnet/3fm/trainedmodel/..
/oasis/scratch/comet/mmadany/temp_project/Keun/msog_perf_sixset8kadd_trnet/3fm/trainedmodel/3fm_classifer_iter_40000.caffemodel
/oasis/scratch/comet/mmadany/temp_project/Keun/CDeep3M-BatchJob-7Hnfq9gUomgr7EIqcLL56ZUHtuyj7TXG/Partition-3-cdeep3m-output-monitor-rd2/augimages/3fm/Pkg001_Z01/image_stacks_v4.h5
/oasis/scratch/comet/mmadany/temp_project/Keun/CDeep3M-BatchJob-7Hnfq9gUomgr7EIqcLL56ZUHtuyj7TXG/Partition-3-cdeep3m-output-monitor-rd2/3fm/Pkg001_Z01/v4
3
/oasis/scratch/comet/mmadany/temp_project/Keun/CDeep3M-BatchJob-7Hnfq9gUomgr7EIqcLL56ZUHtuyj7TXG/Partition-3-cdeep3m-output-monitor-rd2/3fm/Pkg001_Z01/log
/oasis/scratch/comet/mmadany/temp_project/Keun/msog_perf_sixset8kadd_trnet/3fm/trainedmodel/..
/oasis/scratch/comet/mmadany/temp_project/Keun/msog_perf_sixset8kadd_trnet/3fm/trainedmodel/3fm_classifer_iter_40000.caffemodel
/oasis/scratch/comet/mmadany/temp_project/Keun/CDeep3M-BatchJob-7Hnfq9gUomgr7EIqcLL56ZUHtuyj7TXG/Partition-3-cdeep3m-output-monitor-rd2/augimages/3fm/Pkg001_Z01/image_stacks_v13.h5
/oasis/scratch/comet/mmadany/temp_project/Keun/CDeep3M-BatchJob-7Hnfq9gUomgr7EIqcLL56ZUHtuyj7TXG/Partition-3-cdeep3m-output-monitor-rd2/3fm/Pkg001_Z01/v13
0
/oasis/scratch/comet/mmadany/temp_project/Keun/CDeep3M-BatchJob-7Hnfq9gUomgr7EIqcLL56ZUHtuyj7TXG/Partition-3-cdeep3m-output-monitor-rd2/3fm/Pkg001_Z01/log
/oasis/scratch/comet/mmadany/temp_project/Keun/msog_perf_sixset8kadd_trnet/3fm/trainedmodel/..
/oasis/scratch/comet/mmadany/temp_project/Keun/msog_perf_sixset8kadd_trnet/3fm/trainedmodel/3fm_classifer_iter_40000.caffemodel
/oasis/scratch/comet/mmadany/temp_project/Keun/CDeep3M-BatchJob-7Hnfq9gUomgr7EIqcLL56ZUHtuyj7TXG/Partition-3-cdeep3m-output-monitor-rd2/augimages/3fm/Pkg001_Z01/image_stacks_v14.h5
/oasis/scratch/comet/mmadany/temp_project/Keun/CDeep3M-BatchJob-7Hnfq9gUomgr7EIqcLL56ZUHtuyj7TXG/Partition-3-cdeep3m-output-monitor-rd2/3fm/Pkg001_Z01/v14
1
/oasis/scratch/comet/mmadany/temp_project/Keun/CDeep3M-BatchJob-7Hnfq9gUomgr7EIqcLL56ZUHtuyj7TXG/Partition-3-cdeep3m-output-monitor-rd2/3fm/Pkg001_Z01/log
/oasis/scratch/comet/mmadany/temp_project/Keun/msog_perf_sixset8kadd_trnet/3fm/trainedmodel/..
/oasis/scratch/comet/mmadany/temp_project/Keun/msog_perf_sixset8kadd_trnet/3fm/trainedmodel/3fm_classifer_iter_40000.caffemodel
/oasis/scratch/comet/mmadany/temp_project/Keun/CDeep3M-BatchJob-7Hnfq9gUomgr7EIqcLL56ZUHtuyj7TXG/Partition-3-cdeep3m-output-monitor-rd2/augimages/3fm/Pkg001_Z01/image_stacks_v15.h5
/oasis/scratch/comet/mmadany/temp_project/Keun/CDeep3M-BatchJob-7Hnfq9gUomgr7EIqcLL56ZUHtuyj7TXG/Partition-3-cdeep3m-output-monitor-rd2/3fm/Pkg001_Z01/v15
2
/oasis/scratch/comet/mmadany/temp_project/Keun/CDeep3M-BatchJob-7Hnfq9gUomgr7EIqcLL56ZUHtuyj7TXG/Partition-3-cdeep3m-output-monitor-rd2/3fm/Pkg001_Z01/log
/oasis/scratch/comet/mmadany/temp_project/Keun/msog_perf_sixset8kadd_trnet/3fm/trainedmodel/..
/oasis/scratch/comet/mmadany/temp_project/Keun/msog_perf_sixset8kadd_trnet/3fm/trainedmodel/3fm_classifer_iter_40000.caffemodel
/oasis/scratch/comet/mmadany/temp_project/Keun/CDeep3M-BatchJob-7Hnfq9gUomgr7EIqcLL56ZUHtuyj7TXG/Partition-3-cdeep3m-output-monitor-rd2/augimages/3fm/Pkg001_Z01/image_stacks_v16.h5
/oasis/scratch/comet/mmadany/temp_project/Keun/CDeep3M-BatchJob-7Hnfq9gUomgr7EIqcLL56ZUHtuyj7TXG/Partition-3-cdeep3m-output-monitor-rd2/3fm/Pkg001_Z01/v16

However, here is what the out.log from the same package says:

[mmadany@comet-ln2 Pkg011_Z02]$ cat ../Pkg001_Z01/out.log
Creating directory /oasis/scratch/comet/mmadany/temp_project/Keun/CDeep3M-BatchJob-7Hnfq9gUomgr7EIqcLL56ZUHtuyj7TXG/Partition-3-cdeep3m-output-monitor-rd2/3fm/Pkg001_Z01/v1
Creating directory /oasis/scratch/comet/mmadany/temp_project/Keun/CDeep3M-BatchJob-7Hnfq9gUomgr7EIqcLL56ZUHtuyj7TXG/Partition-3-cdeep3m-output-monitor-rd2/3fm/Pkg001_Z01/v2
Creating directory /oasis/scratch/comet/mmadany/temp_project/Keun/CDeep3M-BatchJob-7Hnfq9gUomgr7EIqcLL56ZUHtuyj7TXG/Partition-3-cdeep3m-output-monitor-rd2/3fm/Pkg001_Z01/v3
Creating directory /oasis/scratch/comet/mmadany/temp_project/Keun/CDeep3M-BatchJob-7Hnfq9gUomgr7EIqcLL56ZUHtuyj7TXG/Partition-3-cdeep3m-output-monitor-rd2/3fm/Pkg001_Z01/v4
Creating directory /oasis/scratch/comet/mmadany/temp_project/Keun/CDeep3M-BatchJob-7Hnfq9gUomgr7EIqcLL56ZUHtuyj7TXG/Partition-3-cdeep3m-output-monitor-rd2/3fm/Pkg001_Z01/v13
Creating directory /oasis/scratch/comet/mmadany/temp_project/Keun/CDeep3M-BatchJob-7Hnfq9gUomgr7EIqcLL56ZUHtuyj7TXG/Partition-3-cdeep3m-output-monitor-rd2/3fm/Pkg001_Z01/v14
Creating directory /oasis/scratch/comet/mmadany/temp_project/Keun/CDeep3M-BatchJob-7Hnfq9gUomgr7EIqcLL56ZUHtuyj7TXG/Partition-3-cdeep3m-output-monitor-rd2/3fm/Pkg001_Z01/v15
Creating directory /oasis/scratch/comet/mmadany/temp_project/Keun/CDeep3M-BatchJob-7Hnfq9gUomgr7EIqcLL56ZUHtuyj7TXG/Partition-3-cdeep3m-output-monitor-rd2/3fm/Pkg001_Z01/v16
---------- segmenting for /oasis/scratch/comet/mmadany/temp_project/Keun/CDeep3M-BatchJob-7Hnfq9gUomgr7EIqcLL56ZUHtuyj7TXG/Partition-3-cdeep3m-output-monitor-rd2/augimages/3fm/Pkg001_Z01/image_stacks_v1.h5 ----------
real 65.17
user 38.56
sys 23.33
---------- segmenting for /oasis/scratch/comet/mmadany/temp_project/Keun/CDeep3M-BatchJob-7Hnfq9gUomgr7EIqcLL56ZUHtuyj7TXG/Partition-3-cdeep3m-output-monitor-rd2/augimages/3fm/Pkg001_Z01/image_stacks_v2.h5 ----------
real 66.35
user 39.61
sys 23.48
---------- segmenting for /oasis/scratch/comet/mmadany/temp_project/Keun/CDeep3M-BatchJob-7Hnfq9gUomgr7EIqcLL56ZUHtuyj7TXG/Partition-3-cdeep3m-output-monitor-rd2/augimages/3fm/Pkg001_Z01/image_stacks_v4.h5 ----------
real 62.71
user 37.27
sys 22.22
---------- segmenting for /oasis/scratch/comet/mmadany/temp_project/Keun/CDeep3M-BatchJob-7Hnfq9gUomgr7EIqcLL56ZUHtuyj7TXG/Partition-3-cdeep3m-output-monitor-rd2/augimages/3fm/Pkg001_Z01/image_stacks_v3.h5 ----------
real 65.51
user 39.47
sys 22.30
---------- segmenting for /oasis/scratch/comet/mmadany/temp_project/Keun/CDeep3M-BatchJob-7Hnfq9gUomgr7EIqcLL56ZUHtuyj7TXG/Partition-3-cdeep3m-output-monitor-rd2/augimages/3fm/Pkg001_Z01/image_stacks_v13.h5 ----------
real 64.48
user 38.71
sys 22.39
---------- segmenting for /oasis/scratch/comet/mmadany/temp_project/Keun/CDeep3M-BatchJob-7Hnfq9gUomgr7EIqcLL56ZUHtuyj7TXG/Partition-3-cdeep3m-output-monitor-rd2/augimages/3fm/Pkg001_Z01/image_stacks_v14.h5 ----------
real 62.33
user 36.78
sys 22.10
---------- segmenting for /oasis/scratch/comet/mmadany/temp_project/Keun/CDeep3M-BatchJob-7Hnfq9gUomgr7EIqcLL56ZUHtuyj7TXG/Partition-3-cdeep3m-output-monitor-rd2/augimages/3fm/Pkg001_Z01/image_stacks_v15.h5 ----------
real 65.75
user 40.46
sys 21.97
---------- segmenting for /oasis/scratch/comet/mmadany/temp_project/Keun/CDeep3M-BatchJob-7Hnfq9gUomgr7EIqcLL56ZUHtuyj7TXG/Partition-3-cdeep3m-output-monitor-rd2/augimages/3fm/Pkg001_Z01/image_stacks_v16.h5 ----------
real 63.41
user 38.14
sys 22.15

caffepredict should have passed v1, v2, v3, v4, v13, v14, v15, v16, but instead it swapped v3 and v4. This random swapping occurs in every package I've looked at. CDeep3M runs fine until the order happens to place two commands for the same GPU back to back, instead of spacing them out by $max_gpu_count. At that point two jobs are assigned to the same GPU within 2 seconds and it runs out of memory, as with this package:

[mmadany@comet-ln2 3fm]$ cat Pkg011_Z02/out.log
Creating directory /oasis/scratch/comet/mmadany/temp_project/Keun/CDeep3M-BatchJob-7Hnfq9gUomgr7EIqcLL56ZUHtuyj7TXG/Partition-3-cdeep3m-output-monitor-rd2/3fm/Pkg011_Z02/v1
Creating directory /oasis/scratch/comet/mmadany/temp_project/Keun/CDeep3M-BatchJob-7Hnfq9gUomgr7EIqcLL56ZUHtuyj7TXG/Partition-3-cdeep3m-output-monitor-rd2/3fm/Pkg011_Z02/v2
Creating directory /oasis/scratch/comet/mmadany/temp_project/Keun/CDeep3M-BatchJob-7Hnfq9gUomgr7EIqcLL56ZUHtuyj7TXG/Partition-3-cdeep3m-output-monitor-rd2/3fm/Pkg011_Z02/v3
Creating directory /oasis/scratch/comet/mmadany/temp_project/Keun/CDeep3M-BatchJob-7Hnfq9gUomgr7EIqcLL56ZUHtuyj7TXG/Partition-3-cdeep3m-output-monitor-rd2/3fm/Pkg011_Z02/v4
Creating directory /oasis/scratch/comet/mmadany/temp_project/Keun/CDeep3M-BatchJob-7Hnfq9gUomgr7EIqcLL56ZUHtuyj7TXG/Partition-3-cdeep3m-output-monitor-rd2/3fm/Pkg011_Z02/v13
Creating directory /oasis/scratch/comet/mmadany/temp_project/Keun/CDeep3M-BatchJob-7Hnfq9gUomgr7EIqcLL56ZUHtuyj7TXG/Partition-3-cdeep3m-output-monitor-rd2/3fm/Pkg011_Z02/v14
Creating directory /oasis/scratch/comet/mmadany/temp_project/Keun/CDeep3M-BatchJob-7Hnfq9gUomgr7EIqcLL56ZUHtuyj7TXG/Partition-3-cdeep3m-output-monitor-rd2/3fm/Pkg011_Z02/v15
Creating directory /oasis/scratch/comet/mmadany/temp_project/Keun/CDeep3M-BatchJob-7Hnfq9gUomgr7EIqcLL56ZUHtuyj7TXG/Partition-3-cdeep3m-output-monitor-rd2/3fm/Pkg011_Z02/v16
---------- segmenting for /oasis/scratch/comet/mmadany/temp_project/Keun/CDeep3M-BatchJob-7Hnfq9gUomgr7EIqcLL56ZUHtuyj7TXG/Partition-3-cdeep3m-output-monitor-rd2/augimages/3fm/Pkg011_Z02/image_stacks_v1.h5 ----------
real 48.46
user 22.20
sys 16.01
---------- segmenting for /oasis/scratch/comet/mmadany/temp_project/Keun/CDeep3M-BatchJob-7Hnfq9gUomgr7EIqcLL56ZUHtuyj7TXG/Partition-3-cdeep3m-output-monitor-rd2/augimages/3fm/Pkg011_Z02/image_stacks_v4.h5 ----------
real 46.08
user 22.42
sys 15.58
---------- segmenting for /oasis/scratch/comet/mmadany/temp_project/Keun/CDeep3M-BatchJob-7Hnfq9gUomgr7EIqcLL56ZUHtuyj7TXG/Partition-3-cdeep3m-output-monitor-rd2/augimages/3fm/Pkg011_Z02/image_stacks_v2.h5 ----------
F0514 10:20:30.422610 127717 syncedmem.cpp:57] Check failed: error == cudaSuccess (2 vs. 0) out of memory
Check failure stack trace:
@ 0x2aad3d6d95cd google::LogMessage::Fail()
@ 0x2aad3d6db433 google::LogMessage::SendToLog()
@ 0x2aad3d6d915b google::LogMessage::Flush()
@ 0x2aad3d6dbe1e google::LogMessageFatal::~LogMessageFatal()
@ 0x2aad3ceb0170 caffe::SyncedMemory::to_gpu()
@ 0x2aad3ceaee09 caffe::SyncedMemory::mutable_gpu_data()
@ 0x2aad3cd10ef2 caffe::Blob<>::mutable_gpu_data()
@ 0x2aad3cef0b1d caffe::ConvolutionLayer<>::Forward_gpu()
@ 0x2aad3ce5c1d2 caffe::Net<>::ForwardFromTo()
@ 0x2aad3ce5c326 caffe::Net<>::ForwardPrefilled()
@ 0x406739 Segmentor::Segment()
@ 0x404c03 main
@ 0x2aad3eba4830 __libc_start_main
@ 0x404de9 _start
@ (nil) (unknown)
Command terminated by signal 6
real 53.88
user 21.21
sys 15.12
---------- segmenting for /oasis/scratch/comet/mmadany/temp_project/Keun/CDeep3M-BatchJob-7Hnfq9gUomgr7EIqcLL56ZUHtuyj7TXG/Partition-3-cdeep3m-output-monitor-rd2/augimages/3fm/Pkg011_Z02/image_stacks_v3.h5 ----------
real 51.72
user 21.77
sys 15.65
---------- segmenting for /oasis/scratch/comet/mmadany/temp_project/Keun/CDeep3M-BatchJob-7Hnfq9gUomgr7EIqcLL56ZUHtuyj7TXG/Partition-3-cdeep3m-output-monitor-rd2/augimages/3fm/Pkg011_Z02/image_stacks_v14.h5 ----------
F0514 10:20:30.323266 127958 syncedmem.cpp:57] Check failed: error == cudaSuccess (2 vs. 0) out of memory
Check failure stack trace:
@ 0x2b86cbecb5cd google::LogMessage::Fail()
@ 0x2b86cbecd433 google::LogMessage::SendToLog()
@ 0x2b86cbecb15b google::LogMessage::Flush()
@ 0x2b86cbecde1e google::LogMessageFatal::~LogMessageFatal()
@ 0x2b86cb6a2170 caffe::SyncedMemory::to_gpu()
@ 0x2b86cb6a0e09 caffe::SyncedMemory::mutable_gpu_data()
@ 0x2b86cb502ef2 caffe::Blob<>::mutable_gpu_data()
@ 0x2b86cb54dd68 caffe::BaseConvolutionLayer<>::forward_gpu_gemm()
@ 0x2b86cb6e2b6c caffe::ConvolutionLayer<>::Forward_gpu()
@ 0x2b86cb64e1d2 caffe::Net<>::ForwardFromTo()
@ 0x2b86cb64e326 caffe::Net<>::ForwardPrefilled()
@ 0x406739 Segmentor::Segment()
@ 0x404c03 main
@ 0x2b86cd396830 __libc_start_main
@ 0x404de9 _start
@ (nil) (unknown)
Command terminated by signal 6
real 2.38
user 1.12
sys 1.12
---------- segmenting for /oasis/scratch/comet/mmadany/temp_project/Keun/CDeep3M-BatchJob-7Hnfq9gUomgr7EIqcLL56ZUHtuyj7TXG/Partition-3-cdeep3m-output-monitor-rd2/augimages/3fm/Pkg011_Z02/image_stacks_v13.h5 ----------
real 51.64
user 21.07
sys 15.95
---------- segmenting for /oasis/scratch/comet/mmadany/temp_project/Keun/CDeep3M-BatchJob-7Hnfq9gUomgr7EIqcLL56ZUHtuyj7TXG/Partition-3-cdeep3m-output-monitor-rd2/augimages/3fm/Pkg011_Z02/image_stacks_v16.h5 ----------
real 46.93
user 22.07
sys 15.54
---------- segmenting for /oasis/scratch/comet/mmadany/temp_project/Keun/CDeep3M-BatchJob-7Hnfq9gUomgr7EIqcLL56ZUHtuyj7TXG/Partition-3-cdeep3m-output-monitor-rd2/augimages/3fm/Pkg011_Z02/image_stacks_v15.h5 ----------
real 50.40
user 21.54
sys 15.97

In this case, v2 and v14 are passed to predict_seg_new.bin on the same GPU within that 2-second interval. I was hoping there's just a single argument in parallel that needs to be changed or added.
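The collision described above can be reproduced with a small simulation. This is only a sketch under stated assumptions, not the logic of caffepredict.sh: it assumes each job's GPU is fixed in advance as the job index modulo the GPU count, that a new job starts as soon as any of the worker slots frees up, and that starts are at least `delay` seconds apart. The function name `simulate` is hypothetical. The first four durations are the `real` times of v1 to v4 from the Pkg011_Z02 log (for v2 that is the time until it crashed); the last two are made-up values.

```python
import heapq


def simulate(durations, gpu_count, delay=2.0):
    """Return (earlier_job, later_job, gpu) for every pair that overlaps on one GPU.

    Assumed model: `gpu_count` worker slots, a job starts when any slot is
    free, and job i always uses GPU i % gpu_count regardless of which slot
    became free.
    """
    running = []          # min-heap of (end_time, job) for occupied slots
    gpu_last = {}         # gpu -> (end_time, job) of the last job sent to it
    collisions = []
    last_start = -delay
    for job, duration in enumerate(durations):
        start = last_start + delay
        if len(running) >= gpu_count:
            # All slots are busy: wait for whichever job finishes first.
            freed_at, _ = heapq.heappop(running)
            start = max(start, freed_at)
        gpu = job % gpu_count  # precomputed GPU, independent of the freed slot
        previous = gpu_last.get(gpu)
        if previous is not None and previous[0] > start:
            collisions.append((previous[1], job, gpu))
        end = start + duration
        heapq.heappush(running, (end, job))
        gpu_last[gpu] = (end, job)
        last_start = start
    return collisions


# Jobs 0..5 stand for v1, v2, v3, v4, v13, v14.
pkg011 = [48.46, 53.88, 51.72, 46.08, 50.0, 50.0]
print(simulate(pkg011, gpu_count=4))
```

Under this model the sixth job (v14) is launched when v4's slot frees up, but its precomputed GPU is still occupied by v2. With equal durations the jobs finish in launch order and no overlap occurs, which would be consistent with the problem only showing up in some packages.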

redistributer commented 5 years ago

I have been working on this issue and I've initiated a pull request that should fix it.
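The contents of that pull request are not shown in this thread. Purely as an illustration of one way to avoid the collision, and not necessarily what the pull request does, the GPU can be taken from a pool of idle GPUs at the moment a job starts, rather than being written into the job list in advance. The sketch below uses Python threads as stand-ins for the prediction processes; `run_jobs` and its parameters are hypothetical names. On the command line, GNU parallel's `{%}` replacement string (the job slot number) could serve a similar purpose.

```python
import queue
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def run_jobs(durations, gpu_count):
    """Run one job per duration; each job borrows an idle GPU id while it runs.

    Returns (assignments, collisions): the (job, gpu) pairs in start order and
    any (job, gpu) pair that started on a GPU already in use.
    """
    free_gpus = queue.Queue()
    for gpu in range(gpu_count):
        free_gpus.put(gpu)

    lock = threading.Lock()
    in_use = set()
    assignments = []
    collisions = []

    def worker(job, duration):
        gpu = free_gpus.get()  # blocks until some GPU is idle
        try:
            with lock:
                if gpu in in_use:
                    collisions.append((job, gpu))
                in_use.add(gpu)
                assignments.append((job, gpu))
            time.sleep(duration)  # stand-in for the real prediction command
        finally:
            with lock:
                in_use.discard(gpu)
            free_gpus.put(gpu)  # hand the GPU back for the next job

    # Leaving the with-block waits for all submitted jobs to finish.
    with ThreadPoolExecutor(max_workers=gpu_count) as pool:
        for job, duration in enumerate(durations):
            pool.submit(worker, job, duration)
    return assignments, collisions


assignments, collisions = run_jobs([0.05, 0.02, 0.04, 0.01, 0.03, 0.03], gpu_count=2)
print(assignments, collisions)
```

Because a GPU id is only handed out after the previous holder returns it, two jobs cannot share a GPU no matter in which order they finish.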