CRBS / cdeep3m

Please go to https://github.com/CRBS/cdeep3m2 for the most recent version.

Using the pretrained model at /sbem/mitochrondria/xy5.9nm40nmz/30000iterations_train_out failed #54

Closed. cakuba closed this issue 5 years ago.

cakuba commented 5 years ago

Hi, I ran into a problem while trying to reproduce your results locally in a Linux Docker container. I followed all the steps in the wiki just fine until this page: https://github.com/CRBS/cdeep3m/wiki/Tutorial-3-Run-CDeep3M.

I got stuck at step 5 with the following command:

runtraining.sh --additerations 20 --retrain ~/sbem/mitochrondria/xy5.9nm40nmz/30000iterations_train_out ~/augtrain ~/model

The command above failed because the pretrained model at sbem/mitochrondria/xy5.9nm40nmz/30000iterations_train_out/1fm/trainedmodel/1fm_classifer_iter_30000.solverstate refers, by default, to the model file "1fm_classifer_iter_30000.caffemodel" at /home/ubuntu/sbem/mitochrondria/xy5.9nm40nmz/30000iterations_train_out/1fm/trainedmodel/1fm_classifer_iter_30000.caffemodel, and that absolute path is written into the solverstate binary itself!
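
For reference, the path baked into the solverstate can be confirmed by dumping its printable strings (a quick check, assuming the standard strings and grep utilities are available):

# Print the .caffemodel path recorded inside the pretrained solverstate
strings ~/sbem/mitochrondria/xy5.9nm40nmz/30000iterations_train_out/1fm/trainedmodel/1fm_classifer_iter_30000.solverstate | grep caffemodel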

Since my installation of CDeep3M is not under /home/ubuntu, I'm wondering whether it is possible to update this default caffemodel path in the solverstate file when retraining from 30000iterations_train_out. Furthermore, if I train my own model, will the snapshot recorded in the solverstate file contain the absolute path rather than a relative one? This matters if someone else wants to use the pretrained model; otherwise they would have to reproduce exactly the same directory layout that I have locally.

Thanks for your help.

coleslaw481 commented 5 years ago

Hi, nice work getting cdeep3m running on Docker.

We ran into that same issue a while back. To correct it we ended up forking and modifying the custom version of caffe (fork located here: https://github.com/coleslaw481/caffe_nd_sense_segmentation) so that it looks for the .caffemodel file in the same directory as the .solverstate file.

The AWS CloudFormation template configuration aws/basic_cloudformation.json was updated to use the repo mentioned above, but we forgot to update the link in the README.md.

I just made a pull request to get the link in the README.md updated.

Try compiling and installing this version of caffe:

https://github.com/coleslaw481/caffe_nd_sense_segmentation
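
A rough build outline (assuming the usual Caffe prerequisites are installed; check the fork's README for the authoritative steps):

git clone https://github.com/coleslaw481/caffe_nd_sense_segmentation
cd caffe_nd_sense_segmentation
# Copy and edit the build config (CUDA, cuDNN and BLAS paths depend on your system)
cp Makefile.config.example Makefile.config
make all -j"$(nproc)"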

If you look at the last commit on the above repo you can see what was changed.

Thanks for catching this and let us know if it works or if you run into any other issues.

cakuba commented 5 years ago

Hi, Chris,

Thanks for your super-fast response! Do appreciate it.

Yes, we did notice the difference in the caffe version used by the AWS template, whose "user data" section records all the configuration commands run in the cloud. Honestly, this "coleslaw481" version of caffe confused us a lot, since we had no idea why the AWS deployment of CDeep3M used a different version from the official GitHub one. Anyway, good to know. There were two more issues we had to deal with when installing locally:

(1) /usr/bin/time was missing (see the note just below);
(2) the two update__*.m scripts under script/functions/ had to be modified to change the default path of the prototxt files.
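
On the first point, /usr/bin/time is provided by the separate GNU time package on Ubuntu, so installing it was enough for us:

# Provides /usr/bin/time, which the CDeep3M scripts call
sudo apt-get install -y time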

Also, for some reason we had to update caffetrain.sh at line 166 to pass an absolute path to --solver. We are not sure why that was necessary, but getting the whole thing to work was the top priority, so we did not dig further.

Thanks again and really a nice model!

coleslaw481 commented 5 years ago

Hi, I made a ticket #56 to correct the first issue you mentioned.

I have a couple of questions regarding the second issue. Were you passing an absolute or a relative path? Also, were there any spaces or unusual characters in the model path?

thanks,

chris

cakuba commented 5 years ago

Hi, thanks for the update! Yes, we have to use absolute paths in both solver.prototxt and train_val.prototxt, and I didn't notice any special characters in the path of our working directory. For instance, in solver.prototxt the update is:

net: "/usr/local/src/CDeep3M/cdeep3m-1.6.2/model_imod_train_out//1fm/train_val.prototxt" .. snapshot_prefix: "/usr/local/src/CDeep3M/cdeep3m-1.6.2/model_imod_train_out//1fm/trainedmodel/1fm_classifer"

This is needed since we also want to retrain the model from a snapshot. To obtain these absolute paths, the two Matlab scripts in script/functions/update__*.m were modified. In train_val.prototxt we also had to update the line containing "class_prob_mapping_file:" to include the absolute path. At this stage, I'm not sure whether the need for absolute paths in these two files is a "bug" in Caffe or in my Ubuntu environment setup, since on AWS everything seems to work fine.
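
For completeness, the substitutions we ended up applying look roughly like the following; the install path is from our machine and the patterns are only a sketch of our workaround, not a general fix:

TRAIN_OUT=/usr/local/src/CDeep3M/cdeep3m-1.6.2/model_imod_train_out/1fm
# Point net: and snapshot_prefix: in solver.prototxt at absolute locations
sed -i "s|^net: \".*\"|net: \"${TRAIN_OUT}/train_val.prototxt\"|" "${TRAIN_OUT}/solver.prototxt"
sed -i "s|^snapshot_prefix: \".*\"|snapshot_prefix: \"${TRAIN_OUT}/trainedmodel/1fm_classifer\"|" "${TRAIN_OUT}/solver.prototxt"
# The class_prob_mapping_file: line in train_val.prototxt needs the same kind of
# absolute-path substitution; the exact file name depends on your training output.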

Also, two quick questions: (1) We are using two 1080Ti cards here, and predicting a single 1024x1024 PNG with a 30000-iteration pretrained CDeep3M model takes about 3 minutes, while a single 512x512 PNG takes about 30 seconds. Is that normal? (2) Could you provide a rough estimate of the memory needed for training? We tried to train on our own dataset of 200 images, each 1024x1024. It failed; the server currently has only 32GB of RAM, although the 1080Ti itself seems fine.

Thanks.

Brett

haberlmatt commented 5 years ago

Hi, regarding your questions:

2) I ran a quick test with a 210-image stack: during training the process used about 45% of the memory on an instance with 60GB of RAM (and a single K80), and on the K80 it used 9819MiB of its 11439MiB. So I agree with you that the failure is more likely due to the machine's RAM than to the 1080Ti. I'm not sure whether you were using both 1080Tis to train two models in parallel; if so, that would most likely be too much for 32GB of RAM. For a single GPU it should be pretty close, so with a stack of 150 images I'd expect you to be able to run it even with 32GB of RAM.
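
If you want to confirm which limit you are hitting, watching host RAM and GPU memory during a training run is enough (standard tools, nothing CDeep3M-specific):

# Host RAM, refreshed every 5 seconds
free -h -s 5
# GPU memory per card, refreshed every 5 seconds
nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv -l 5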

1) We will do some tests on 1080s in the next few weeks, but I'd expect them to be slightly slower than the K80. Generally speaking, the time each image takes depends on a couple of factors, but if speed is a concern, keep in mind that in the standard setting each image is processed 40 times (8 x 1fm + 16 x 3fm + 16 x 5fm), and with a well-trained model the augmentation can be reduced without losing much accuracy (see https://github.com/CRBS/cdeep3m/wiki/Speed-up). I'd recommend starting with --augspeed 10; you can combine it with using fewer models, e.g. --models 1fm. For Demo run 1, for example, I get the 5 images predicted in 9 minutes on the K80, whereas I get the result in 19 seconds when adding the flags --augspeed 10 --models 3fm.
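
Assuming you launch prediction with runprediction.sh as in Tutorial 3, that call looks roughly like this (the directory arguments are placeholders for your trained model, input images and output folder):

# Reduced augmentation and a single model, as discussed above
runprediction.sh --augspeed 10 --models 3fm ~/model ~/testimages ~/predictout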

Pre- and post-processing also add some time to the whole process, since we needed a standard pipeline that can handle large images as well.

Hope this helps

Best, Matt

cakuba commented 5 years ago

Hi, Matt,

Thanks a lot for the detailed response! It helps us a lot. We will continue to apply CDeep3M to our vessel and neural data and see how it works. By the way, have you ever considered also forking a TensorFlow/Keras version of CDeep3M? I personally believe it would benefit greatly from the full power of the Python libraries, compared with the Octave/Matlab-style scripts currently used. Thanks again.

Brett

haberlmatt commented 5 years ago

Hi Brett, yes, we are working on some enhancements in the background, including a Python version. -Matt