NVIDIA / DIGITS

Deep Learning GPU Training System
https://developer.nvidia.com/digits
BSD 3-Clause "New" or "Revised" License

Digits with Caffe on CPU #1911

Open laszl0 opened 6 years ago

laszl0 commented 6 years ago

Hello,

Not sure where this information belongs, just wanted to share.

I wanted to use DIGITS on my MacBook Pro 13 without a GPU. (I also have another system running DIGITS + CoreOS on top of a GTX 1070.)

I pulled the latest Docker image (6.0) while searching the net for how to run DIGITS with Caffe on CPU.

The suggestion I found in multiple places was to compile Caffe on OSX and then use the CAFFE_ROOT environment variable to point DIGITS at the local build.

I did not want to install Caffe directly on my MacBook, because I prefer to use Docker for this kind of situation.

I looked at the Caffe CPU Dockerfile and compared it with the DIGITS 6.0 Dockerfile to see how the cmake parameters differ.

I fired up VS Code, created a Dockerfile, and pasted in the contents of the DIGITS 6.0 Dockerfile.

I changed line 70 from:

cmake -DCMAKE_INSTALL_PREFIX=/usr/local/caffe -DUSE_NCCL=ON -DUSE_CUDNN=ON -DCUDA_ARCH_NAME=Manual -DCUDA_ARCH_BIN="35 52 60 61" -DCUDA_ARCH_PTX="61" .. && \

to:

cmake -DCMAKE_INSTALL_PREFIX=/usr/local/caffe -DCPU_ONLY=1 .. && \

and ran docker build.

While Docker was building, I prepared a small dataset for object detection. After the build finished, I ran the image, configured the dataset and model parameters in DIGITS, and started training. Caffe was running in CPU mode, training my model.

Question: can this information be included in the docs, for people who want to use DIGITS with Caffe on CPU for whatever reason?

Or are there complications with this approach that I'm not aware of, because I don't know enough?

Thanks.

ontheway16 commented 6 years ago

@laszl0 Hello, I followed the above instructions and built the Docker image. I am new to the Docker environment. Upon completion, I tried the following command to run DIGITS:

docker run --runtime=nvidia --name digits -d -p 5000:5000 nvidia/digits

It worked. Then I tried the following to see the previously trained models within this DIGITS instance:

sudo nvidia-docker rm digits
sudo nvidia-docker run --runtime=nvidia --name digits -d -p 5000:5000 -p 6006:6006 -v /media/user01/bd00e3c9-7644-472e-b940-d88866d46538/home/user01/digits/digits/jobs:/jobs nvidia/digits:6.0

It also succeeded, and the old jobs were available in the DIGITS web interface. I tried DetectNet object detection inference while "nvidia-smi -l 1" was running; unfortunately, all processing was still taking place on the GPU, and the DIGITS main page was showing 1/1 GPU available. Here is the output of the $ sudo nvidia-docker images command:

REPOSITORY          TAG                              IMAGE ID            CREATED             SIZE
<none>              <none>                           a2e8b6178bad        2 hours ago         2.37GB
<none>              <none>                           a891b2b8bd61        3 hours ago         2.99GB
<none>              <none>                           1eab0a961686        3 hours ago         1.38GB
nvidia/cuda         latest                           9337ecb4311e        12 days ago         2.24GB
nvidia/cuda         8.0-cudnn5-devel-ubuntu14.04     9ef12db61cfd        12 days ago         1.89GB
nvidia/cuda         8.0-cudnn5-runtime-ubuntu14.04   b3c5f37b54b1        12 days ago         983MB
ubuntu              14.04                            8cef1fa16c77        2 weeks ago         223MB
hello-world         latest                           e38bc07ac18e        4 weeks ago         1.85kB
nvidia/digits       6.0                              fb4bfabb5acd        7 weeks ago         2.8GB
nvidia/digits       latest                           fb4bfabb5acd        7 weeks ago         2.8GB

The first three repositories with no name were produced by (your) docker build command, as far as I know. I am not sure what they are. I am using Ubuntu 16.04, by the way.

All I need is CPU-only inference for DetectNet, to be able to make use of the larger system memory. Can you or anyone interested tell me what I am doing wrong?

laszl0 commented 6 years ago

Hi @ontheway16, reading what you did, I would suggest that the first part of the command, sudo nvidia-docker run --runtime=nvidia .... (this command is for running Docker containers with NVIDIA GPU support), should be just sudo docker run ....., because you want to run the image on CPU.

If what I wrote in the post does not work for you, the Docker image is available here on Docker Hub, in case someone wants to use the image without making the modifications I explained. The command I used:

sudo docker run --name digits -d -p 5000:5000 -p 6006:6006 -v /Users/laszl0/Documents/test-digitsv6/jobs:/jobs laszl0/nvidia-digits-caffe-cpu

(And yes, in the Info tab, while DIGITS is running, it will still say 'Caffe flavor: Nvidia', but that should not be a problem.) This solution worked for me.

ontheway16 commented 6 years ago

@laszl0 Thank you very much for the reply. Before I start trying, I just want to verify my idea of using CPU inference for its larger memory. My idea was to train the model with DIGITS on GPU (which I already did), then run inference on larger images (which normally do not fit into GPU memory) using DIGITS in CPU mode. Is that feasible, or am I on the wrong route?

Edit: OK, I tested with docker .... , and inference with the same test image as above ended with the following error:

libdc1394 error: Failed to initialize libdc1394
WARNING: Logging before InitGoogleLogging() is written to STDERR
F0516 12:46:46.416012    72 gpu_memory.hpp:27] Out of memory: failed to allocate 14008320 bytes on device 0
*** Check failure stack trace: ***

And GPU usage was still rising to 3–15% during DIGITS inference.

laszl0 commented 6 years ago

@ontheway16 you might be on the wrong route, because...

  1. One would almost always use a GPU for inference because of the speed improvement over CPU inference, but let's say you want CPU anyway.
  2. Whether the model is trained on GPU or CPU (in this case you used DetectNet, if I'm correct), the network architecture has an input layer with a fixed size, usually WIDTH x HEIGHT x DEPTH (color channels). When you give this newly trained model an image to detect objects in, that image has to match the size of the input layer.

Let's say you followed this tutorial; when creating the dataset for training, it says:

Resize images to make them compatible with the network architecture. For our example, we specify 1536 x 1024 pixels.

So all the images you use for training are resized to 1536 x 1024 pixels. When doing prediction, whether on CPU or GPU, your input image has to be resized to 1536 x 1024 pixels as well, otherwise the model you trained won't work correctly. There is even a restriction on the size of the objects annotated with bounding boxes; in the case of the tutorial:

the object size should be between 50Ɨ50 and 400Ɨ400 px in the input images.

Giving "larger images" to the model that you have trained to detect the objects, not really possible.

About the memory part you mention: usually you use a GPU for training because training is faster (DNN operations run faster on GPUs thanks to advances over roughly the last six years in both GPU architectures and DNN implementations), and the larger the GPU memory, the bigger the models you can create.

In your case, you have images, and each image is only a couple of megabytes (guessing <100MB), because the size of the network's input layer constrains it to a certain size.

Hope my answer helps!

Edit: I see your edit in the comment above...

  1. How big is the test image, in MB?
  2. My guess is that your Docker image is built in a way that still uses the GPU, even if you followed what I wrote. (To double-check, compare your Dockerfile with mine.)

ontheway16 commented 6 years ago

@laszl0 I appreciate such a detailed answer. Despite the years that have passed since the introduction of deep learning frameworks, object detection systems focus only on 'speed', which is something I don't need at all. My needs are centered around two things: highest accuracy, and detection of small objects (around 40–70 pixels in width/height) within very high resolution images (in the 20–80 megapixel range, which DSLR cameras can produce).

I have been thinking about how to apply inference to very large images for literally several months, and the only feasible solution I came up with is 'crop -> infer -> stitch'; I guess you know what I mean. Unfortunately, I have failed to find a single example of such a process in object detection solutions, which is hard to understand, since there are hardly any cameras left with 640x480 or 1280x800 resolutions; the inference side of these systems should already be designed to handle resolutions around 10MP (thinking of all the mobile phones)... I have pulled your Docker image and successfully started DIGITS in CPU mode.

I want to share the results of my experiments. I chose two models trained last year. The first was trained with 1248x384x3 images, with stride 8; inference results: 1248x384: success. 2560x456: success. 2560x912: success. Then I switched to the second model, which was trained with 5120x896x3 images, stride 16; results: 2560x456: success. 2560x912: success. 5120x912: success. 5120x1824: success. 5120x2048: success. 5120x2296: success. 5120x3000: success. 5120x3200: success.

With the last one, I hit my swap space limit (set at only 15GB; about 13.7GB of it was used, along with 15.4GB of RAM, all filled in the last example).

Detection accuracy was as good as with regular-sized inference, but I don't know what happens if I extend the image length further; I just ordered a couple of 16GB sticks to find out ;)

I am not sure how this works, since it is the opposite of your description above, but I suspect the "Do not resize input image(s)" option on the DIGITS inference page is responsible. Or maybe I just haven't hit the actual walls of the network design yet... I am really thankful for the CPU mode you provided, and I hope one day someone considers an all-in-one way of applying inference to today's image dimensions.

laszl0 commented 6 years ago

@ontheway16 I might have misunderstood you... sorry, I was too generic...

Thanks for sharing the idea you are working on. :) What I know is that progress is being made on this path; it's just internal most of the time, so we have to wait until results are made public.

For your mobile example, I would say it makes sense not to use the full 10+MP camera output:

  1. Sending the image to a server for inference takes time and costs bandwidth.
  2. Storing it locally and processing it offline sometime in the future is problematic and not user friendly (though it might work for specific use cases).
  3. Mobile resources (CPU and RAM) for local inference are limited, so it's better to resize the image to 640x480 or 1280x800 and then run inference with a model that fits within those limits. So use cases for the raw images from a mobile camera are very specific, in my opinion.

So coming back, I understand that there are raw images (20-80MP) for some problems, but usually some of the questions to ask before starting to solve the problem are:

  1. What do you want to classify, detect, or segment in these images?
  2. Can you use today's solutions to solve your problem, or do you have to invent a new solution?

About your experiment results, I'm very interested in further progress (the 16GB sticks solution).

I am not sure how this works, since it is the opposite of your description above, but I suspect the "Do not resize input image(s)" option on the DIGITS inference page is responsible.

Indeed.

ontheway16 commented 6 years ago

@laszl0 OK, first, as a learner, I managed to run the correct Docker image among the ones in the $ docker images list (the one built with your instructions), but it still shows 1/1 GPU available at the top of the DIGITS web page. I am afraid I may have built it using the nvidia-docker build ..... command, which might be the reason it enables the GPU. Also, maybe irrelevant, but I am using Ubuntu 16.04, while the Dockerfile has something like 14.04 on its very first line. Should it stay like that, or does it need modification for a 16.04 operating system?

Mobile resources (CPU and RAM) for local inference are limited, so it's better to resize the image to 640x480 or 1280x800 and then run inference with a model that fits within those limits. So use cases for the raw images from a mobile camera are very specific, in my opinion.

You are right about the current generation of handsets, but I believe mobile solutions that make good use of the existing high-res onboard cameras will start to appear before long; this depends on the existence of object recognition and detection apps, of course. And yes, cloud computing is one possibility, but there are places where real-time inference is needed and there is no 3G connection. The need for large images is mostly about detecting relatively small objects within the whole single high-res image captured.

About your experiment results, I'm very interested in further progress (the 16GB sticks solution).

I already ordered a couple of sticks to add an extra 32GB to the existing 16GB, but meanwhile I found a temporary extra 16GB stick to experiment with, so it became 32GB + 15GB swap. The result is that it can process 10240x1824 images, but for some reason not more than 3000 pixels in length for 5120x3K files. With the 5120x3200 files I tried yesterday, a lower portion of the image was not detected at all, as I realised later. After checking the deploy.prototxt in the relevant job folder, I found the "cluster" section at the end of the file was set to param_str: "5120, 3000, 16,..................... For some reason, the 5120 in that line allows a 10240 width (or anything between 5K and 10K wide), but nothing longer than 3000 in length, if that is relevant at all. For a 20MP image, all 32GB of RAM was used and inference took about 5 minutes to complete on my i7-5820K CPU @ 3.30GHz Ɨ 12, with only a single thread used (one CPU thread stayed at 100% until completion). So it seems this will work out after adding enough RAM and some modifications. Now the question is whether we are making good use of the existing CPU; I read something about MKL-DNN etc., but I do not know how the Intel Caffe found in your Docker image utilizes the latest CPU-based features.

Five minutes of inference time is a bit on the edge when a good pile of images is going to be processed :) And I am really sad that I can't make use of the seven 1080 Tis lying over there for inference. I already created a solution for DIGITS GPU inference on cropped parts of an image with re-stitching, but it's too patchy, requires other software, etc., so it's messy; you may take a look:

https://alpslabel.wordpress.com/2017/04/09/alps-large-image-annotation-tools-liant-for-detectnet/

Maybe the same methodology could be applied by using tf.image.non_max_suppression for the re-stitch part, I don't know...
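
Something like this, maybe (just a rough sketch I put together from the TensorFlow docs, not tested; the box values are made up): detections from all crops would first be shifted into full-image coordinates, then tf.image.non_max_suppression keeps one box per duplicated detection:

import tensorflow as tf

# Boxes are [y1, x1, y2, x2] already shifted into full-image coordinates
# (crop offsets added back); the values here are made up.
boxes = tf.constant([
    [100.0, 200.0, 160.0, 260.0],   # detection from crop A
    [102.0, 198.0, 158.0, 262.0],   # same object seen again in overlapping crop B
    [400.0, 900.0, 455.0, 960.0],   # a different object
])
scores = tf.constant([0.9, 0.8, 0.7])

# Keep at most 100 boxes, dropping any box that overlaps a
# higher-scoring one by more than 50% IoU.
keep = tf.image.non_max_suppression(boxes, scores, max_output_size=100, iou_threshold=0.5)
print(tf.gather(boxes, keep).numpy())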

ontheway16 commented 6 years ago

What do you want to classify, detect, or segment in these images? Can you use today's solutions to solve your problem, or do you have to invent a new solution?

@laszl0 I forgot to answer these two. I need to detect and count small objects in very large images, and I believe today's software/hardware tech is more than enough for this; the only thing is, as you pointed out above, it is not released to the public, I believe. The only similar implementation for processing large images in parts and stitching the results is https://github.com/thstkdgus35/EDSR-PyTorch The author uses a "chop" option to apply super-resolution to very large images, and the parts are stitched back together for the final image; not the same thing as object detection of course, but it may give some ideas.

laszl0 commented 6 years ago

@ontheway16 Congrats on getting Docker under control šŸ‘ .... Don't worry about DIGITS still showing 1/1 GPU, or about the Docker image being based on Ubuntu 14.04 while your OS is 16.04... no worry for now.

Regarding the parameters for "param_str", did you also adjust the rest of the values? Sample:

param_str: "5120, 3200, 16, 0.06, 3, 0.02, 10"

Just asking, because as explained, param_str passes these values on to OpenCV's groupRectangles, which in turn needs proper values, as it's pointed out here.
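
For illustration, a quick sketch of what groupRectangles does (assuming OpenCV's Python bindings; the rectangles are made up), since some of the param_str values end up feeding this grouping step:

import cv2

# Rectangles are [x, y, width, height]; the first two overlap heavily
# and should be merged into one detection, the third stands alone.
rects = [
    [100, 100, 60, 60],
    [104,  98, 60, 62],
    [400, 300, 50, 50],
]

# groupThreshold=1 means a cluster needs at least 2 similar rectangles
# to survive; eps controls how different rectangles may be and still
# be grouped together.
grouped, weights = cv2.groupRectangles(rects, groupThreshold=1, eps=0.2)
print(grouped, weights)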

I have seen an implementation documented where the network architecture is created in a way that crops the big image, does inference, and then puts everything back together at the end in a single forward pass. So there are solutions to this, just not public ones; it was built with TensorFlow, if I remember correctly.

Great job building the ALPS tool, and even a custom solution for large images šŸ‘ .

I hope that one day the steps above can be done within the DIGITS framework. Until that day, enjoy the toolset and good luck!

You can fork DIGITS and make the modifications yourself if you want to, just saying.

About using CPU features: you have to build the Docker image yourself on your machine. Just create a folder, drop my Dockerfile in there, and build the image; then it will take advantage of your CPU's features. It's definitely a pity to have so many 1080 Tis lying around doing nothing, so what do you think about the second approach they describe in this article? My reason for going in that direction is your small objects and using the Tis, if you agree...

ontheway16 commented 6 years ago

@laszl0

Regarding the parameters for "param_str", did you also adjust the rest of the values?

Hi, I have tested many different numbers, and the level of success changes depending on the objects' pixel dimensions (I take images at a fixed distance, so my objects are almost always the same size, a perfect target for a detector :) ). See an example below:

param_str: "5120, 864, 16, 0.01, 1, 0.020, 10, 2"

I guess this one was for two classes.

I have seen an implementation documented where the network architecture is created in a way that crops the big image, does inference, and then puts everything back together at the end in a single forward pass. So there are solutions to this, just not public ones; it was built with TensorFlow, if I remember correctly.

I know such projects exist, like the one below, https://blogs.flytbase.com/arabian-oryx-detection-counting/

You can fork DIGITS and make the modifications yourself if you want to, just saying.

I wish I could, but my knowledge of modifying a network architecture is very limited; currently I just modify some parameters etc., so I don't have high hopes for that route. I can plan the functionality very well, but implementation is weak here :) Data science || programming is not my background.

Great job building the ALPS tool, and even a custom solution for large images +1 .

Thank you. I am putting all the code there in the hope of helping someone else out, and judging by the downloads, I think it has.

It's definitely a pity to have so many 1080 Tis lying around doing nothing, so what do you think about the second approach they describe in this article? My reason for going in that direction is your small objects and using the Tis, if you agree...

Yeah, I asked about this when I first discovered the object detection thing, and at that time Greg suggested segmentation. But the problem is that I am not interested in the location/area information of the detected objects; I only need numbers per object class, i.e. how many of them there are per image. Therefore segmentation is of little or no help to me, since some of the same-class objects touch each other and segmentation would see them as one, while DetectNet is (weakly) capable of separating them, and the results are countable. While the 1080 Tis work very well for training purposes, I blame NVIDIA for not coming up with a solution for high resolution images, like unifying the RAM of connected GPUs, etc. Should I wait for a 32GB Tesla V100 price drop to make use of GPU inference? Nowadays I am thinking about an intermediate solution: cropping images and saving them to a RAM disk, then stitching the results back together with some faster NMS, like the one below, https://www.pyimagesearch.com/2014/11/17/non-maximum-suppression-object-detection-python or this one: https://www.pyimagesearch.com/2015/02/16/faster-non-maximum-suppression-python/
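
To show what I mean, a rough NumPy sketch along the lines of those posts (boxes as [x1, y1, x2, y2] in full-image coordinates; just an illustration, not DIGITS code):

import numpy as np

def non_max_suppression(boxes, overlap_thresh=0.5):
    # Keep one box per group of heavily overlapping boxes.
    if len(boxes) == 0:
        return boxes
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    area = (x2 - x1 + 1) * (y2 - y1 + 1)
    order = np.argsort(y2)
    keep = []
    while len(order) > 0:
        i = order[-1]
        keep.append(i)
        # Overlap of the remaining boxes with box i.
        xx1 = np.maximum(x1[i], x1[order[:-1]])
        yy1 = np.maximum(y1[i], y1[order[:-1]])
        xx2 = np.minimum(x2[i], x2[order[:-1]])
        yy2 = np.minimum(y2[i], y2[order[:-1]])
        w = np.maximum(0, xx2 - xx1 + 1)
        h = np.maximum(0, yy2 - yy1 + 1)
        overlap = (w * h) / area[order[:-1]]
        # Drop box i from the queue plus everything that overlaps it too much.
        order = np.delete(order[:-1], np.where(overlap > overlap_thresh)[0])
    return boxes[keep]

boxes = np.array([[10, 10, 60, 60], [12, 8, 62, 58], [200, 200, 260, 260]], dtype=float)
print(non_max_suppression(boxes))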

But of course none of these replaces a properly coded solution. And physically saving each cropped part means a lot of PNG encoding, which takes a ton of time and kills the GPU advantage. I even offered a small payment on freelancing sites for a single-shot solution, but it seems no one is interested. Therefore, CPU inference is currently invaluable to me. It's too bad that NVIDIA integrated TensorFlow into DIGITS but did not provide an official example for object detection; I am stuck with DetectNet (Inception v2, I guess?). I am not even sure NVIDIA keeps developing DIGITS...

Regards,

Alper

ontheway16 commented 6 years ago

@laszl0 5120x6000, success.

3x16GB sticks plus 15GB swap was barely enough for the 40MP image. Swap usage was at 13.7GB, so 4x16GB should do it, I guess.

laszl0 commented 6 years ago

@ontheway16 Lots of info, thanks! Sorry, I have been busy and am still busy these days, but congrats on the 40MP image, keep it up, I want to see where all this goes :). I'll be back with a couple of answers in the following days!

ontheway16 commented 6 years ago

@laszl0 Now I am looking for a way to modify two lines in two files inside the Docker image.

https://github.com/NVIDIA/caffe/blob/v0.15.13/python/caffe/layers/detectnet/clustering.py#L8 https://github.com/NVIDIA/caffe/blob/v0.15.13/python/caffe/layers/detectnet/mean_ap.py#L5

I need to change both of them to 2000, but I could not figure out how to do it while everything is inside Docker...?

laszl0 commented 6 years ago

@ontheway16 Hi, about the other topics: I have an idea for big images, and I will develop it in the coming weeks, once I finish my current work.

About your Caffe problem, if you have not figured it out already, check out this file and the line #Edit your caffe files. There you can edit other Caffe files as well. You can use that Dockerfile; it's the same as the last one, except that I added these lines.

All the best!

ontheway16 commented 6 years ago

@laszl0 Great to hear. I have several ideas for it too, but again, implementation. If we communicate, I believe we can develop a good solution. Meanwhile:

$ sudo docker ps -a
CONTAINER ID        IMAGE                            COMMAND              CREATED             STATUS                  PORTS                                            NAMES
d75dde1216f9        laszl0/nvidia-digits-caffe-cpu   "python -m digits"   30 minutes ago      Up 30 minutes           0.0.0.0:6006->6006/tcp, 0.0.0.0:5005->5000/tcp   digits
ef364c773449        hello-world                      "/hello"             3 days ago          Exited (0) 3 days ago                                                    kind_darwin

# open a shell inside the running DIGITS container
sudo docker exec -it d75dde1216f9 sh

# the DetectNet Python layers live here inside the image
cd /usr/local/python/caffe/layers/detectnet

# raise MAX_BOXES from 50 to 2000 in both files
sed -i "/MAX_BOXES = 50/c\MAX_BOXES = 2000" clustering.py
sed -i "/MAX_BOXES = 50/c\MAX_BOXES = 2000" mean_ap.py

exit

This solved the problem without rebuilding the image.

laszl0 commented 6 years ago

@ontheway16 Indeed, exec-ing into the running Docker container is easier... my bad!

Sounds good, let's try to come up with something together!

ontheway16 commented 6 years ago

@laszl0 Hi, it turns out the MAX_BOXES commands above are not permanent (they are lost once the container is removed), so it's better to build a new image with your mods.

What else... I made some modifications to the network and am now training/inferencing with stride 8 instead of 16. The cost is high: applying inference to a 40MP image at stride 8 fills 48GB of RAM plus 79GB of swap space and took 25 minutes :) . So I have to admit that it's not practical to try to infer a whole large image at once, even on CPU. This takes us back to applying GPU-based inference to cropped parts. I can imagine a process like the one below;

  1. Crop a part sized to fill the GPU memory during inference (a predefined crop size), and store the crop offsets.
  2. Feed the cropped part to the first available GPU for inference.
  3. Collect the detected bounding box coordinates, apply the crop offset, and store them in an array.
  4. Go to 1 until the large image is covered.
  5. Apply non-maximum suppression to the bounding boxes to get rid of duplicates.
  6. Save the bounding boxes of each detected object class to disk, in KITTI format.

I think, very roughly, it should be something like this..?
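
Very loosely, in Python I imagine something like the sketch below (detect_crop is a placeholder for whatever actually runs the DetectNet inference, and the crop size and overlap are just example numbers):

import numpy as np

def detect_crop(crop):
    # Placeholder: run DetectNet inference on one crop and return
    # [x1, y1, x2, y2, score] boxes in crop coordinates.
    return np.zeros((0, 5))

def detect_large_image(image, crop_w=1248, crop_h=384, overlap=64):
    # Steps 1-4: slide a crop window over the large image, remember the
    # crop offsets, and shift detections back into full-image coordinates.
    h, w = image.shape[:2]
    boxes = []
    for y in range(0, max(h - overlap, 1), crop_h - overlap):
        for x in range(0, max(w - overlap, 1), crop_w - overlap):
            crop = image[y:y + crop_h, x:x + crop_w]
            for x1, y1, x2, y2, score in detect_crop(crop):
                boxes.append([x1 + x, y1 + y, x2 + x, y2 + y, score])
    boxes = np.array(boxes) if boxes else np.zeros((0, 5))
    # Step 5: apply non-maximum suppression here to merge the duplicate
    # boxes coming from the overlapping crops.
    return boxes

# Step 6: write the surviving boxes to disk in KITTI format.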

laszl0 commented 6 years ago

Oh... indeed, that's a lot of RAM and time šŸ‘ . I was looking into the scenario you described; how far did you get with the implementation? Could you drop me an email instead? Thanks!