ColeMurray / age-gender-estimation-tutorial

Tutorial for creating multi-task age and gender estimator in Tensorflow

Unable to find image gpu locally. #12

Open mohit-bansal opened 4 years ago

mohit-bansal commented 4 years ago

I have successfully built the images and followed all the steps. But when I run this command:

sudo docker run -v $PWD:/opt/app -e PYTHONPATH=$PYTHONPATH:/opt/app -it colemurray/age-gender-estimation-tutorial:gpu python3 /opt/app/bin/train.py --img-dir /opt/app/var/crop --train-csv /opt/app/var/train.csv --val-csv /opt/app/var/val.csv --model-dir /opt/app/var/cnn-model --img-size 224 --num-steps 200000

I get the following error:

Unable to find image 'colemurray/age-gender-estimation-tutorial:gpu' locally
docker: Error response from daemon: manifest for colemurray/age-gender-estimation-tutorial:gpu not found.
See 'docker run --help'.

Please help @ColeMurray @aclex @Gius-8. Thanks in advance.

Gius-8 commented 4 years ago

Hello, try to build and execute the CPU version.

mohit-bansal commented 4 years ago

Hello @Gius-8,

I built the CPU version using:

sudo docker build -t colemurray/age-gender-estimation-tutorial .

which gave the following result:

Sending build context to Docker daemon 7.012GB
Step 1/5 : FROM tensorflow/tensorflow:1.12.0-py3
 ---> 39bcb324db83
Step 2/5 : RUN apt-get update && apt-get install -y libsm6 libxrender-dev libxext6
 ---> Using cache
 ---> 3aac0d9f89d8
Step 3/5 : ADD $PWD/requirements.txt /requirements.txt
 ---> Using cache
 ---> 4cd1a3fab277
Step 4/5 : RUN pip3 install -r /requirements.txt
 ---> Using cache
 ---> f47ceb47b1ee
Step 5/5 : CMD ["/bin/bash"]
 ---> Using cache
 ---> 9c5bb21d9c0f
Successfully built 9c5bb21d9c0f
Successfully tagged colemurray/age-gender-estimation-tutorial:latest

After this I used the following command:

sudo docker run -v $PWD:/opt/app -e PYTHONPATH=$PYTHONPATH:/opt/app -it colemurray/age-gender-estimation-tutorial:cpu python3 /opt/app/bin/train.py --img-dir /opt/app/var/crop --train-csv /opt/app/var/train.csv --val-csv /opt/app/var/val.csv --model-dir /opt/app/var/cnn-model --img-size 224 --num-steps 200000

which again gave the following error:

Unable to find image 'colemurray/age-gender-estimation-tutorial:cpu' locally
docker: Error response from daemon: manifest for colemurray/age-gender-estimation-tutorial:cpu not found.
See 'docker run --help'.

I am totally new to this and don't know what is wrong.

aclex commented 4 years ago

@mohit-bansal Try dropping the :cpu suffix from the image name, since you're probably building it from the plain Dockerfile, i.e. try:

sudo docker run -v $PWD:/opt/app -e PYTHONPATH=$PYTHONPATH:/opt/app -it colemurray/age-gender-estimation-tutorial python3 /opt/app/bin/train.py …
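The build log earlier in the thread ends with "Successfully tagged colemurray/age-gender-estimation-tutorial:latest", so no image tagged :cpu or :gpu exists locally. Docker then tries to pull that tag from the registry, which is where the "manifest ... not found" error comes from. The sketch below illustrates the default-tag rule and two optional helpers. The function names are hypothetical and only for illustration; `docker images` and `docker tag` are standard Docker CLI commands.

```shell
#!/bin/sh
# Print the image reference Docker would actually look up:
# a reference without an explicit tag is treated as ":latest".
with_default_tag() {
    ref="$1"
    # Only inspect the part after the last "/", so that a registry
    # port (e.g. "localhost:5000/img") is not mistaken for a tag.
    last_part="${ref##*/}"
    case "$last_part" in
        *:*) printf '%s\n' "$ref" ;;
        *)   printf '%s\n' "$ref:latest" ;;
    esac
}

# Hypothetical helper: list the tags that exist locally for a repository.
list_local_tags() {
    docker images "$1"
}

# Hypothetical helper: add an extra tag (e.g. "cpu") pointing at the
# image currently tagged ":latest", so the ":cpu" command also works.
add_tag_alias() {
    docker tag "$1:latest" "$1:$2"
}

with_default_tag colemurray/age-gender-estimation-tutorial
with_default_tag colemurray/age-gender-estimation-tutorial:cpu
```

Running `list_local_tags colemurray/age-gender-estimation-tutorial` (with sudo if your setup needs it) shows which tags were really built. Either run the image without a tag, as suggested above, or call `add_tag_alias colemurray/age-gender-estimation-tutorial cpu` once and keep using the :cpu command.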

mohit-bansal commented 4 years ago

@aclex Thanks. Using the command without the suffix works. It seems to pick up whichever image (CPU or GPU) is available. Running the command with the GPU image gives some TensorFlow errors, but it seems to work fine with the CPU image.

Also, while training with the CPU image, the terminal output stops after step 1:

sudo docker run -v $PWD:/opt/app -e PYTHONPATH=$PYTHONPATH:/opt/app -it colemurray/age-gender-estimation-tutorial python3 /opt/app/bin/train.py --img-dir /opt/app/var/crop --train-csv /opt/app/var/train.csv --val-csv /opt/app/var/val.csv --model-dir /opt/app/var/cnn-model --img-size 224 --num-steps 200000

INFO:tensorflow:Using config: {'_service': None, '_save_checkpoints_secs': None, '_tf_random_seed': None, '_task_type': 'worker', '_protocol': None, '_device_fn': None, '_save_checkpoints_steps': 1500, '_keep_checkpoint_every_n_hours': 10000, '_is_chief': True, '_save_summary_steps': 100, '_session_config': allow_soft_placement: true graph_options { rewrite_options { meta_optimizer_iterations: ONE } } , '_eval_distribute': None, '_train_distribute': None, '_global_id_in_cluster': 0, '_keep_checkpoint_max': 5, '_evaluation_master': '', '_model_dir': '/opt/app/var/cnn-model', '_master': '', '_task_id': 0, '_num_worker_replicas': 1, '_log_step_count_steps': 100, '_num_ps_replicas': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f7ae0be0400>, '_experimental_distribute': None}
INFO:tensorflow:Not using Distribute Coordinator.
INFO:tensorflow:Running training and evaluation locally (non-distributed).
INFO:tensorflow:Start train and evaluate loop. The evaluate will happen after every checkpoint. Checkpoint frequency is determined based on RunConfig arguments: save_checkpoints_steps 1500 or save_checkpoints_secs None.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
2019-10-09 11:43:42.704587: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
INFO:tensorflow:Restoring parameters from /opt/app/var/cnn-model/model.ckpt-0
2019-10-09 11:43:42.835511: W tensorflow/core/framework/allocator.cc:122] Allocation of 411041792 exceeds 10% of system memory.
2019-10-09 11:43:42.835520: W tensorflow/core/framework/allocator.cc:122] Allocation of 411041792 exceeds 10% of system memory.
2019-10-09 11:43:42.839273: W tensorflow/core/framework/allocator.cc:122] Allocation of 411041792 exceeds 10% of system memory.
2019-10-09 11:44:06.101137: W tensorflow/core/framework/allocator.cc:122] Allocation of 411041792 exceeds 10% of system memory.
2019-10-09 11:44:06.101814: W tensorflow/core/framework/allocator.cc:122] Allocation of 411041792 exceeds 10% of system memory.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into /opt/app/var/cnn-model/model.ckpt.
INFO:tensorflow:loss = 5.639575, step = 1

Maybe training is just taking a lot of time. I will ask again if any problem occurs. Thanks for the time being.

mohit-bansal commented 4 years ago

@aclex As I mentioned in my previous comment, the training has started but it's taking a lot of time. Every 100 steps takes 8-10 minutes, and we specified 200000 steps in the command. So my question is: what is the smallest number of steps I can pass in the command that will still train the model properly?
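To put those numbers in perspective, here is a rough back-of-the-envelope estimate. It assumes the observed rate of 8-10 minutes per 100 steps stays constant for the whole run, which may not hold in practice. The function name is hypothetical; the 1500-step checkpoint interval comes from the `_save_checkpoints_steps` value in the log above.

```shell
#!/bin/sh
# Estimate training time in minutes from the observed rate.
#   $1 = number of steps
#   $2 = observed minutes per 100 steps
estimate_minutes() {
    echo $(( $1 / 100 * $2 ))
}

# Full run requested with --num-steps 200000.
low=$(estimate_minutes 200000 8)     # 16000 minutes
high=$(estimate_minutes 200000 10)   # 20000 minutes

# Shell arithmetic is integer-only, so results are rounded down.
echo "full run, hours: $((low / 60)) to $((high / 60))"
echo "full run, days:  $((low / 1440)) to $((high / 1440))"

# Time between checkpoints (one checkpoint every 1500 steps).
echo "per checkpoint, minutes: $(estimate_minutes 1500 8) to $(estimate_minutes 1500 10)"
```

At this rate the full 200000 steps would take roughly 11 to 14 days on the CPU, with a checkpoint (and evaluation) only every 2 to 2.5 hours.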

aclex commented 4 years ago

@mohit-bansal You're right: if I remember correctly, training takes quite a long time on the CPU. I don't know the exact number of epochs needed for the model to converge well enough; I ran the whole training myself. What I can suggest for the CPU case is to run it outside Docker, on the host system, either in a virtual environment or (as I did myself) in the system environment. Just repeat the commands from the Dockerfile. This should slightly increase the speed: although Docker virtualization is cheap, it still takes some resources, which could otherwise go to training.
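A minimal sketch of what "repeat the commands from the Dockerfile" could look like on the host, based only on the Dockerfile steps visible in the build log above. Several things here are assumptions rather than anything confirmed in this thread: a Debian/Ubuntu host, a virtual environment directory named `venv`, a Python version that TensorFlow 1.12 supports (3.6 or older, as far as I know), and paths relative to the repository root instead of /opt/app. The script only prints the commands by default; set DRY_RUN=0 to execute them.

```shell
#!/bin/sh
# Host-side equivalent of the Dockerfile steps quoted in the build log.
# Prints each command by default; set DRY_RUN=0 to actually run them.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

setup_host_env() {
    # Same system libraries the Dockerfile installs (Debian/Ubuntu assumed).
    run sudo apt-get update
    run sudo apt-get install -y libsm6 libxrender-dev libxext6
    # "venv" is a hypothetical name for the virtual environment directory.
    run python3 -m venv venv
    # The base image is tensorflow/tensorflow:1.12.0-py3, so pin that version.
    run venv/bin/pip install tensorflow==1.12.0
    run venv/bin/pip install -r requirements.txt
}

train_on_host() {
    # Same arguments as the docker run command, with /opt/app replaced
    # by the current directory (assumed to be the repository root).
    run env PYTHONPATH="$PWD" venv/bin/python bin/train.py \
        --img-dir var/crop --train-csv var/train.csv --val-csv var/val.csv \
        --model-dir var/cnn-model --img-size 224 --num-steps 200000
}

setup_host_env
train_on_host
```

If the CSV files were generated with absolute /opt/app paths inside the container, they may need to be regenerated or adjusted before training on the host.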

10shikha commented 3 years ago


@mohit-bansal I am facing the same issue of the training output stopping after step 1. How did you fix it?