# of GPU issue - Githubissues

610265158 / face_landmark

A simple method for face alignment based on wing loss and multitask learning :)

Apache License 2.0

251 stars, 80 forks

# of GPU issue #5

Closed: elPerro92 closed this issue 4 years ago

elPerro92 commented 4 years ago

Hi, I'm training on my images with this method. I have a PC with 2 GPUs (RTX 2080), and in train_config.py I have set the line:

config.TRAIN.num_gpu = 2

but whenever I start training, only the first GPU is used.

610265158 commented 4 years ago

You still need to make both devices visible, with os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
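A minimal sketch of that step, assuming it runs at the very top of the training script, before TensorFlow is imported (the variable is read when CUDA is initialised, so setting it later generally has no effect). The helper name `expose_gpus` is hypothetical and not part of the repository:

```python
import os


def expose_gpus(device_ids):
    """Set CUDA_VISIBLE_DEVICES to the given GPU indices and return the string that was set."""
    value = ",".join(str(i) for i in device_ids)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


# Expose GPU 0 and GPU 1 to this process.
visible = expose_gpus([0, 1])

# Import TensorFlow only after the variable is set, for example:
# import tensorflow as tf
```

The same effect can be had without touching the code by exporting the variable in the shell before launching the training script.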

elPerro92 commented 4 years ago

Thanks very much, now it's working. I'm new to TensorFlow ;). Is it possible to resume training from the last checkpoint?

610265158 commented 4 years ago

You're welcome. Yes, you can resume by setting config.MODEL.continue_train = True and config.MODEL.pretrained_model = 'the_pretrained.ckpt'.
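Written out as they might appear in train_config.py, the two settings look roughly like this. This is only a sketch: the `SimpleNamespace` object is a stand-in so the snippet runs on its own, and the checkpoint name is the placeholder from the reply, not a real file:

```python
from types import SimpleNamespace

# Stand-in for the repository's config object (an assumption for illustration);
# in the project these attributes are set inside train_config.py.
config = SimpleNamespace(MODEL=SimpleNamespace())

# Resume from an existing checkpoint instead of starting from scratch.
config.MODEL.continue_train = True

# Placeholder name; replace it with the checkpoint you want to resume from.
config.MODEL.pretrained_model = "the_pretrained.ckpt"
```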

I also suggest not using this code now, because TF 2.0 has been released. It is better to learn the new version, which is friendlier. I am working on it : )

elPerro92 commented 4 years ago

Thanks for the fast response. I'm using the 1.14-gpu version because I'm running landmark recognition on an Nvidia Jetson, and that is the latest version available for it. I will switch to 2.0 when a stable version is released for Jetson.

elPerro92 commented 4 years ago

Hi, I've trained with 2 GPUs, but the time per epoch is the same as with one GPU (56 minutes). When I try to restore the last checkpoint, this error appears in the terminal:

tensorflow.python.framework.errors_impl.NotFoundError: Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is missing from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

Tensor name "ShuffleNetV2/Stage2/unit_1/conv1x1_after/BatchNorm/beta" not found in checkpoint files model/epoch_46L2_1e-05.ckpt.index [[node save/RestoreV2 (defined at /home/USER/path/to/face_landmark-master/lib/core/base_trainer/net_work.py:72) ]]

How can I restore the checkpoint correctly?

610265158 commented 4 years ago

Hi, it should be config.MODEL.pretrained_model = 'model/epoch_46L2_1e-05.ckpt', without the .index suffix.
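For context, a TF 1.x checkpoint is stored as several files that share one prefix (`.index`, `.meta`, and one or more `.data-*` shards), and the restore call expects the prefix rather than any single file. A small sketch of a guard against passing a file name by mistake; the helper `checkpoint_prefix` is hypothetical and not part of the repository:

```python
import re

# Suffixes of the individual files that make up a TF 1.x checkpoint.
_CKPT_SUFFIX = re.compile(r"\.(index|meta|data-\d{5}-of-\d{5})$")


def checkpoint_prefix(path):
    """Return the checkpoint prefix, dropping a trailing per-file suffix if one is present."""
    return _CKPT_SUFFIX.sub("", path)


prefix = checkpoint_prefix("model/epoch_46L2_1e-05.ckpt.index")
```

Here `prefix` is the value to assign to config.MODEL.pretrained_model; a path that is already a prefix passes through unchanged.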