I set --batchsize 16.

With a batch size of 16:
RuntimeError: CUDA out of memory. Tried to allocate 80.00 MiB (GPU 0; 5.78 GiB total capacity; 3.73 GiB already allocated; 19.06 MiB free; 3.77 GiB reserved in total by PyTorch)
With a batch size of 8:
RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 5.78 GiB total capacity; 3.26 GiB already allocated; 31.38 MiB free; 3.32 GiB reserved in total by PyTorch)
With a newer version of torch I get an extended version of the error, which says:
If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
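For reference, that allocator option can be set as an environment variable before training starts; a minimal sketch (the value 128 is only an illustration, not something tested in this thread):

'''
# Set the CUDA caching-allocator config before any CUDA tensors are created.
# Equivalent shell form: PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128 python3 train.py ...
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # import torch only after the variable is set
'''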
Sorry for not solving your problem. I haven't seen this error before. I am using CUDA 11 and PyTorch 1.7.
Little update: a batch size of four works for me. But I am still wondering why I can't go higher, since I believe I have the computational power for it. It seems to me to be a torch problem with wrong allocation.
python3 train.py --batchsize 4 --data ~/ndds/Dataset_Synthesizer/Source/NVCapturedData/TestCapturer/ --object neuron --outf neuron --gpuids 0
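If the GPU simply cannot fit a larger batch, gradient accumulation is a common way to get the effect of a bigger one without more memory; a minimal generic PyTorch sketch (net, optimizer, trainingdata and compute_loss are placeholders here, not the actual names used in train.py):

'''
# Simulate an effective batch size of 16 with a physical batch size of 4
# by accumulating gradients over 4 steps before each optimizer.step().
accum_steps = 4
optimizer.zero_grad()
for i, (data, target) in enumerate(trainingdata):
    data, target = data.cuda(), target.cuda()
    output_belief, output_affinities = net(data)
    loss = compute_loss(output_belief, output_affinities, target)  # placeholder for the script's loss
    (loss / accum_steps).backward()  # scale so the accumulated gradient matches one big batch
    if (i + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
'''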
Another thing I am now wondering about: I am training on my dataset with the network pretrained on ImageNet, and the loss is too small right from the start:
start: 09:33:31.601213
load data
training data: 506 batches
load models
Training network pretrained on imagenet.
Train Epoch: 1 [0/2022 (0%)] Loss: 0.029323829337955
Train Epoch: 1 [400/2022 (20%)] Loss: 0.000023679231163
Train Epoch: 1 [800/2022 (40%)] Loss: 0.000008381257430
Train Epoch: 1 [1200/2022 (59%)] Loss: 0.000014029804333
Train Epoch: 1 [1600/2022 (79%)] Loss: 0.000005878804131
Train Epoch: 1 [2000/2022 (99%)] Loss: 0.000002487727670
Train Epoch: 2 [0/2022 (0%)] Loss: 0.000003322811153
Train Epoch: 2 [400/2022 (20%)] Loss: 0.000004343680757
Train Epoch: 2 [800/2022 (40%)] Loss: 0.000004866438303
Train Epoch: 2 [1200/2022 (59%)] Loss: 0.000002030236828
Train Epoch: 2 [1600/2022 (79%)] Loss: 0.000002176634098
Train Epoch: 2 [2000/2022 (99%)] Loss: 0.000003391759947
Train Epoch: 3 [0/2022 (0%)] Loss: 0.000001647578983
Train Epoch: 3 [400/2022 (20%)] Loss: 0.000001305555315
Train Epoch: 3 [800/2022 (40%)] Loss: 0.000002374987162
Train Epoch: 3 [1200/2022 (59%)] Loss: 0.000001534210355
Train Epoch: 3 [1600/2022 (79%)] Loss: 0.000004802184321
Train Epoch: 3 [2000/2022 (99%)] Loss: 0.000004886873739
Train Epoch: 4 [0/2022 (0%)] Loss: 0.000009389250408
Train Epoch: 4 [400/2022 (20%)] Loss: 0.000015918551071
Train Epoch: 4 [800/2022 (40%)] Loss: 0.000004416628599
Train Epoch: 4 [1200/2022 (59%)] Loss: 0.000000648438970
Train Epoch: 4 [1600/2022 (79%)] Loss: 0.000000662304274
Train Epoch: 4 [2000/2022 (99%)] Loss: 0.000009007605513
Train Epoch: 5 [0/2022 (0%)] Loss: 0.000003984898285
Train Epoch: 5 [400/2022 (20%)] Loss: 0.000001393861794
Train Epoch: 5 [800/2022 (40%)] Loss: 0.000004186821116
Train Epoch: 5 [1200/2022 (59%)] Loss: 0.000000712396343
Train Epoch: 5 [1600/2022 (79%)] Loss: 0.000000592076617
Train Epoch: 5 [2000/2022 (99%)] Loss: 0.000000580793312
Train Epoch: 6 [0/2022 (0%)] Loss: 0.000000495225947
Train Epoch: 6 [400/2022 (20%)] Loss: 0.000000627484724
Train Epoch: 6 [800/2022 (40%)] Loss: 0.000000910183417
Train Epoch: 6 [1200/2022 (59%)] Loss: 0.000000620817559
Train Epoch: 6 [1600/2022 (79%)] Loss: 0.000002660205155
Train Epoch: 6 [2000/2022 (99%)] Loss: 0.000001788120585
Train Epoch: 7 [0/2022 (0%)] Loss: 0.000002246752956
Train Epoch: 7 [400/2022 (20%)] Loss: 0.000000400816077
Train Epoch: 7 [800/2022 (40%)] Loss: 0.000000314132564
Train Epoch: 7 [1200/2022 (59%)] Loss: 0.000005796841833
Train Epoch: 7 [1600/2022 (79%)] Loss: 0.000000317658902
Train Epoch: 7 [2000/2022 (99%)] Loss: 0.000000164305803
Train Epoch: 8 [0/2022 (0%)] Loss: 0.000000189095587
Train Epoch: 8 [400/2022 (20%)] Loss: 0.000000214353605
Train Epoch: 8 [800/2022 (40%)] Loss: 0.000000203828307
Train Epoch: 8 [1200/2022 (59%)] Loss: 0.000000283352165
Train Epoch: 8 [1600/2022 (79%)] Loss: 0.000000136634853
Train Epoch: 8 [2000/2022 (99%)] Loss: 0.000000144215477
Train Epoch: 9 [0/2022 (0%)] Loss: 0.000000270883049
Train Epoch: 9 [400/2022 (20%)] Loss: 0.000000126488899
Train Epoch: 9 [800/2022 (40%)] Loss: 0.000000238181570
Train Epoch: 9 [1200/2022 (59%)] Loss: 0.000000171188475
Train Epoch: 9 [1600/2022 (79%)] Loss: 0.000000167808679
Train Epoch: 9 [2000/2022 (99%)] Loss: 0.000000133901352
Train Epoch: 10 [0/2022 (0%)] Loss: 0.000000186704639
Train Epoch: 10 [400/2022 (20%)] Loss: 0.000000156324205
Train Epoch: 10 [800/2022 (40%)] Loss: 0.000000202804600
Train Epoch: 10 [1200/2022 (59%)] Loss: 0.000000140662607
Hi,
Thanks a lot for that. Don't you maybe have a link for this script?
Thanks
The loss looks really bad; I'm afraid the training doesn't really work. I have run into this before when my dataset was generated arbitrarily, and I suggest that the background images of your dataset should try to match your work scene. You can look at the belief maps to figure out whether the training is working. I run this script to see the belief maps:
'''
import cv2
import matplotlib.pyplot as plt
import torch
import os
import numpy as np
from inference.cuboid import Cuboid3d
from inference.cuboid_pnp_solver import CuboidPNPSolver
from inference.detector import ModelData, ObjectDetector
import yaml
from PIL import Image
from PIL import ImageDraw
from torch.autograd import Variable
import torchvision.transforms as transforms
import time

os.environ['KMP_DUPLICATE_LIB_OK'] = 'True'

# Settings (where a variable is assigned twice, the later assignment takes effect)
name = 'flange_plate'
net_path = 'data/net/mustard_60.pth'
net_path = '/home/zxm/usb/ur5/src/DOPE-ROS-D435-dependabot-pip-dope-opencv-python-4.2.0.32/dope/weights/cracker_60.pth'
gpu_id = 0
img_path = '/home/zxm/usb/ur5/src/DOPE-ROS-D435-dependabot-pip-dope-opencv-python-4.2.0.32/dope/scripts/cola_image/023.png'
img_path = 'data/images/cautery_real_1.jpg'

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
])

# Function for visualizing feature maps
def viz_layer(layer, n_filters=9):
    fig = plt.figure(figsize=(20, 20))
    for i in range(n_filters):
        ax = fig.add_subplot(4, 5, i + 1, xticks=[], yticks=[])
        ax.imshow(np.squeeze(layer[i].data.numpy()), cmap='gray')
        ax.set_title('Output %s' % str(i + 1))

# Load color image
in_img = cv2.imread(img_path)
in_img = cv2.resize(in_img, (640, 480))
in_img = cv2.cvtColor(in_img, cv2.COLOR_BGR2RGB)

# Plot image
plt.imshow(in_img)

# Load model
model = ModelData(name, net_path, gpu_id)
model.load_net_model()
net_model = model.net

# Run network inference (vertex / affinity)
image_tensor = transform(in_img)
image_torch = Variable(image_tensor).cuda().unsqueeze(0)
out, seg = net_model(image_torch)
vertex2 = out[-1][0].cpu()
aff = seg[-1][0].cpu()

# View the vertex and affinities
viz_layer(vertex2)
viz_layer(aff, n_filters=16)
plt.show()

# Code to visualize the neural network output
def DrawLine(point1, point2, lineColor, lineWidth):
    '''Draws line on image'''
    global g_draw
    if not point1 is None and point2 is not None:
        g_draw.line([point1, point2], fill=lineColor, width=lineWidth)

def DrawDot(point, pointColor, pointRadius):
    '''Draws dot (filled circle) on image'''
    global g_draw
    if point is not None:
        xy = [
            point[0] - pointRadius,
            point[1] - pointRadius,
            point[0] + pointRadius,
            point[1] + pointRadius
        ]
        g_draw.ellipse(xy, fill=pointColor, outline=pointColor)

def DrawCube(points, color=(255, 0, 0)):
    '''Draws cube with a thick solid line across the front top edge and an X on the top face.'''
    lineWidthForDrawing = 2
    # front face
    DrawLine(points[0], points[1], color, lineWidthForDrawing)
    DrawLine(points[1], points[2], color, lineWidthForDrawing)
    DrawLine(points[3], points[2], color, lineWidthForDrawing)
    DrawLine(points[3], points[0], color, lineWidthForDrawing)
    # back face
    DrawLine(points[4], points[5], color, lineWidthForDrawing)
    DrawLine(points[6], points[5], color, lineWidthForDrawing)
    DrawLine(points[6], points[7], color, lineWidthForDrawing)
    DrawLine(points[4], points[7], color, lineWidthForDrawing)
    # sides
    DrawLine(points[0], points[4], color, lineWidthForDrawing)
    DrawLine(points[7], points[3], color, lineWidthForDrawing)
    DrawLine(points[5], points[1], color, lineWidthForDrawing)
    DrawLine(points[2], points[6], color, lineWidthForDrawing)
    # dots on the front-top edge
    DrawDot(points[0], pointColor=color, pointRadius=4)
    DrawDot(points[1], pointColor=color, pointRadius=4)
    # X on the top face
    DrawLine(points[0], points[5], color, lineWidthForDrawing)
    DrawLine(points[1], points[4], color, lineWidthForDrawing)

# Settings
config_name = "my_config_realsense.yaml"
exposure_val = 166

yaml_path = 'cfg/{}'.format(config_name)
with open(yaml_path, 'r') as stream:
    try:
        print("Loading DOPE parameters from '{}'...".format(yaml_path))
        params = yaml.load(stream, Loader=yaml.FullLoader)  # Loader argument required by newer PyYAML
        print('    Parameters loaded.')
    except yaml.YAMLError as exc:
        print(exc)

models = {}
pnp_solvers = {}
pub_dimension = {}
draw_colors = {}

matrix_camera = np.zeros((3, 3))
matrix_camera[0, 0] = params["camera_settings"]['fx']
matrix_camera[1, 1] = params["camera_settings"]['fy']
matrix_camera[0, 2] = params["camera_settings"]['cx']
matrix_camera[1, 2] = params["camera_settings"]['cy']
matrix_camera[2, 2] = 1
dist_coeffs = np.zeros((4, 1))
if "dist_coeffs" in params["camera_settings"]:
    dist_coeffs = np.array(params["camera_settings"]['dist_coeffs'])

config_detect = lambda: None
config_detect.mask_edges = 1
config_detect.mask_faces = 1
config_detect.vertex = 1
config_detect.threshold = 0.5
config_detect.softmax = 1000
config_detect.thresh_angle = params['thresh_angle']
config_detect.thresh_map = params['thresh_map']
config_detect.sigma = params['sigma']
config_detect.thresh_points = params["thresh_points"]

for model in params['weights']:
    models[model] = \
        ModelData(
            model,
            "weights/" + params['weights'][model]
        )
    models[model].load_net_model()
    draw_colors[model] = tuple(params["draw_colors"][model])
    pnp_solvers[model] = \
        CuboidPNPSolver(
            model,
            matrix_camera,
            Cuboid3d(params['dimensions'][model]),
            dist_coeffs=dist_coeffs
        )

# Copy and draw image (in_img was already converted to RGB above)
img_copy = in_img.copy()
im = Image.fromarray(img_copy)
g_draw = ImageDraw.Draw(im)

for m in models:
    # Detect object
    results = ObjectDetector.detect_object_in_image(
        models[m].net,
        pnp_solvers[m],
        in_img,
        config_detect
    )
    # Overlay cube on image
    for i_r, result in enumerate(results):
        if result["location"] is None:
            continue
        loc = result["location"]
        ori = result["quaternion"]
        # Draw the cube
        if None not in result['projected_points']:
            points2d = []
            for pair in result['projected_points']:
                points2d.append(tuple(pair))
            DrawCube(points2d, draw_colors[m])

open_cv_image = np.array(im)
open_cv_image = cv2.cvtColor(open_cv_image, cv2.COLOR_RGB2BGR)
cv2.imshow('Open_cv_image', open_cv_image)
cv2.waitKey(0)

plt.imshow(open_cv_image)
plt.show()
'''
The loss looks really bad; I'm afraid the training doesn't really work. I suggest that the background images should match the object's working scene, and DOPE has difficulty dealing with symmetry. You can look at the belief maps to figure out whether the training is working. I use the python file below to see the belief maps: inference_on_img.txt
I am trying to train on this object:
I could probably add more visual features. Do you think that would work?
Btw, the code from inference_on_img.txt is giving me an error:
erikkockar@erikkockar:~/catkin_ws/src/Deep_Object_Pose/scripts$ python3 belief.py
Traceback (most recent call last):
File "belief.py", line 6, in <module>
from inference.cuboid import Cuboid3d
ModuleNotFoundError: No module named 'inference.cuboid'
There is a folder "inference" (dope/src/inference); copy it into your path: ~/catkin_ws/src/Deep_Object_Pose/scripts
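If you would rather not copy the folder, appending its parent directory to the Python path should also work; a minimal sketch (the path is a guess based on the paths in this thread, adjust it to wherever the "inference" folder actually lives):

'''
import sys
# Make "from inference.cuboid import Cuboid3d" resolvable without copying files.
sys.path.append('/home/erikkockar/catkin_ws/src/Deep_Object_Pose/src')
'''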
Sorry, I'm also new to this and can't offer more help, hhh~ Let's wait for others.
Perhaps the resolution of the training images is too high. Try scaling down the shorter side to about 500 px.
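A minimal sketch of that kind of downscaling with OpenCV (assumes plain image files; note that the NDDS keypoint annotations are in pixel coordinates, so depending on how the loader handles them they may need matching scaling, which this does not do):

'''
import cv2
import glob
import os

def downscale_shorter_side(path, target=500):
    """Resize an image in place so its shorter side is `target` pixels, keeping the aspect ratio."""
    img = cv2.imread(path)
    h, w = img.shape[:2]
    scale = target / min(h, w)
    if scale >= 1.0:
        return  # already small enough
    new_size = (int(round(w * scale)), int(round(h * scale)))
    img = cv2.resize(img, new_size, interpolation=cv2.INTER_AREA)
    cv2.imwrite(path, img)

for p in glob.glob(os.path.expanduser('~/ndds/Dataset_Synthesizer/Source/NVCapturedData/TestCapturer/*.png')):
    downscale_shorter_side(p)
'''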
Hey. A batch size of 4 does not seem weird to me. The 64 was originally split across multiple GPUs and a single GPU system will handle much less.
For the low loss, I would make sure you have named your object correctly. There is a chance you are telling it to learn nothing. To verify that, you can look at the belief maps.
I have been training DOPE on different DGX systems with V100 and P100 GPUs, so yes, a small batch size on a single GPU is normal. You could also look into using the SGD optimizer instead of Adam, which would use less memory. But I warn you that the optimization will be harder; I never succeeded in training DOPE with SGD.
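The swap itself is a one-liner in PyTorch; the learning rate and momentum below are placeholders, not values known to work for DOPE:

'''
import torch.optim as optim

# Adam keeps two extra tensors per parameter (first and second moment estimates):
# optimizer = optim.Adam(net.parameters(), lr=0.0001)

# SGD with momentum keeps only one (the momentum buffer), so it uses less GPU memory,
# but it is harder to tune for this network:
optimizer = optim.SGD(net.parameters(), lr=0.0001, momentum=0.9)
'''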
Sorry, I'm also new to this and can't offer more help, hhh~ Let's wait for others.

@HiHiAllen Thanks, it helped.
Perhaps the resolution of the training images is too high. Try scaling down the shorter side to about 500 px.

@mintar Thanks, you were right about it. My output images from NDDS are now 640x480 (my RealSense is set to this resolution), and it improved the loss from e-5 to e-2:
1, 0, 0.041358746588230
2, 0, 0.011250677518547
3, 0, 0.010450199246407
4, 0, 0.007594730239362
...
7, 0, 0.003804703243077
I still think it's too low, though. I am also thinking about going even lower, to around 320x240; what do you think about that?
Hey. A batch size of 4 does not seem weird to me. The 64 was originally split across multiple GPUs and a single GPU system will handle much less.

@blaine141 Yes, now I get it, thanks. I am able to run a maximum batch size of 6, and 20 epochs take around 3 hours, so it's still okay-ish.
For the low loss, I would make sure you have named your object correctly. There is a chance you are telling it to learn nothing. To verify that, you can look at the belief maps.

Yes, you were right. I checked _object_settings.json and it was empty. I exported a new, correct dataset, and my belief maps after 7 epochs look like this.
But after applying those weights in DOPE, my RealSense camera still cannot detect the object at all at 640x480 and 10 fps.
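For anyone hitting the same empty-export problem, a quick sanity check before training (assuming the usual NDDS output layout, where _object_settings.json sits next to the captured frames):

'''
import json
import os

# The class name passed to train.py via --object must match one of the exported classes;
# an empty list means the exporter was not configured with any objects.
path = os.path.expanduser('~/ndds/Dataset_Synthesizer/Source/NVCapturedData/TestCapturer/_object_settings.json')
with open(path) as f:
    settings = json.load(f)
print(settings.get('exported_object_classes', []))
'''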
I also hit this snag the first time I trained. Off the top of my head, I feel like I needed to train longer; I normally do 40 hours on a similar GPU. I will have to look back at my issues to see if it was anything else.
Little update:
I downscaled the training images to 320x240 and created 5000 samples with NDDS.
Thanks to the downscaling, I was able to run with a batch size of 8 (NVIDIA RTX 3060 laptop) using the command:
python3 train.py --batchsize 8 --data ~/ndds2/Dataset_Synthesizer/Source/NVCapturedData/TestCapturer/ --object neuron --outf neurontest --gpuids 0
After 35 epochs (a whole night of training, so approximately 8 hours) I am able to detect my robot in the environment. It still fails to detect from certain angles, but that is a matter of more training.
Thanks a lot to everyone. If there is nothing more from you, this issue can be closed.
Be careful: if you downscale your training images to 320x240, you should also later run inference at a similar resolution. You will get better results if you train at 640x480 and run inference at the same resolution. DOPE performs poorly if the "size in pixels" of an object (e.g., 100 pixels high) occurs in the test set, but not the training set. So if you have objects in the training set that are pretty close to the camera (so they are around 200 pixels high at 320x240 resolution), you will not be able to detect objects at a similar distance if you use a 640x480 resolution during testing (because then the object would be around 400 pixels high).
Of course you can also downscale the inference images, but that makes the objects harder to detect (especially if they are far away from the camera).
Hi guys,
I am trying to train the DOPE NN using your training script. After the command
python3 train.py --data ~/ndds/Dataset_Synthesizer/Source/NVCapturedData/TestCapturer/ --object neuron --outf neuron --gpuids 0
I get output:
start: 16:25:23.770094
load data
training data: 64 batches
load models
Training network pretrained on imagenet.
Traceback (most recent call last):
  File "train.py", line 1392, in <module>
    _runnetwork(epoch,trainingdata)
  File "train.py", line 1334, in _runnetwork
    output_belief, output_affinities = net(data)
  File "/home/erikkockar/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/erikkockar/.local/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 166, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/home/erikkockar/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "train.py", line 153, in forward
    out1 = self.vgg(x)
  File "/home/erikkockar/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/erikkockar/.local/lib/python3.8/site-packages/torch/nn/modules/container.py", line 139, in forward
    input = module(input)
  File "/home/erikkockar/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/erikkockar/.local/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 443, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/home/erikkockar/.local/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 439, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: CUDA out of memory. Tried to allocate 626.00 MiB (GPU 0; 5.78 GiB total capacity; 3.60 GiB already allocated; 283.88 MiB free; 3.62 GiB reserved in total by PyTorch)
Output of nvidia-smi (my GPU is an NVIDIA GeForce RTX 3060):

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...   On  | 00000000:01:00.0  On |                  N/A |
| N/A   41C    P8    17W /  N/A |    492MiB /  6144MiB |      4%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1082      G   /usr/lib/xorg/Xorg                 59MiB |
|    0   N/A  N/A      1627      G   /usr/lib/xorg/Xorg                189MiB |
|    0   N/A  N/A      1755      G   /usr/bin/gnome-shell               48MiB |
|    0   N/A  N/A      2445      G   /usr/lib/firefox/firefox          183MiB |
+-----------------------------------------------------------------------------+
I also set CUDA visible devices to 0, since nvidia-smi reports my GPU as device 0:

os.environ["CUDA_VISIBLE_DEVICES"] = "0"
After searching for the problem on the internet, all I could find was to try rebooting or to try different torch versions. I am now using 1.9.0, but I also tried 1.9.1 and 1.10.2.
Do you know where the problem may be? I think my GPU has enough computational power to do the training, or am I wrong? Thanks for any answers.