rezha130 opened 6 years ago
Hi,
Can you provide small reproducer for this bug?
Sorry @BelBES, would you please explain what you mean by "small reproducer"?
FYI, this is the structure of my custom dataset:
datatrain
---- data
-------- folderA/img_filename_0.jpg
...
-------- folderB/img_filename_1.jpg
---- desc.json
And this is the structure of my custom desc.json:
{
"abc": "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz:/.",
"train": [
{
"text": "text_on_image0",
"name": "folderA/img_filename_0.jpg"
},
...
{
"text": "text_on_image1",
"name": "folderB/img_filename_1.jpg"
}
],
"test": [
{
"text": "text_on_image3",
"name": "folderC/img_filename_3.jpg"
},
...
{
"text": "text_on_image4",
"name": "folderD/img_filename_4.jpg"
}
]
}
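As a side note on dataset sanity: in CTC-based text recognition, a frequent cause of nan loss is a label containing a character that is not in the declared abc alphabet, or an image file that cannot be read. A minimal sketch of a pre-training check, assuming the desc.json layout above (check_desc is a hypothetical helper, not part of the repo):

```python
import os

def check_desc(desc, data_root):
    """Return a list of problems found in a desc.json-style dict."""
    problems = []
    abc = set(desc["abc"])
    for split in ("train", "test"):
        for sample in desc.get(split, []):
            # Characters missing from "abc" cannot be encoded as CTC
            # targets and are a common cause of nan loss.
            extra = set(sample["text"]) - abc
            if extra:
                problems.append((sample["name"], "chars not in abc", sorted(extra)))
            # The referenced image file must exist on disk.
            path = os.path.join(data_root, "data", sample["name"])
            if not os.path.isfile(path):
                problems.append((sample["name"], "missing file", path))
    return problems

# Example with an in-memory desc; in practice, json.load datatrain/desc.json.
desc = {
    "abc": "0123456789",
    "train": [{"text": "01Z", "name": "folderA/img_filename_0.jpg"}],
    "test": [],
}
for problem in check_desc(desc, "datatrain"):
    print(problem)
```

Running this over the real desc.json before training would quickly confirm whether every label is encodable and every image path resolves.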
In text_data.py, I used this syntax at line 32:
img = cv2.imread(os.path.join(self.data_path, "data", name))
But I still have the same loss: nan issue. Please help.
When I tried to debug with cuda = False (on CPU) on my dev laptop, this is the result of the loss.data[0] that causes loss: nan:
[0]:<Tensor>
_backward_hooks:None
_base:<Tensor, len() = 1>
_cdata:140460563260592
_grad:None
_grad_fn:None
_version:0
data:<Tensor>
device:device(type='cpu')
dtype:torch.float32
grad:None
grad_fn:None
is_cuda:False
is_leaf:True
is_sparse:False
layout:torch.strided
name:None
output_nr:0
Note: I set cuda = False on my CPU dev laptop, but cuda = True on the GPU server above.
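To localize where the nan first appears rather than only seeing it in the epoch summary, one option is to assert finiteness on the loss (and optionally on intermediate tensors) every step. A minimal sketch, assuming a standard training loop; assert_finite is a hypothetical helper:

```python
import torch

def assert_finite(name, t):
    # nan is the only value not equal to itself; this comparison works
    # even on PyTorch 0.4, which predates torch.isfinite.
    if (t != t).any():
        raise RuntimeError(f"nan detected in {name}")

# Example: a tensor containing nan trips the check immediately.
loss = torch.tensor([float("nan")])
try:
    assert_finite("loss", loss)
except RuntimeError as e:
    print(e)  # prints "nan detected in loss"
```

On PyTorch 1.0+ you could also enable torch.autograd.set_detect_anomaly(True) to get a traceback pointing at the operation that produced the nan during backward.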
Hi @BelBES,
I tried several batch sizes (8, 16, 32, 64, 128, 256), but training on my custom dataset always ends with loss: nan in every epoch:
python train.py --data-path datatrain --test-init True --test-epoch 10 --output-dir snapshot --abc 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz:/. --batch-size 8
I am using PyTorch 0.4, Python 3.6, a GTX 1080 Ti, and Ubuntu 16.04.
Can you help me solve this problem?
Kind regards