Closed newforrestgump001 closed 5 years ago
Hi, I updated your comment to make it easier to read. The error message appears because your labels are out of range. The Cityscapes data needs to be preprocessed before use, using their API (which you can access here), to put all labels in the 0-19 range. The mapping for each label is user-defined and can be found in this script of their API. I usually replace the trainIds 255 and -1 with 19 to get a consistent label set that cross-entropy can handle.
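As a rough illustration of the remapping described above, here is a minimal NumPy sketch. The function name remap_ignore_labels and the constant IGNORE_ID are hypothetical, and the sketch operates on an in-memory array rather than on the Cityscapes API itself, where the mapping is normally changed in the label definitions before the label images are generated.

```python
import numpy as np

IGNORE_ID = 19  # hypothetical choice: one extra "ignore" class after the 19 real ones


def remap_ignore_labels(train_ids, ignore_id=IGNORE_ID):
    """Return a copy in which the trainIds 255 and -1 are replaced by ignore_id."""
    # Cast to a signed integer type so that -1 is representable
    out = np.asarray(train_ids).astype(np.int64)
    out[(out == 255) | (out == -1)] = ignore_id
    return out


# Tiny stand-in for a label image containing both "ignore" markers
label = np.array([[0, 18, 255], [-1, 7, 3]])
remapped = remap_ignore_labels(label)
print(np.unique(remapped))
```

After the remapping every value lies in the 0-19 range, which is what the loss expects when the model has 20 output classes.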
I got it, sincere thanks for your quick reply! Jack
@tano297 I want to train the segmentation model with my custom dataset (only 1 class, foreground and background), but I didn't know where to start, so I used Cityscapes first as a trial. What should I modify in train.py to make it work? Thank you a lot! Jack
If you want to check out a dataset that already does a foreground vs. background sort of segmentation, I suggest you have a look at the people segmentation model. No modifications should be necessary for the train.py file; all you may need to do is modify its parser so that the __getitem__ function accesses your data in the proper way, and adapt the corresponding configuration file.
It is my own dataset, which I have already annotated. In it, (0,255,0,64) stands for foreground and (0,0,0,0) stands for background. I will give it a try now, thank you for your help! Jack
Then your __getitem__ function should know how to open these images and labels, and use some numpy magic to get from color images containing (0,0,0,0) or (0,255,0,64) to monochrome images that contain 0 or 1, so that the loss function can be evaluated.
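The "numpy magic" could look roughly like the sketch below. It assumes the label has already been loaded as an (H, W, 4) RGBA array; the function name rgba_to_class_ids and the constant FOREGROUND_RGBA are hypothetical and not part of the framework's parser.

```python
import numpy as np

FOREGROUND_RGBA = (0, 255, 0, 64)  # foreground color quoted in this thread


def rgba_to_class_ids(rgba):
    """Map an (H, W, 4) RGBA label image to an (H, W) array of 0/1 class ids."""
    rgba = np.asarray(rgba)
    # A pixel counts as foreground only if all four channels match
    mask = np.all(rgba == np.array(FOREGROUND_RGBA), axis=-1)
    return mask.astype(np.int64)


# Tiny 2x2 stand-in for a label image that was opened and converted to an array
demo = np.zeros((2, 2, 4), dtype=np.uint8)
demo[0, 1] = FOREGROUND_RGBA
demo[1, 0] = FOREGROUND_RGBA
class_ids = rgba_to_class_ids(demo)
print(class_ids)
```

Everything that is not exactly the foreground color becomes 0 here. If the annotations contain anti-aliased edges, a looser test (for example on the alpha channel only) may be a better fit.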
@tano297 Thanks for your team's great work! What is the schedule for the detection and instance segmentation parts?
Thank you for the thanks :) It requires quite some work to maintain all the repos in our group, and most of the time it goes unacknowledged.
Detection is currently being worked on, so probably in October/November we will have some fast models that are easy to train, which is the whole point of the framework. We're not going for state of the art, but rather for well-tested archs that provide robust results, are easy to train, and are fast at inference.
In terms of instance segmentation, there is currently no "standard" way to do it, so we are trying to figure out the best architecture we can come up with that makes inference fast for robotics, keeps the code easy to maintain, and still has good accuracy. We have some work on panoptic segmentation coming, so I think that this is the code that will be released, rather than our ICRA2019 paper, which is accurate, but not easy to implement and maintain in the long term (too many components, and too architecture/dataset dependent).
@tano297 I got it, waiting for them patiently! Bonnet frameworks including tf-based and pytorch-based are really awesome!
@tano297 Sorry to trouble you! I have tried again and again and the error still appears: RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED. I have attached the cfg and sample images in the mail to amilioto@uni-bonn.de. Could you help me out? I'm not sure whether there is a problem with my environment. Thanks! Jack
Hi,
If you load that data and tap into the parser, inserting print(np.unique(np.array(label))), which gives you the unique values included in the label, you get [ 0 255]. This should be [0 1]. This is because your image contains 0 and 255 rather than 0 and 1 in the png label.
If I add label = (np.array(label) > 0).astype(np.int) right before calling tensorize_lbl (for example, in line 134), then it works. I was able to make it train with the data you sent.
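A standalone version of that check and fix is sketched below on a small stand-in array. One caveat: the np.int alias was removed in NumPy 1.24, so on recent NumPy versions the cast needs an explicit type such as np.int64, which is what the sketch uses.

```python
import numpy as np

# Stand-in for a label read from a PNG whose foreground is stored as 255
label = np.array([[0, 255], [255, 0]], dtype=np.uint8)
print(np.unique(label))   # unique values before the fix

# Same idea as the one-liner above, with np.int64 in place of the removed np.int
binary = (np.array(label) > 0).astype(np.int64)
print(np.unique(binary))  # unique values after the fix
```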
If this works for you, feel free to close this issue.
@tano297 It really works. Sincere thanks!
Glad to hear that. I'd like to thank you for sending me that data and the cfg file the way you did. It took me 2 minutes to reproduce the error and find how to help you. If more people did it like that, maintaining these packages would be significantly easier.
@tano297 Thank you very much for helping me solve the problem. If a new problem appears, I will send the sample data and cfg to save your time. Thanks, Jack
@tano297 @newforrestgump001 I am hitting the same error but do not seem to be able to solve it. I have changed the labels and preprocessed the label files (changed labels.py and ran python createTrainIdLabelImgs.py), but the code still exits before completing:
File ../../tasks/segmentation/modules/trainer.py, line 488, in train_epoch loss.backward()
Do you have any idea what I could do to solve this issue?
My labels.py file in cityscapes:
labels = [
# name id trainId category catId hasInstances ignoreInEval color
Label( 'unlabeled' , 0 , 19 , 'void' , 0 , False , True , ( 0, 0, 0) ),
Label( 'ego vehicle' , 1 , 19 , 'void' , 0 , False , True , ( 0, 0, 0) ),
Label( 'rectification border' , 2 , 19 , 'void' , 0 , False , True , ( 0, 0, 0) ),
Label( 'out of roi' , 3 , 19 , 'void' , 0 , False , True , ( 0, 0, 0) ),
Label( 'static' , 4 , 19 , 'void' , 0 , False , True , ( 0, 0, 0) ),
Label( 'dynamic' , 5 , 19 , 'void' , 0 , False , True , (111, 74, 0) ),
Label( 'ground' , 6 , 19 , 'void' , 0 , False , True , ( 81, 0, 81) ),
Label( 'road' , 7 , 0 , 'flat' , 1 , False , False , (128, 64,128) ),
Label( 'sidewalk' , 8 , 1 , 'flat' , 1 , False , False , (244, 35,232) ),
Label( 'parking' , 9 , 19 , 'flat' , 1 , False , True , (250,170,160) ),
Label( 'rail track' , 10 , 19 , 'flat' , 1 , False , True , (230,150,140) ),
Label( 'building' , 11 , 2 , 'construction' , 2 , False , False , ( 70, 70, 70) ),
Label( 'wall' , 12 , 3 , 'construction' , 2 , False , False , (102,102,156) ),
Label( 'fence' , 13 , 4 , 'construction' , 2 , False , False , (190,153,153) ),
Label( 'guard rail' , 14 , 19 , 'construction' , 2 , False , True , (180,165,180) ),
Label( 'bridge' , 15 , 19 , 'construction' , 2 , False , True , (150,100,100) ),
Label( 'tunnel' , 16 , 19 , 'construction' , 2 , False , True , (150,120, 90) ),
Label( 'pole' , 17 , 5 , 'object' , 3 , False , False , (153,153,153) ),
Label( 'polegroup' , 18 , 19 , 'object' , 3 , False , True , (153,153,153) ),
Label( 'traffic light' , 19 , 6 , 'object' , 3 , False , False , (250,170, 30) ),
Label( 'traffic sign' , 20 , 7 , 'object' , 3 , False , False , (220,220, 0) ),
Label( 'vegetation' , 21 , 8 , 'nature' , 4 , False , False , (107,142, 35) ),
Label( 'terrain' , 22 , 9 , 'nature' , 4 , False , False , (152,251,152) ),
Label( 'sky' , 23 , 10 , 'sky' , 5 , False , False , ( 70,130,180) ),
Label( 'person' , 24 , 11 , 'human' , 6 , True , False , (220, 20, 60) ),
Label( 'rider' , 25 , 12 , 'human' , 6 , True , False , (255, 0, 0) ),
Label( 'car' , 26 , 13 , 'vehicle' , 7 , True , False , ( 0, 0,142) ),
Label( 'truck' , 27 , 14 , 'vehicle' , 7 , True , False , ( 0, 0, 70) ),
Label( 'bus' , 28 , 15 , 'vehicle' , 7 , True , False , ( 0, 60,100) ),
Label( 'caravan' , 29 , 19 , 'vehicle' , 7 , True , True , ( 0, 0, 90) ),
Label( 'trailer' , 30 , 19 , 'vehicle' , 7 , True , True , ( 0, 0,110) ),
Label( 'train' , 31 , 16 , 'vehicle' , 7 , True , False , ( 0, 80,100) ),
Label( 'motorcycle' , 32 , 17 , 'vehicle' , 7 , True , False , ( 0, 0,230) ),
Label( 'bicycle' , 33 , 18 , 'vehicle' , 7 , True , False , (119, 11, 32) ),
Label( 'license plate' , -1 , 19 , 'vehicle' , 7 , False , True , ( 0, 0,142) ),
]
Traceback:
./train.py -c ~/bonnetal/train/tasks/segmentation/config/cityscapes/ERFNet.yaml -l ~/bonnetal/train/tasks/segmentation/log1
/home/cris/.local/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:516: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint8 = np.dtype([("qint8", np.int8, 1)])
/home/cris/.local/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:517: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/home/cris/.local/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:518: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint16 = np.dtype([("qint16", np.int16, 1)])
/home/cris/.local/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:519: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/home/cris/.local/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:520: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint32 = np.dtype([("qint32", np.int32, 1)])
/home/cris/.local/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:525: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
np_resource = np.dtype([("resource", np.ubyte, 1)])
/home/cris/.local/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:541: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint8 = np.dtype([("qint8", np.int8, 1)])
/home/cris/.local/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:542: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/home/cris/.local/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:543: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint16 = np.dtype([("qint16", np.int16, 1)])
/home/cris/.local/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:544: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/home/cris/.local/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:545: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint32 = np.dtype([("qint32", np.int32, 1)])
/home/cris/.local/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:550: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
np_resource = np.dtype([("resource", np.ubyte, 1)])
----------
INTERFACE:
config yaml: /home/cris/bonnetal/train/tasks/segmentation/config/cityscapes/ERFNet.yaml
log dir /home/cris/bonnetal/train/tasks/segmentation/log1
model path None
eval only False
No batchnorm False
----------
Commit hash (training version): b'5368eed'
----------
Opening config file /home/cris/bonnetal/train/tasks/segmentation/config/cityscapes/ERFNet.yaml
No pretrained directory found.
Copying files to /home/cris/bonnetal/train/tasks/segmentation/log1 for further reference.
WARNING:tensorflow:From ../../common/logger.py:16: The name tf.summary.FileWriter is deprecated. Please use tf.compat.v1.summary.FileWriter instead.
Images from: ~/bonnetal/cityscapes/leftImg8bit/train
Labels from: ~/bonnetal/cityscapes/gtFine/train
LENGTH 2975 2975
Inference batch size: 4
Images from: ~/bonnetal/cityscapes/leftImg8bit/val
Labels from: ~/bonnetal/cityscapes/gtFine/val
LENGTH 500 500
Original OS: 8
New OS: 8
Trying to get backbone weights online from Bonnetal server.
Using pretrained weights from bonnetal server for backbone
OS: 1 , channels: 16
OS: 2 , channels: 16
OS: 4 , channels: 64
[Decoder] os: 4 in: 128 skip: 64 out: 64
[Decoder] os: 2 in: 64 skip: 16 out: 16
[Decoder] os: 1 in: 16 skip: 3 out: 16
Using normalized weights as bias for head.
No path to pretrained, using bonnetal Imagenet backbone weights and random decoder.
Total number of parameters: 2252148
Total number of parameters requires_grad: 2252148
Param encoder 1913168
Param decoder 338640
Param head 340
Training in device: cuda
/home/cris/.local/lib/python3.6/site-packages/torch/optim/lr_scheduler.py:100: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`. Failure to do this will result in PyTorch skipping the first value of the learning rate schedule.See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
"https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
Ignoring class 19 in IoU evaluation
[IOU EVAL] IGNORE: tensor([19])
[IOU EVAL] INCLUDE: tensor([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
18])
Let's see if it finishes this
/pytorch/aten/src/THCUNN/SpatialClassNLLCriterion.cu:104: void cunn_SpatialClassNLLCriterion_updateOutput_kernel(T *, T *, T *, long *, T *, int, int, int, int, int, long) [with T = float, AccumT = float]: block: [1,0,0], thread: [576,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/SpatialClassNLLCriterion.cu:104: void cunn_SpatialClassNLLCriterion_updateOutput_kernel(T *, T *, T *, long *, T *, int, int, int, int, int, long) [with T = float, AccumT = float]: block: [1,0,0], thread: [577,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/SpatialClassNLLCriterion.cu:104: void cunn_SpatialClassNLLCriterion_updateOutput_kernel(T *, T *, T *, long *, T *, int, int, int, int, int, long) [with T = float, AccumT = float]: block: [1,0,0], thread: [578,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/SpatialClassNLLCriterion.cu:104: void cunn_SpatialClassNLLCriterion_updateOutput_kernel(T *, T *, T *, long *, T *, int, int, int, int, int, long) [with T = float, AccumT = float]: block: [1,0,0], thread: [579,0,0] Assertion `t >= 0 && t < n_classes` failed.
Traceback (most recent call last):
File "./train.py", line 117, in <module>
trainer.train()
File "../../tasks/segmentation/modules/trainer.py", line 302, in train
scheduler=self.scheduler)
File "../../tasks/segmentation/modules/trainer.py", line 488, in train_epoch
loss.backward()
File "/home/cris/.local/lib/python3.6/site-packages/torch/tensor.py", line 166, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/home/cris/.local/lib/python3.6/site-packages/torch/autograd/__init__.py", line 99, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED
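The assertion `t >= 0 && t < n_classes` earlier in this log indicates that at least one target pixel lies outside the range the loss expects. The cuDNN error at the end is then probably a follow-on effect of that failed device-side assertion rather than an environment problem, although this is an inference from the log and not something confirmed in the thread. A minimal sketch for checking a label array before it reaches the loss is given below; the function name out_of_range_values is hypothetical, and N_CLASSES = 20 is an assumption based on the 0-19 range discussed above.

```python
import numpy as np

N_CLASSES = 20  # assumption: 19 Cityscapes train classes plus the ignore class 19


def out_of_range_values(label, n_classes=N_CLASSES):
    """Return the sorted unique label values that fall outside [0, n_classes)."""
    values = np.unique(np.asarray(label).astype(np.int64))
    return values[(values < 0) | (values >= n_classes)].tolist()


good = np.array([[0, 5], [19, 18]])
bad = np.array([[0, 33], [255, 7]])  # e.g. raw ids or an unmapped 255

print(out_of_range_values(good))
print(out_of_range_values(bad))
```

Calling such a check inside the parser, right after a label is loaded, would show which values are offending. One possibility worth ruling out is that the parser picks up label images generated before labels.py was edited, or images that store the raw ids instead of the trainIds.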