hoangthienan95 opened this issue 4 years ago (status: Open)
@hoangthienan95
Did you create your own dataset according to the COCO annotation format? That is, your annotation file should be generated with the same format as COCO's.
Did you change the directory to your own dataset?
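For reference, a minimal COCO-style annotation file has roughly this structure. This is only a sketch: the file name and all values below are made up for illustration, not taken from this issue.

```python
import json

# Minimal COCO-style annotation skeleton (illustrative values only).
coco = {
    "images": [
        {"id": 1, "file_name": "img_000001.jpg", "width": 800, "height": 600},
    ],
    "annotations": [
        {
            "id": 1,
            "image_id": 1,        # must match an id in "images"
            "category_id": 1,     # must match an id in "categories"
            "bbox": [100, 150, 200, 250],  # [x, y, width, height]
            "area": 200 * 250,
            "segmentation": [[100, 150, 300, 150, 300, 400, 100, 400]],  # polygon(s)
            "iscrowd": 0,
        },
    ],
    "categories": [
        {"id": 1, "name": "shirt, blouse"},
    ],
}

with open("annotations.json", "w") as f:
    json.dump(coco, f)
```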
@ljjyxz123 1. Yes, I did. My JSON is here: https://raw.githubusercontent.com/hoangthienan95/computer-vision-project/master/an/for_vivek_train/fashion_test_dataset_10k.json
Your image ids should be numbers; I'm pretty sure that's why you're getting the second error (right now your image ids are file names). Preferably they should be sequential starting from 1, but I'm not sure what the acceptable range of values is for pycocotools. Other things may be wrong, but that's what immediately stood out to me.
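If the existing JSON already uses file names as ids, a rough sketch of remapping them to sequential integers could look like this (assuming the standard COCO keys `images[*].id` and `annotations[*].image_id`; the file names here are placeholders for your own):

```python
import json

# Remap string image ids (e.g. file names) to sequential integers starting at 1.
with open("fashion_train_dataset_10k.json") as f:
    coco = json.load(f)

id_map = {img["id"]: new_id for new_id, img in enumerate(coco["images"], start=1)}

for img in coco["images"]:
    img["id"] = id_map[img["id"]]
for ann in coco["annotations"]:
    ann["image_id"] = id_map[ann["image_id"]]

with open("fashion_train_dataset_10k_fixed.json", "w") as f:
    json.dump(coco, f)
```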
As for the first error, apply the change in #36. Those errors are expected because the number of classes changed, and you can ignore them with the code from that issue.
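(I haven't reproduced the exact code from #36 here, but the general idea is to drop checkpoint tensors whose shapes no longer match before calling `load_state_dict`. A rough sketch in plain PyTorch, where `net` stands in for your Yolact instance:)

```python
import torch

# Keep only checkpoint tensors whose shapes still match the current model;
# class-count-dependent layers (e.g. conf_layer) are skipped and retrained.
state_dict = torch.load("weights/yolact_resnet50_54_800000.pth", map_location="cpu")
model_dict = net.state_dict()

compatible = {k: v for k, v in state_dict.items()
              if k in model_dict and v.shape == model_dict[k].shape}
print("Skipping mismatched/unknown keys:", [k for k in state_dict if k not in compatible])

model_dict.update(compatible)
net.load_state_dict(model_dict)
```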
Hi @dbolya thanks so much for the quick reply! I recreated the json file but ran into another error:
Command: python train.py --config=yolact_base_config --batch_size 1
(I had to set batch size to 1 because I only have 2 test images; otherwise I get a division-by-zero error)
Initializing weights...
Begin training!
ERROR: Unexpected segmentation fault encountered in worker.
Traceback (most recent call last):
File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 724, in _try_get_data
data = self._data_queue.get(timeout=timeout)
File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/queue.py", line 173, in get
self.not_empty.wait(remaining)
File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/threading.py", line 299, in wait
gotit = waiter.acquire(True, timeout)
File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
_error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 5348) is killed by signal: Segmentation fault.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "train.py", line 501, in <module>
train()
File "train.py", line 267, in train
for datum in data_loader:
File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 804, in __next__
idx, data = self._get_data()
File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 761, in _get_data
success, data = self._try_get_data()
File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 737, in _try_get_data
raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str))
RuntimeError: DataLoader worker (pid(s) 5348) exited unexpectedly
When I change --num_workers to 0: python train.py --config=yolact_base_config --batch_size 1 --num_workers 0
Initializing weights...
Begin training!
Segmentation fault
I'm using an AWS g3s.xlarge instance, so there should be plenty of GPU memory for one image. Do you have any insights?
You're using all the GPUs on the system with a batch size of 1. That means 1 GPU is getting 1 image and the rest are getting 0 (hence the div by 0 / segfaults). To train with all your GPUs, just change the batch size to 6-8 * num_gpus (whatever fits). Otherwise, train with one GPU by running "export CUDA_VISIBLE_DEVICES=0" (or the index of the desired GPU) first.
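For example, a quick way to pick a batch size for however many GPUs are visible (a sketch assuming ~8 images per GPU fits in memory):

```python
import torch

# Suggest a batch size of ~8 images per visible GPU, per the comment above.
num_gpus = max(torch.cuda.device_count(), 1)
print(f"python train.py --config=yolact_base_config --batch_size {8 * num_gpus}")
```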
Thank you, I did "export CUDA_VISIBLE_DEVICES=0" since my instance only has 1 GPU, and I still get errors with: python train.py --config=yolact_base_config --num_workers 0
and python train.py --config=yolact_base_config
I will try with a larger EC2 instance but I doubt that this is caused by the images being too big.
Sorry I misinterpreted your previous comment. Actually given that these seem to be issues with C (segfault, bad mallocs), maybe your environment is corrupt or something? Can you try setting up your environment again from scratch?
@dbolya I'm training my custom data with YOLACT++ (ResNet-50), but it uses a huge amount of RAM :( 32 GB of RAM is not enough for training. Do you have a way to solve this problem? I changed batch size to 4 and workers to 1, but that didn't help. Please help me :(
@oggyfaker Could you share the JSON file from your dataset with me? I ran into the second problem, similar to the above:
Error(s) in loading state_dict for Yolact:
size mismatch for maskiou_net.maskiou_net.10.weight: copying a param with shape torch.Size([80, 128, 1, 1]) from checkpoint, the shape in current model is torch.Size([1, 128, 1, 1]).
size mismatch for maskiou_net.maskiou_net.10.bias: copying a param with shape torch.Size([80]) from checkpoint, the shape in current model is torch.Size([1]).
size mismatch for prediction_layers.0.conf_layer.weight: copying a param with shape torch.Size([729, 256, 3, 3]) from checkpoint, the shape in current model is torch.Size([18, 256, 3, 3]).
size mismatch for prediction_layers.0.conf_layer.bias: copying a param with shape torch.Size([729]) from checkpoint, the shape in current model is torch.Size([18]).
size mismatch for semantic_seg_conv.weight: copying a param with shape torch.Size([80, 256, 1, 1]) from checkpoint, the shape in current model is torch.Size([1, 256, 1, 1]).
size mismatch for semantic_seg_conv.bias: copying a param with shape torch.Size([80]) from checkpoint, the shape in current model is torch.Size([1]).
Thanks a lot!
Hi, I'm pretty new to CV and DL. I'm trying to get my dataset to work with YOLACT so I can compare it with Mask R-CNN. I followed the steps in the repo, and my folder is here on GitHub with only 10 train and 1 test images as a sample. Would you mind helping point me to what I need to do?
**My `config.py` changes**
```python
IMAT_CLASSES = ('shirt, blouse', 'top, t-shirt, sweatshirt', 'sweater', 'cardigan', 'jacket', 'vest',
                'pants', 'shorts', 'skirt', 'coat', 'dress', 'jumpsuit', 'cape', 'glasses', 'hat',
                'headband, head covering, hair accessory', 'tie', 'glove', 'watch', 'belt', 'leg warmer',
                'tights, stockings', 'sock', 'shoe', 'bag, wallet', 'scarf', 'umbrella', 'hood',
                'collar', 'lapel', 'epaulette', 'sleeve', 'pocket', 'neckline', 'buckle', 'zipper',
                'applique', 'bead', 'bow', 'flower', 'fringe', 'ribbon', 'rivet', 'ruffle', 'sequin',
                'tassel')

imaterialist_dataset = dataset_base.copy({
    'name': 'imaterialist',

    'train_images': '../coco_format_train',
    'train_info': '../fashion_train_dataset_10k.json',

    'valid_images': '../coco_format_test',
    'valid_info': '../fashion_test_dataset_10k.json',

    'has_gt': True,
    'class_names': IMAT_CLASSES,
})

...

yolact_base_config = coco_base_config.copy({
    'name': 'yolact_base',

    # Dataset stuff
    'dataset': imaterialist_dataset,  # coco2017_dataset
    'num_classes': len(imaterialist_dataset.class_names) + 1,
    ...
```

Try to go from pretrained YOLACT weights on COCO:
python train.py --config=yolact_base_config --resume=weights/yolact_resnet50_54_800000.pth --start_iter=-1
**Error**

```
...
index created!
Resuming training, loading weights/yolact_resnet50_54_800000.pth...
Traceback (most recent call last):
  File "train.py", line 501, in <module>
```

Try to train from scratch:
python train.py --config=yolact_base_config
(I think this error is because of a malformatted JSON file, but I couldn't find where I went wrong...)

**Error**
```
...
index created!
Initializing weights...
Begin training!
Traceback (most recent call last):
  File "train.py", line 502, in <module>
```