facebookresearch / deepmask

Torch implementation of DeepMask and SharpMask

COCO working - Training own dataset, one category - process stuck at "Start training" #87

Closed chrieke closed 7 years ago

chrieke commented 7 years ago

Hi, I adjusted my own dataset to fit the COCO format; it should be correct (bbox and area are dummy values, but that shouldn't matter, right?).

Training on COCO is working for me, so everything is set up. I have only one category of polygons; the polygons are encoded as [[x0, y0, x1, y1, ...]]. I switched the number of categories in the DataSampler from 80 to 1 as described in issue #72. The images are 640x480 JPEGs.

When I start training, the JSON files are converted to .t7, but then the process seems to be stuck at "start training". Any idea what could be wrong? Thanks!
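For reference, the structure I mean is a minimal COCO-format skeleton like this (placeholder values, not my actual data):

```python
# Minimal COCO-format skeleton for a single custom category (placeholder values):
dataset = {
    "images": [
        {"id": 1, "file_name": "img_0001.jpg", "width": 640, "height": 480},
    ],
    "annotations": [
        {
            "id": 1,
            "image_id": 1,
            "category_id": 1,
            "segmentation": [[10.0, 10.0, 60.0, 10.0, 60.0, 50.0, 10.0, 50.0]],  # [[x0, y0, x1, y1, ...]]
            "bbox": [10.0, 10.0, 50.0, 40.0],  # [x, y, width, height] -- dummy values in my case
            "area": 2000.0,
            "iscrowd": 0,
        },
    ],
    "categories": [{"id": 1, "name": "my_category"}],  # single category
}
```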

gnc10 commented 7 years ago

Hi, during training one epoch of DeepMask takes around 30 minutes on an NVIDIA GTX 960, so maybe wait and see. Also, if you run `watch nvidia-smi`, it will show whether any processes are running on the GPU.

chrieke commented 7 years ago

Problem fixed; it was a dumb polygon/pixel overlap issue (the y axis was inverted) and/or some images without any segments in them. Not a single epoch finished on a p2 instance in 8 hours; now it's around one hour per epoch.
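For anyone with the same problem, the y-axis fix was along these lines (a sketch, assuming flat COCO polygons and a known image height):

```python
def flip_polygon_y(flat_poly, img_height):
    """Invert the y axis of a flat [x0, y0, x1, y1, ...] polygon."""
    out = list(flat_poly)
    out[1::2] = [img_height - y for y in out[1::2]]  # flip every y coordinate
    return out

print(flip_polygon_y([10, 20, 30, 40], 480))  # -> [10, 460, 30, 440]
```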

ps48 commented 7 years ago

Hi @chrisckr, did you use the COCO API to check the overlap and then come to that conclusion? I am facing the same issue now.

chrieke commented 7 years ago

@ps48 I prepared this Jupyter notebook to visually check the exact overlap. Just replace in_json and in_folder with your own data to see whether it exactly fits the COCO dataset format. Hope this helps! https://github.com/chrisckr/COCO_misc/blob/master/COCO_dataExploration.ipynb
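If you can't run the notebook, the core of the check is just a few lines with pycocotools (a sketch of the idea, not the notebook's exact code; in_json and in_folder stand for your own paths):

```python
from pycocotools.coco import COCO
import matplotlib.pyplot as plt
import skimage.io as io

coco = COCO('in_json')                                   # your annotation JSON
img = coco.loadImgs(coco.getImgIds()[0])[0]              # first image entry
plt.imshow(io.imread('in_folder/' + img['file_name']))   # your image folder
coco.showAnns(coco.loadAnns(coco.getAnnIds(imgIds=img['id'])))  # draw the polygons on top
plt.show()
```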

ps48 commented 7 years ago

Hey @chrisckr, thanks for the IPython code; I did similar things using the COCO API. I just wanted to know whether it is necessary to have a bounding box to run DeepMask, or whether the segmentation by itself is enough.

I overlaid the polygons and they were correct, but I'm still getting the same issue. Thanks :)

chrieke commented 7 years ago

@ps48 I haven't checked, but my guess is that you can replace "bbox": [69.64, 205.24, 61.16, 50.76] with "bbox": []; DeepMask should only use the segmentation. Getting bounding box coordinates from polygons is pretty easy; with the shapely library it would just be polygon.bounds.
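For example (a sketch; note shapely's bounds are (minx, miny, maxx, maxy), while COCO bboxes are [x, y, width, height]):

```python
from shapely.geometry import Polygon

flat = [69.64, 205.24, 130.8, 205.24, 130.8, 256.0]  # hypothetical flat [x0, y0, x1, y1, ...] polygon
points = list(zip(flat[0::2], flat[1::2]))           # pair up the x/y coordinates
minx, miny, maxx, maxy = Polygon(points).bounds
bbox = [minx, miny, maxx - minx, maxy - miny]        # convert to COCO [x, y, w, h]
print(bbox)
```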

ps48 commented 7 years ago

@chrisckr Thanks, I did that using OpenCV's contour bounding rectangle (a simple Python script, roughly like the sketch after the log below). But I still have no clue why the training is stuck at "start training".

```
~/torch/deepmask$ th train.lua
-- ignore option rundir
-- ignore option dm
-- ignore option reload
-- ignore option gpu
-- ignore option datadir
| running in directory /home/ubuntu/torch/deepmask/exps/deepmask/exp
| number of paramaters trunk: 15198016
| number of paramaters mask branch: 1608768
| number of paramaters score branch: 526337
| number of paramaters total: 17333121
| start training
```
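The script I mean was roughly this (a sketch, not my exact code; cv2.findContours returns two values in OpenCV 4.x, three in 3.x):

```python
import cv2
import numpy as np

mask = np.zeros((480, 640), dtype=np.uint8)  # hypothetical binary object mask
cv2.fillPoly(mask, [np.array([[100, 100], [200, 120], [180, 220]], dtype=np.int32)], 255)

contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
x, y, w, h = cv2.boundingRect(contours[0])   # bounding rectangle of the contour
bbox = [x, y, w, h]                          # already in COCO [x, y, width, height] order
print(bbox)
```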

chrieke commented 7 years ago

@ps48 Check that every image has at least one segmentation in the JSON file. I just remembered another thing I had to fix: I threw out polygons below a certain area threshold but forgot to remove the images that were left without any polygons afterwards. Also double-check the JSON formatting (brackets and such) and the image/segmentation IDs.
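Something along these lines catches the empty images (a sketch; 'instances.json' is a placeholder for your annotation file):

```python
import json

with open('instances.json') as f:
    coco = json.load(f)

annotated = {ann['image_id'] for ann in coco['annotations']}      # images that have annotations
empty = [img['id'] for img in coco['images'] if img['id'] not in annotated]
print('images without any annotation:', empty)                    # remove these before training
```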

ps48 commented 7 years ago

Thank you @chrisckr, I had some similar issues. 👍 It works now, but it stopped again after two epochs. I think I need to change some target value in one of the files that is hard-coded for the COCO classes, apart from the DataSampler one described in #72.

```
~/torch/deepmask$ th train.lua -batch 1
batch 1 32
-- ignore option rundir
-- ignore option dm
-- ignore option reload
-- ignore option gpu
-- ignore option datadir
| running in directory /home/ubuntu/torch/deepmask/exps/deepmask/exp,batch=1
| number of paramaters trunk: 15198016
| number of paramaters mask branch: 1608768
| number of paramaters score branch: 526337
| number of paramaters total: 17333121
| start training
[train] | epoch 00001 | s/batch 0.07 | loss: 0.54912
[train] | epoch 00002 | s/batch 0.07 | loss: 0.69407
/home/ubuntu/torch/install/bin/luajit: /home/ubuntu/torch/deepmask/trainMeters.lua:57: attempt to index local 'output' (a number value)
stack traceback:
	/home/ubuntu/torch/deepmask/trainMeters.lua:57: in function 'add'
	/home/ubuntu/torch/deepmask/TrainerDeepMask.lua:133: in function 'test'
	train.lua:118: in main chunk
	[C]: in function 'dofile'
	...untu/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
	[C]: at 0x00405d50
```

chrieke commented 7 years ago

The DataSampler one is the only variable I changed with regard to the number of classes. Check your validation dataset; maybe it still has the same problems? After every 2 epochs DeepMask attempts to test IoU and accuracy, so the next line in your output would be [test]... The error seems to come from that step (note "in function 'test'" in the traceback); the 2 training epochs worked fine.

ps48 commented 7 years ago

@chrisckr No luck; the validation JSON is made by the same script as the training JSON, so the val data should be correct. I commented out the validation line in train.lua (line 118) to try training alone, and it works, but the results are very bad.

avilash commented 7 years ago

@ps48 Did you get desirable results on your own dataset? The network trained on COCO works better on my dataset than the one I fine-tuned, although I ran it for only two epochs.