MVIG-SJTU / AlphaPose

Real-Time and Accurate Full-Body Multi-Person Pose Estimation&Tracking System
http://mvig.org/research/alphapose.html
Other
8.01k stars 1.97k forks

Error in Lua while training #378

Closed: Sentient07 closed this issue 5 years ago

Sentient07 commented 5 years ago

Hello,

First, I would like to thank you for open-sourcing the code. I have been trying to retrain the model (the Torch code) on the COCO and MPII datasets. Training runs fine for 15 iterations, after which it fails with a file-not-found error. From what I understand, the code is looking for a file whose name is in an encoding my system does not recognize. My traceback is attached below:

Training error

root@8bbb2c22127d:~/AlphaPose/train/src# th main.lua -expID coco1 -nGPU 2 -trainBatch 4 -validBatch 16 -nEpochs 40
Saving everything to: /root/AlphaPose/train/exp/coco/coco1
Input is a vector of length: 3
Output is a table
         Entry 1 is a tensor with dimensions: 33 x 80 x 64
         Entry 2 is a tensor with dimensions: 33 x 80 x 64
         Entry 3 is a tensor with dimensions: 33 x 80 x 64
         Entry 4 is a tensor with dimensions: 33 x 80 x 64
         Entry 5 is a tensor with dimensions: 33 x 80 x 64
         Entry 6 is a tensor with dimensions: 33 x 80 x 64
         Entry 7 is a tensor with dimensions: 33 x 80 x 64
         Entry 8 is a tensor with dimensions: 33 x 80 x 64
==> Creating model from file: models/hg-prm.lua
==> Converting module to nn.DataParallelTable
warning: could not load nccl, falling back to default communication
==> Converting model to CUDA
==> Starting epoch: 1/40
/root/torch/install/bin/luajit: /root/torch/install/share/lua/5.1/threads/threads.lua:183: [thread 3 callback] /root/torch/install/share/lua/5.1/image/init.lua:367: /root/AlphaPose/train/data/coco/images/COCO_train2014_000000044788.jpgueue:addjA: No such file or directory
stack traceback:
        [C]: in function 'error'
        /root/torch/install/share/lua/5.1/image/init.lua:367: in function 'loadImage'
        /root/AlphaPose/train/src/util/pose.lua:57: in function 'generateSample'
        /root/AlphaPose/train/src/util/pose.lua:286: in function 'loadData'
        /root/AlphaPose/train/src/util/dataloader.lua:77: in function </root/AlphaPose/train/src/util/dataloader.lua:76>
        [C]: in function 'xpcall'
        /root/torch/install/share/lua/5.1/threads/threads.lua:234: in function 'callback'
        /root/torch/install/share/lua/5.1/threads/queue.lua:65: in function </root/torch/install/share/lua/5.1/threads/queue.lua:41>
        [C]: in function 'pcall'
        /root/torch/install/share/lua/5.1/threads/queue.lua:40: in function 'dojob'
        [string "  local Queue = require 'threads.queue'..."]:15: in main chunk
stack traceback:
        [C]: in function 'error'
        /root/torch/install/share/lua/5.1/threads/threads.lua:183: in function 'dojob'
        /root/AlphaPose/train/src/util/dataloader.lua:90: in function '(for generator)'
        /root/AlphaPose/train/src/train.lua:38: in function 'step'
        /root/AlphaPose/train/src/train.lua:164: in function 'train'
        main.lua:19: in main chunk
        [C]: in function 'dofile'
        /root/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
        [C]: at 0x00405d50

As you can see, the code is looking for COCO_train2014_000000044788.jpgueue:addjA, which is obviously not a valid file name. My knowledge of Lua is limited and I have not been able to debug this error. Any help would be much appreciated.

Thanks and Cheers,
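One possible reading of the garbled name, offered as an assumption since the traceback does not show how the path is built: if image names are stored as fixed-width character arrays and a name fills its slot exactly, there is no terminating zero byte, so a C-style "read until NUL" continues into whatever bytes follow in memory. The trailing `ueue:addjA` may be such a fragment of unrelated memory rather than an encoding issue. The Python sketch below simulates this with a hypothetical flat byte buffer; `read_c_string` and `read_fixed_width` are illustrative helpers, not AlphaPose or Torch functions.

```python
# Hypothetical simulation of reading a fixed-width name field.
# Assumption: the name slot is exactly as wide as the name, so no NUL
# byte terminates it, and unrelated bytes follow it in memory.

NAME = b"COCO_train2014_000000044788.jpg"
WIDTH = len(NAME)  # no room left in the slot for a terminating NUL

# Flat "memory": the name slot followed by unrelated bytes.
memory = NAME + b"ueue:addjA\x00more unrelated data"


def read_c_string(buf: bytes, offset: int) -> str:
    """Read until the first NUL byte, the way a C string is read."""
    end = buf.index(b"\x00", offset)
    return buf[offset:end].decode("latin-1")


def read_fixed_width(buf: bytes, offset: int, width: int) -> str:
    """Read at most `width` bytes, stopping early at a NUL if present."""
    field = buf[offset:offset + width]
    return field.split(b"\x00", 1)[0].decode("latin-1")


print(read_c_string(memory, 0))            # COCO_train2014_000000044788.jpgueue:addjA
print(read_fixed_width(memory, 0, WIDTH))  # COCO_train2014_000000044788.jpg
```

In LuaJIT the analogous distinction is `ffi.string(ptr)`, which assumes a zero-terminated string, versus `ffi.string(ptr, len)`, which reads exactly `len` bytes. Whether the AlphaPose loader uses either form is not shown here and should be checked against the fix linked below.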

Sentient07 commented 5 years ago

Thanks to a suggestion from @Alan-Woo, this comment fixed it for me: https://github.com/MVIG-SJTU/AlphaPose/issues/68#issuecomment-397922448