facebookresearch / deepmask

Torch implementation of DeepMask and SharpMask

T_END at character 1 when running th train.lua #107

Closed Minilei89 closed 7 years ago

Minilei89 commented 7 years ago

Found Environment variable CUDNN_PATH = /usr/local/cuda-8.0/lib64/libcudnn.so.5
-- ignore option rundir
-- ignore option dm
-- ignore option reload
-- ignore option gpu
-- ignore option datadir
| running in directory /home/lei/deepmask/exps/deepmask/exp
| number of paramaters trunk: 15198016
| number of paramaters mask branch: 1608768
| number of paramaters score branch: 526337
| number of paramaters total: 17333121
convert: data//annotations/instances_train2014.json --> .t7 [please be patient]
convert: data//annotations/instances_train2014.json --> .t7 [please be patient]
/home/lei/torch/install/bin/luajit: /home/lei/torch/install/share/lua/5.1/threads/threads.lua:183: [thread 2 callback] /home/lei/torch/install/share/lua/5.1/coco/CocoApi.lua:142: Expected value but found T_END at character 1
stack traceback:
    [C]: in function 'decode'
    /home/lei/torch/install/share/lua/5.1/coco/CocoApi.lua:142: in function 'convert'
    /home/lei/torch/install/share/lua/5.1/coco/CocoApi.lua:128: in function 'init'
    /home/lei/torch/install/share/lua/5.1/torch/init.lua:91: in function </home/lei/torch/install/share/lua/5.1/torch/init.lua:87>
    [C]: in function 'CocoApi'
    /home/lei/deepmask/DataSampler.lua:25: in function 'init'
    /home/lei/torch/install/share/lua/5.1/torch/init.lua:91: in function </home/lei/torch/install/share/lua/5.1/torch/init.lua:87>
    [C]: in function 'DataSampler'
    /home/lei/deepmask/DataLoader.lua:36: in function </home/lei/deepmask/DataLoader.lua:30>
    [C]: in function 'xpcall'
    /home/lei/torch/install/share/lua/5.1/threads/threads.lua:234: in function 'callback'
    /home/lei/torch/install/share/lua/5.1/threads/queue.lua:65: in function </home/lei/torch/install/share/lua/5.1/threads/queue.lua:41>
    [C]: in function 'pcall'
    /home/lei/torch/install/share/lua/5.1/threads/queue.lua:40: in function 'dojob'
    [string " local Queue = require 'threads.queue'..."]:13: in main chunk
stack traceback:
    [C]: in function 'error'
    /home/lei/torch/install/share/lua/5.1/threads/threads.lua:183: in function 'dojob'
    /home/lei/torch/install/share/lua/5.1/threads/threads.lua:264: in function 'synchronize'
    /home/lei/torch/install/share/lua/5.1/threads/threads.lua:142: in function 'specific'
    /home/lei/torch/install/share/lua/5.1/threads/threads.lua:125: in function 'Threads'
    /home/lei/deepmask/DataLoader.lua:40: in function 'init'
    /home/lei/torch/install/share/lua/5.1/torch/init.lua:91: in function </home/lei/torch/install/share/lua/5.1/torch/init.lua:87>
    [C]: in function 'DataLoader'
    /home/lei/deepmask/DataLoader.lua:21: in function 'create'
    train.lua:101: in main chunk
    [C]: in function 'dofile'
    .../lei/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
    [C]: at 0x00405d50

I'm currently just trying to set up the training using the MS COCO dataset and ImageNet as described in the readme, and I continue to get this error. I went into instances_train2014.json to see if there was perhaps a character out of place, but it looks to be in the appropriate format (I just downloaded the zip and extracted it into $DEEPMASK/data). Not sure what to do at this point.

Update: I've tried LuaJIT, Lua 5.1, Lua 5.2, and Lua 5.3. None of these work for training. LuaJIT is also the only one capable of running computeProposals.lua without giving an unknown object error. For hardware, I'm using an R7 1700X with a GTX 1080 Ti. For software, I'm just running Ubuntu 16.04. It seems that no matter what modifications I make to the json file, a T_END is always found at character 1. Any help would be greatly appreciated.

hhung516 commented 7 years ago

I have the exact issue as well. Ubuntu 16.04.2 w/ GTX 980Ti.

computeProposals.lua works fine.

I was following this section https://github.com/facebookresearch/deepmask#training-your-own-model and downloaded/extracted the COCO train/val 2014 dataset.

Anyone willing to share how to fix it?

Thanks in advance!

Minilei89 commented 7 years ago

Have you tried decoding the JSON file? I've tried with Python, but Python also gave me a similar error. I'm going to try with CentOS and I'll let you know if it works on another OS.
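
For reference, a decode check along these lines can be sketched in Python. This is only an illustration, not the script used in this thread: the helper name check_json_file is hypothetical, and the path in the usage note assumes the README layout under $DEEPMASK/data.

```python
import json
import os


def check_json_file(path):
    """Return (size_in_bytes, sorted top-level keys) for a JSON file.

    Raises ValueError if the file is empty or does not parse.
    """
    size = os.path.getsize(path)
    if size == 0:
        raise ValueError(f"{path} is empty (0 bytes)")
    with open(path, "r", encoding="utf-8") as f:
        # json.JSONDecodeError is a subclass of ValueError
        data = json.load(f)
    # Only a JSON object has top-level keys; anything else reports none
    keys = sorted(data) if isinstance(data, dict) else []
    return size, keys
```

Calling check_json_file("data/annotations/instances_train2014.json") loads the whole annotation file into memory, so it may take a while on the full COCO set.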

hhung516 commented 7 years ago

That's a very good point, no I haven't. Will try later this evening. I'm also thinking I'll try running on a snippet of the json file see what I get.

I did download the instances_train-val2014.zip and extracted them initially on Windows, but I just did it again on Ubuntu and got same error, so it shouldn't be related to OS I guess?

Btw, do you know how to train with other datasets, for example PASCAL? Seems they use RLE encoding for the masks instead of polygons. I read it somewhere in the issues that deepmask only works with polygons. My own dataset has binary masks and I think it's not trivial to convert it to polygons.

Minilei89 commented 7 years ago

Wait, what CPU are you using out of curiosity?

So after trying again on CentOS, I got the same error. Also, this is the first dataset I've worked on training (which has been going terribly), so there's not much I can help you with unfortunately.

hhung516 commented 7 years ago

CPU is i7 6700K, would that matter you think?

So I tried the following today, not getting anywhere still :(

Downloaded a "fake" json from https://github.com/pdollar/coco/blob/master/results/instances_val2014_fakesegm100_results.json and simply renamed it to instances_val2014.json

Used rapidjson (c++) to parse both instances_train2014.json and instances_val2014.json then save out new files.

Still the same T_END at 1 error.

At this point, I think the error shouldn't be coming from the json files.

So the error points to line 142 in CocoApi.lua, which is the following:

    data = json.decode(data); collectgarbage()

Next step I'll look into lua-cjson, as I suspect something happened to the latest release. It's strange that nobody ran into this exact issue before.

EDIT: ok forget lua-cjson, nobody touched that since 2012. I looked briefly in the comments at the beginning of CocoApi.lua, it says the json string at line 142 was constructed through coco. Will drill down from there.

Minilei89 commented 7 years ago

Yea looks like it might be the Coco API. I already posted an issue and hopefully someone addresses it. Going to try an older version of the api and see if that fixes it. Thanks for the help!

hhung516 commented 7 years ago

That's right, after executing line 142, data is empty. The issue comes from torch.CharStorage(): I did a separate test, and it wouldn't even load a simple text file. Not sure how to proceed from here.
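
An empty input would explain the message: a JSON decoder that is handed an empty string fails on the very first character. The sketch below reproduces the analogous failure with Python's json module as a stand-in (it is not lua-cjson, and the helper name decode_error_position is made up for illustration). Python counts positions from 0, whereas the error in this thread reports character 1.

```python
import json


def decode_error_position(text):
    """Return (message, position) of the JSON decode error, or None if text parses."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        return err.msg, err.pos
    return None


# Empty input: the decoder expects a value but hits end-of-input immediately
print(decode_error_position(""))          # ('Expecting value', 0)
# Valid input: no error
print(decode_error_position('{"a": 1}'))  # None
```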

EDIT: where did you post the issue? I just went over to torch7 on GitHub and looked up torch.CharStorage(); nothing major there really.

Minilei89 commented 7 years ago

Confirmed that it's a recent bug in torch7. Should start working once the bug has been fixed or you can use an older version if you don't want to wait. Thanks for the help troubleshooting!

Minilei89 commented 7 years ago

So I've managed to get th train.lua to start training by simply cloning the torch distro and then using "git checkout 5961f52a65fe33efa675f71e5c19ad8de56e8dad". From there just follow the general steps to setting up deepmask. Do not run "luarocks install torch" as this will update your torch, which currently has a bug causing CharStorage() to not load the file.

hhung516 commented 7 years ago

Thanks so much for the update! Glad you got it working.

I however took a frustrating detour. I modified CocoApi.lua to bypass torch.CharStorage() and loaded the json files directly with Lua, but ran into a memory issue when it tried to convert json to t7. Then I used the already converted t7 someone posted in #14 to get around that, only to find out that LuaJIT runs out of memory (because it is confined to a memory footprint of about 2 GB). Right now I'm on Lua 5.1 but somehow messed up my torch. Still trying to get back on the right track...

Minilei89 commented 7 years ago

How is your torch messed up exactly? If you don't mind running an older version of torch with lua 5.2, everything should work fine if you git checkout with the version I posted. I'm not exactly sure if it's actually training or not so I'd like someone else to validate my method. If you need further clarification let me know.

hhung516 commented 7 years ago

Thanks again for the follow up. I really appreciate it!

So I did check out the specific commit you mentioned, cleaned Torch and rebuilt with Lua 5.2 this time, and also reinstalled the other libraries (cutorch, cunn, cudnn, coco, image, tds, nnx and optim). Now I'm stuck on computeProposals.

dh@PC:~/deepmask$ th computeProposals.lua $DEEPMASK/pretrained/deepmask
/home/dh/torch/install/bin/lua: /home/eon/torch/install/share/lua/5.2/trepl/init.lua:389: /home/dh/torch/install/share/lua/5.2/trepl/init.lua:389: /home/dh/torch/install/share/lua/5.2/coco/MaskApi.lua:59: bad argument #1 to 'load' (string expected, got nil)
stack traceback:
    [C]: in function 'error'
    /home/eon/torch/install/share/lua/5.2/trepl/init.lua:389: in function 'require'
    computeProposals.lua:37: in main chunk
    [C]: in function 'dofile'
    .../eon/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
    [C]: in ?

I suspect the issue is related to coco not properly installed for torch, but still scratching my head...

Regarding the validation question you asked: I have to say I'm very new to all this, but would you be able to run computeProposals with your newly trained t7 on the same sample image (the dog), and see if it generates a different segmentation result (assuming you are training on the segmentation dataset)?

Anyway, would you be open to have a discussion in another channel? I could really use some pointers to get the training started. And I'd like to share what brought me to deepmask, and also to learn about the project you are working on, if it is ok.

Have a great weekend!

Minilei89 commented 7 years ago

I'm currently unable to manage the training server due to internet restrictions for about a week or two; however, I do know that Lua 5.2 will not be able to run computeProposals.lua and you will receive errors. Depending on whether you want to use the precomputed model or train your own, you will have to decide between Lua 5.2 and LuaJIT. I'd say we should try and get computeProposals.lua working for Lua 5.2, but it's probably easier to just break down the training sets into smaller sets. As for hosting the discussion on another channel, that works for me.

hhung516 commented 7 years ago

OK, interesting to know computeProposals is broken under Lua 5.2. However, I couldn't run the training either; I'm getting similar errors in both situations, due to a messed-up torch I think, even after I manually removed everything and started from scratch again.

I got the error "module 'coco' not found:No LuaRocks module found for coco" for computeProposals, and "module 'inn' not found:No LuaRocks module found for inn" when doing the training.

One thing I haven't been able to try is to reboot the machine, for which I need physical access to it tomorrow. I'll probably try reinstalling Ubuntu as well during the week, likely end up spending less time overall. Will keep you posted on my progress.

Please kindly drop me a message via email and/or skype, and let's get the discussion going.

yeates commented 7 years ago

Hi @Minilei89, I looked through your method, but I don't understand it: does "git checkout 5961f52a65fe33efa675f71e5c19ad8de56e8dad" mean just running cd ~/torch and then git checkout xxx, or do I need to install anything afterwards? I tried just git checkout xxx, but the problem still happened:

convert: ./data/annotations/instances_val2014.json --> .t7 [please be patient]  
/home/xavier/torch/install/bin/luajit: /home/xavier/torch/install/share/lua/5.1/coco/CocoApi.lua:142: Expected value but found T_END at character 1
stack traceback:
    [C]: in function 'decode'
    /home/xavier/torch/install/share/lua/5.1/coco/CocoApi.lua:142: in function '__convert'
    /home/xavier/torch/install/share/lua/5.1/coco/CocoApi.lua:128: in function '__init'
    /home/xavier/torch/install/share/lua/5.1/torch/init.lua:91: in function </home/xavier/torch/install/share/lua/5.1/torch/init.lua:87>
    [C]: in function 'CocoApi'
    ./loaders/loader.lua:46: in function 'DataLoader'
    /home/xavier/multipathnet/DataSetJSON.lua:44: in function 'create'
    demo.lua:92: in main chunk
    [C]: in function 'dofile'
    ...vier/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
    [C]: at 0x00405d50

Minilei89 commented 7 years ago

I've decided to leave a simple guide so if there are still questions lemme know and I'll make edits:

  1. You are going to want to install torch with Lua 5.2 to get around the "not enough memory" error that occurs when running LuaJIT. To do this, run:

    git clone https://github.com/torch/distro.git ~/torch --recursive
    cd ~/torch
    git checkout 5961f52a65fe33efa675f71e5c19ad8de56e8dad

Once you have changed to the older version, run the following, which will clean out the previous files, install the dependencies, and build Torch with Lua 5.2:

    ./clean.sh
    bash install-deps
    TORCH_LUA_VERSION=LUA52 ./install.sh

  2. Once this is finished, check that the CharStorage() object correctly loads the data from its argument. Create a text file called "test.txt" by running:

    echo "Hello World" > test.txt

Then run "lua" and type the following:

    > require 'torch'
    > x = torch.CharStorage('test.txt')
    > = x

If CharStorage() is not bugged it will return:

    72 101 108 108 111 32 87 111 114 108 100 10
    [torch.CharStorage of size 12]

  3. From here you will need to install the luarocks packages. Run the following:

    luarocks install cudnn
    luarocks install cunn
    luarocks install cutorch
    luarocks install inn
    luarocks install optim
    luarocks install nnx
    luarocks install tds
    luarocks install image

Finally, clone the MS COCO API and cd into it. Run:

    luarocks make LuaAPI/rocks/coco-scm-1.rockspec coco/

  4. Run lua again and make sure that CharStorage() is still not broken. If it still works, cd into deepmask and try to run the training and evaluation.

Best of luck!
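
The numbers expected from the CharStorage() check in the guide above are simply the byte values of the file's contents, including the trailing newline that echo appends. A small Python sketch (a stand-in for illustration, not Torch; expected_storage_bytes is a hypothetical helper) shows where they come from:

```python
def expected_storage_bytes(text):
    """Return the byte values that a byte-wise load of `text` should contain."""
    return list(text.encode("ascii"))


# echo "Hello World" > test.txt writes the text followed by a newline
values = expected_storage_bytes("Hello World\n")
print(values)       # [72, 101, 108, 108, 111, 32, 87, 111, 114, 108, 100, 10]
print(len(values))  # 12
```

A storage of size 0, by contrast, means nothing was read from the file at all.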

yeates commented 7 years ago

Thanks a lot! Now I'm stuck at step two. What should I do if that check fails?

xavier@xavier-ThundeRobot:~$ cat hello.txt 
Hello World
xavier@xavier-ThundeRobot:~$ lua
Lua 5.2.3  Copyright (C) 1994-2013 Lua.org, PUC-Rio
> require 'torch'
> x = torch.CharStorage('hello.txt')
> =x
[torch.CharStorage of size 0]

Minilei89 commented 7 years ago

You will need to do a clean reinstall of torch. Did you perhaps run "luarocks install torch" at any point? I've done that before, and it caused the CharStorage() bug to appear.

To remove torch, do "rm -rf ~/torch"

caxton commented 7 years ago

Hi @Minilei89, why do you recommend installing the dependencies needed for Torch with Lua 5.2, given that computeProposals is broken under Lua 5.2? Should we just reinstall Torch using "./install.sh" instead?

Minilei89 commented 7 years ago

If you wish to do training, you may have memory issues when using LuaJIT, as it is limited to 2 GB of memory. You do not, however, need to use Lua 5.2. It's best to give LuaJIT a try first, and only resort to Lua 5.2 if it doesn't work.

Edit: The solution I've settled on for this problem is to use Lua 5.2 for training and LuaJIT for running computeProposals.lua. hhung516 also found that LuaJIT training will run if you decrease the batch size to 8. Best of luck!