Closed chaitjo closed 7 years ago
Hi, thanks for trying this out! Can you try it again without overriding the remote-cmd flag? The full_task_train.py script has a default one set, which is basically luajit parlai/agents/memnn_luatorch_cpu/memnn_zmq_parsed.lua.
Basically the issue here is that the remote-command needs to include the full command to launch your agent, including the actual lua file which is being run.
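As a rough illustration of what a complete remote command looks like, here is a minimal sketch. It is not the actual code in full_task_train.py; the helper name default_remote_cmd and the AGENT_SCRIPT constant are made up for this example. It only shows the shape of the default: the luajit executable followed by the full path to the agent's lua file.

```python
import os
import shlex
import shutil

# Relative path of the lua agent inside a ParlAI checkout (taken from this thread).
AGENT_SCRIPT = os.path.join(
    "parlai", "agents", "memnn_luatorch_cpu", "memnn_zmq_parsed.lua"
)


def default_remote_cmd(parlai_home, luajit=None):
    """Hypothetical helper: build 'luajit <full path to agent script>'."""
    # Fall back to whatever luajit is on PATH when none is given explicitly.
    luajit = luajit or shutil.which("luajit")
    if luajit is None:
        raise RuntimeError("luajit not found: install torch or set --remote-cmd")
    return "{} {}".format(luajit, os.path.join(parlai_home, AGENT_SCRIPT))


cmd = default_remote_cmd("/home/chait/ParlAI", luajit="luajit")
# The first token is the program to execute, the second is the script it runs.
print(shlex.split(cmd))
```

The point of the split at the end is that the first token must be something the OS can execute, so a command consisting only of the .lua path is incomplete.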
Hey! I did the override because the last time I ran it, it told me to either install luajit or manually set remote-cmd. After removing the flag, it's running but hasn't moved beyond a specific point for more than 5 minutes. Here's what I see on my terminal:
/home/chait/torch/install/bin/luajit
luajit: ...downloads/memnnlib/KVmemnn/library/PositionalEncoder.lua:10: libmemnn.so: cannot open shared object file: No such file or directory
stack traceback:
[C]: in function 'load'
...downloads/memnnlib/KVmemnn/library/PositionalEncoder.lua:10: in main chunk
[C]: in function 'require'
...arlAI/downloads/memnnlib/KVmemnn/library/memnn_model.lua:15: in main chunk
[C]: in function 'require'
...it/ParlAI/examples/memnn_luatorch_cpu/params_default.lua:16: in main chunk
[C]: in function 'dofile'
...AI/parlai/agents/memnn_luatorch_cpu/memnn_zmq_parsed.lua:75: in main chunk
[C]: at 0x00405d50
[dict_language:english]
[download_path:/home/chait/ParlAI/downloads]
[dict_unktoken:<UNK>]
[remote_args:/home/chait/ParlAI/examples/memnn_luatorch_cpu/params_default.lua]
[numthreads:8]
[port:5555]
[dict_minfreq:0]
[num_its:100]
[dict_nulltoken:<NULL>]
[datatype:train]
[datapath:/home/chait/ParlAI/data]
[dict_max_ngram_size:-1]
[task:dialog_babi:Task:1]
[num_examples:1000]
[parlai_home:/home/chait/ParlAI]
[batchsize:1]
[remote_cmd:luajit /home/chait/ParlAI/parlai/agents/memnn_luatorch_cpu/memnn_zmq_parsed.lua]
Setting up dictionary.
[creating task(s): dialog_babi:Task:1]
[DialogTeacher initializing.]
[loading fbdialog data:/home/chait/ParlAI/data/dialog-bAbI/dialog-bAbI-tasks/dialog-babi-task1-API-calls-trn.txt]
[creating task(s): dialog_babi:Task:1]
[DialogTeacher initializing.]
[loading fbdialog data:/home/chait/ParlAI/data/dialog-bAbI/dialog-bAbI-tasks/dialog-babi-task1-API-calls-dev.txt]
Dictionary: saving dictionary to /tmp/dict.txt.
Dictionary ready, moving on to training.
python thread connected to tcp://localhost:5555
[creating task(s): dialog_babi:Task:1]
[DialogTeacher initializing.]
[loading fbdialog data:/home/chait/ParlAI/data/dialog-bAbI/dialog-bAbI-tasks/dialog-babi-task1-API-calls-trn.txt]
[DialogTeacher initializing.]
python thread connected to tcp://localhost:5562
[DialogTeacher initializing.]
python thread connected to tcp://localhost:5563
[DialogTeacher initializing.]
python thread connected to tcp://localhost:5558
[DialogTeacher initializing.]
python thread connected to tcp://localhost:5559
[DialogTeacher initializing.]
python thread connected to tcp://localhost:5557
[DialogTeacher initializing.]
python thread connected to tcp://localhost:5556
[DialogTeacher initializing.]
python thread connected to tcp://localhost:5560
[DialogTeacher initializing.]
python thread connected to tcp://localhost:5561
Is there a way to know if the model is training? (None of the 8 python processes created seem to be using any CPU resources)
Ah yes--the python threads are blocked because there was an issue launching the lua threads: at the top there, it had trouble opening a required library.
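One generic way to tell whether a child process such as the lua side is still alive is to poll it. The sketch below is not ParlAI code: child_status is a hypothetical helper, and a Python one-liner that exits with an error stands in for a luajit process that failed at startup.

```python
import subprocess
import sys


def child_status(proc):
    """Describe whether a subprocess.Popen child is still running."""
    code = proc.poll()  # None while the child is alive, else its exit code
    if code is None:
        return "running"
    return "exited with code {}".format(code)


# Stand-in for a luajit child that dies while loading a required library.
proc = subprocess.Popen([sys.executable, "-c", "import sys; sys.exit(1)"])
proc.wait(timeout=30)
status = child_status(proc)
print(status)
```

If the child has already exited while the python threads are still waiting on their sockets, nothing is training and the threads will stay blocked.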
Try cd {ParlAI}/downloads/memnnlib/KVmemnn/; ./setup.sh
After running setup.sh, verify that everything is working correctly by running luajit -e "require 'library.PositionalEncoder'". If that does not give an error, you're good.
Let me know if that works!
There seems to be some weirdness with lua package paths sometimes (which is why I'm using luajit instead of th). The current setup works on our machines, but maybe it doesn't always.
Also make sure that which luajit points to ~/torch/install/bin/luajit.
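The two checks above (which luajit, and the require test) can also be scripted. This is a minimal sketch under assumptions: check_lua_module is a hypothetical helper, and the cwd argument assumes the require path is resolved relative to the KVmemnn folder, which may not match every setup.

```python
import shutil
import subprocess


def check_lua_module(module, luajit="luajit", cwd=None):
    """Return None if luajit is not on PATH, else whether require succeeds."""
    exe = shutil.which(luajit)  # same lookup that `which luajit` performs
    if exe is None:
        return None
    result = subprocess.run(
        [exe, "-e", "require '{}'".format(module)],
        cwd=cwd,  # assumption: run from the folder containing library/
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    # luajit exits with a non-zero status when the require raises an error.
    return result.returncode == 0


print("luajit found at:", shutil.which("luajit"))
print("PositionalEncoder loads:", check_lua_module("library.PositionalEncoder"))
```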
I can't seem to run setup.sh. Here's what I get:
Installing https://raw.githubusercontent.com/torch/rocks/master/tds-scm-1.rockspec...
Using https://raw.githubusercontent.com/torch/rocks/master/tds-scm-1.rockspec... switching to 'build' mode
Cloning into 'tds'...
error: cannot expand target domain name in SRV RR
fatal: Unable to look up github.com
Error: Failed cloning git repository.
Error: resolve then run luarocks install tds
Is there a way to manually do this?
which luajit points to the correct path.
Ah yes, can you try running luarocks install tds from the command line yourself? There seems to be an issue with that part of the setup process.
Same output as setup.sh :(
I was able to run the command on my machine. It says "unable to look up github.com" for you--do you have any reason to believe you might have network or DNS issues right now? That "cannot expand target domain name in SRV RR" error brings up DNS issues on a google search (no idea how to resolve that though...).
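A quick way to separate a DNS problem from a luarocks problem is to try resolving the host directly. The helper below, can_resolve, is a hypothetical name; it only wraps the standard library's socket.getaddrinfo and does not fix the underlying network issue.

```python
import socket


def can_resolve(host, port=443):
    """Return True if the system resolver can look up `host`."""
    try:
        socket.getaddrinfo(host, port)
    except socket.gaierror:
        # Name resolution failed, e.g. a DNS or network configuration issue.
        return False
    return True
```

If can_resolve("github.com") returns False, the git clone step inside luarocks will fail no matter how it is invoked, so the network is the thing to fix first.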
I'll try again and let you know.
For me, it seemed to fail when installing lzmq via luarocks install lzmq. Could you please explain what steps need to be done in order to get this example to work?
Hi @nirmal070125, did you get the same error as above?
I guess I'm in this bucket now too.
~/code/ParlAI> py3 examples/memnn_luatorch_cpu/full_task_train.py -t babi:task10k:1 -nt 8
Traceback (most recent call last):
File "examples/memnn_luatorch_cpu/full_task_train.py", line 117, in <module>
main()
File "examples/memnn_luatorch_cpu/full_task_train.py", line 42, in main
'or manually set --remote-cmd for this example.')
RuntimeError: Could not detect torch luajit installed: please install torch from http://torch.ch or manually set --remote-cmd for this example.
Then I added the --remote-cmd flag:
~/code/ParlAI> py3 examples/memnn_luatorch_cpu/full_task_train.py -t babi:task10k:1 -nt 8 --remote-cmd parlai/agents/memnn_luatorch_cpu/memnn_zmq_parsed.lua
[no_images:False]
[remote_cmd:parlai/agents/memnn_luatorch_cpu/memnn_zmq_parsed.lua]
[batchsize:1]
[dict_unktoken:<UNK>]
[dict_max_exs:10000]
[datapath:/Users/brian/code/ParlAI/data]
[datatype:train]
[num_examples:1000]
[dict_language:english]
[numthreads:8]
[parlai_home:/Users/brian/code/ParlAI]
[dict_minfreq:0]
[remote_args:/Users/brian/code/ParlAI/examples/memnn_luatorch_cpu/params_default.lua]
[dict_max_ngram_size:-1]
[task:babi:task10k:1]
[download_path:/Users/brian/code/ParlAI/downloads]
[dict_nulltoken:<NULL>]
[port:5555]
[num_its:100]
Setting up dictionary.
[nltk_data] Downloading package punkt to /Users/brian/nltk_data...
[nltk_data] Unzipping tokenizers/punkt.zip.
[creating task(s): babi:task10k:1]
[DialogTeacher initializing.]
[loading fbdialog data:/Users/brian/code/ParlAI/data/bAbI/tasks_1-20_v1-2/en-valid-10k-nosf/qa1_train.txt]
Dictionary building on train:ordered data.
[creating task(s): babi:task10k:1]
[DialogTeacher initializing.]
[loading fbdialog data:/Users/brian/code/ParlAI/data/bAbI/tasks_1-20_v1-2/en-valid-10k-nosf/qa1_valid.txt]
Dictionary building on valid data.
Dictionary: saving dictionary to /tmp/dict.txt.
Dictionary ready, moving on to training.
Traceback (most recent call last):
File "examples/memnn_luatorch_cpu/full_task_train.py", line 117, in <module>
main()
File "examples/memnn_luatorch_cpu/full_task_train.py", line 84, in main
agent = ParsedRemoteAgent(opt, {'dictionary': dictionary})
File "/Users/brian/code/ParlAI/parlai/agents/remote_agent/agents.py", line 126, in __init__
super().__init__(opt, shared)
File "/Users/brian/code/ParlAI/parlai/agents/remote_agent/agents.py", line 51, in __init__
args=opt.get('remote_args', '')
File "/Users/brian/.pyenv/versions/3.5.0/lib/python3.5/subprocess.py", line 950, in __init__
restore_signals, start_new_session)
File "/Users/brian/.pyenv/versions/3.5.0/lib/python3.5/subprocess.py", line 1540, in _execute_child
raise child_exception_type(errno_num, err_msg)
PermissionError: [Errno 13] Permission denied
Then I tried running it with sudo; same error.
Tested my LuaJIT installation and that showed an error - details.
Thanks for the report @atav32!
1) Do you have torch installed and up to date?
2) Note that the remote cmd needs to include luajit (e.g. --remote-cmd "luajit parlai/agents/memnn_luatorch_cpu/memnn_zmq_parsed.lua"). The flag was just there in case you had a funky installation of torch/luajit and wanted to provide the path to your torch's luajit command.
The "permission denied" error there is because the OS tried to run:
parlai/agents/memnn_luatorch_cpu/memnn_zmq_parsed.lua
which doesn't have executable permissions (and shouldn't be run directly as a shell command, naturally).
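The same failure can be reproduced without lua at all. In this sketch (assuming a POSIX system) a temporary Python script stands in for memnn_zmq_parsed.lua, and sys.executable stands in for luajit; try_launch is a hypothetical helper, not part of ParlAI.

```python
import os
import subprocess
import sys
import tempfile


def try_launch(cmd):
    """Return 'ok' if cmd runs and exits 0, else the error name or exit code."""
    try:
        proc = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
    except (PermissionError, FileNotFoundError) as exc:
        return type(exc).__name__
    return "ok" if proc.returncode == 0 else "exit {}".format(proc.returncode)


# A script without the execute bit, standing in for the .lua agent file.
with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
    f.write("print('agent started')\n")
    script = f.name

# Launching the script path alone asks the OS to execute the file itself.
without_interpreter = try_launch([script])
# Prefixing the interpreter (luajit in the real command) executes that instead.
with_interpreter = try_launch([sys.executable, script])
os.remove(script)

print(without_interpreter, with_interpreter)
```

Running with sudo does not help in the first case, because the file has no execute bit set for anyone.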
@alexholdenmiller Thanks for the advice!
I added the luajit command to the --remote-cmd flag. Running that got me a step further and told me ZMQ wasn't found (I forgot to copy down the error log). Then I installed ZMQ (brew install zmq) and LZMQ (luarocks install lzmq) and I think everything's working.
Does this look right?
~/code/ParlAI> py3 examples/memnn_luatorch_cpu/full_task_train.py -t babi:task10k:1 -nt 8 --remote-cmd "luajit parlai/agents/memnn_luatorch_cpu/memnn_zmq_parsed.lua"
[dict_max_ngram_size:-1]
[dict_nulltoken:<NULL>]
[download_path:/Users/brian/code/ParlAI/downloads]
[numthreads:8]
[batchsize:1]
[parlai_home:/Users/brian/code/ParlAI]
[datatype:train]
[port:5555]
[num_its:100]
[dict_unktoken:<UNK>]
[no_images:False]
[task:babi:task10k:1]
[dict_max_exs:10000]
[dict_language:english]
[datapath:/Users/brian/code/ParlAI/data]
[remote_cmd:luajit parlai/agents/memnn_luatorch_cpu/memnn_zmq_parsed.lua]
[dict_minfreq:0]
[remote_args:/Users/brian/code/ParlAI/examples/memnn_luatorch_cpu/params_default.lua]
[num_examples:1000]
Setting up dictionary.
[creating task(s): babi:task10k:1]
[DialogTeacher initializing.]
[loading fbdialog data:/Users/brian/code/ParlAI/data/bAbI/tasks_1-20_v1-2/en-valid-10k-nosf/qa1_train.txt]
Dictionary building on train:ordered data.
[creating task(s): babi:task10k:1]
[DialogTeacher initializing.]
[loading fbdialog data:/Users/brian/code/ParlAI/data/bAbI/tasks_1-20_v1-2/en-valid-10k-nosf/qa1_valid.txt]
Dictionary building on valid data.
Dictionary: saving dictionary to /tmp/dict.txt.
Dictionary ready, moving on to training.
python thread connected to tcp://localhost:5555
[creating task(s): babi:task10k:1]
[DialogTeacher initializing.]
[loading fbdialog data:/Users/brian/code/ParlAI/data/bAbI/tasks_1-20_v1-2/en-valid-10k-nosf/qa1_train.txt]
luajit: ...downloads/memnnlib/KVmemnn/library/PositionalEncoder.lua:10: dlopen(libmemnn.so, 5): image not found
stack traceback:
[C]: in function 'load'
...downloads/memnnlib/KVmemnn/library/PositionalEncoder.lua:10: in main chunk
[C]: in function 'require'
...arlAI/downloads/memnnlib/KVmemnn/library/memnn_model.lua:15: in main chunk
[C]: in function 'require'
...de/ParlAI/examples/memnn_luatorch_cpu/params_default.lua:16: in main chunk
[C]: in function 'dofile'
parlai/agents/memnn_luatorch_cpu/memnn_zmq_parsed.lua:75: in main chunk
[C]: at 0x01000016c0
[DialogTeacher initializing.]
[creating task(s): babi:task10k:1]
[DialogTeacher initializing.]
[DialogTeacher initializing.]
[DialogTeacher initializing.]
[DialogTeacher initializing.]
[DialogTeacher initializing.]
[DialogTeacher initializing.]
[DialogTeacher initializing.]
[loading fbdialog data:/Users/brian/code/ParlAI/data/bAbI/tasks_1-20_v1-2/en-valid-10k-nosf/qa1_valid.txt]
[DialogTeacher initializing.]
python thread connected to tcp://localhost:5556
[ training ]
python thread connected to tcp://localhost:5559
python thread connected to tcp://localhost:5557
python thread connected to tcp://localhost:5560
python thread connected to tcp://localhost:5558
python thread connected to tcp://localhost:5561
python thread connected to tcp://localhost:5563
python thread connected to tcp://localhost:5562
If so, what should I expect next?
@alexholdenmiller I got the error solved by installing ZMQ via brew. Now, I'm also in the same state as @atav32
@atav32 the python side is all good and the main lua command ran, but it failed to access the memnn C library functions.
Can you try going to the downloads/memnnlib/KVmemnn/ folder and running ./setup.sh?
If you look at the file, you can see part of it is making the libmemnn.so file and symlinking it to your torch lib directory. Can you check that the file it creates isn't empty?
(You can do that with locate libmemnn.so to find where it put it, and cat the file to make sure you get a bunch of binary rather than an error.)
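A slightly more targeted alternative to cat is to check the file's size and leading bytes. This sketch assumes the library should start with either the ELF magic (Linux) or one of the Mach-O magics (macOS); looks_like_shared_library is a hypothetical helper, and the temporary files only simulate a good and an empty libmemnn.so.

```python
import os
import tempfile

# Leading bytes of ELF, 64-bit Mach-O, 32-bit Mach-O, and fat Mach-O files.
MAGIC_PREFIXES = (
    b"\x7fELF",
    b"\xcf\xfa\xed\xfe",
    b"\xce\xfa\xed\xfe",
    b"\xca\xfe\xba\xbe",
)


def looks_like_shared_library(path):
    """Return True if path is a non-empty file that starts with a known magic."""
    # os.path.isfile follows symlinks, so a symlinked library is handled too.
    if not os.path.isfile(path) or os.path.getsize(path) == 0:
        return False
    with open(path, "rb") as f:
        return f.read(4) in MAGIC_PREFIXES


# Simulated files: one with an ELF header, one left empty by a failed build.
tmpdir = tempfile.mkdtemp()
good = os.path.join(tmpdir, "good_libmemnn.so")
empty = os.path.join(tmpdir, "empty_libmemnn.so")
with open(good, "wb") as f:
    f.write(b"\x7fELF" + b"\x00" * 60)
open(empty, "wb").close()

good_ok = looks_like_shared_library(good)
empty_ok = looks_like_shared_library(empty)
print(good_ok, empty_ok)
```

In practice the path to pass in would be whatever locate libmemnn.so reports.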
@alexholdenmiller awesome! that worked! thanks.
@alexholdenmiller I was able to run setup.sh after switching my network. It's training now.
Hello, I am trying to run the training for the MemNN model on the bAbI dialog tasks and receiving the following error. Can someone help?
I saw this thread on Stack Overflow, but was unable to solve the problem myself.