facebookresearch / ParlAI

A framework for training and evaluating AI models on a variety of openly available dialogue datasets.
https://parl.ai
MIT License

Problems running the MemNN model #58

Status: Closed (chaitjo closed this 7 years ago)

chaitjo commented 7 years ago

Hello, I am trying to run the training for the MemNN model on the bAbI dialog tasks and receiving the following error. Can someone help?

(parlai) chait@chait:~/ParlAI/examples$ python memnn_luatorch_cpu/full_task_train.py --remote-cmd ~/torch/ -t dialog_babi:Task:1 -nt 8
[port:5555]
[datapath:/home/chait/ParlAI/data]
[parlai_home:/home/chait/ParlAI]
[datatype:train]
[download_path:/home/chait/ParlAI/downloads]
[numthreads:8]
[task:dialog_babi:Task:1]
[dict_language:english]
[num_examples:1000]
[remote_cmd:/home/chait/torch/]
[dict_nulltoken:<NULL>]
[dict_unktoken:<UNK>]
[remote_args:/home/chait/ParlAI/examples/memnn_luatorch_cpu/params_default.lua]
[dict_minfreq:0]
[num_its:100]
[batchsize:1]
[dict_max_ngram_size:-1]
Setting up dictionary.
[creating task(s): dialog_babi:Task:1]
[DialogTeacher initializing.]
[loading fbdialog data:/home/chait/ParlAI/data/dialog-bAbI/dialog-bAbI-tasks/dialog-babi-task1-API-calls-trn.txt]
[creating task(s): dialog_babi:Task:1]
[DialogTeacher initializing.]
[loading fbdialog data:/home/chait/ParlAI/data/dialog-bAbI/dialog-bAbI-tasks/dialog-babi-task1-API-calls-dev.txt]
Dictionary: saving dictionary to /tmp/dict.txt.
Dictionary ready, moving on to training.
Traceback (most recent call last):
  File "memnn_luatorch_cpu/full_task_train.py", line 104, in <module>
    main()
  File "memnn_luatorch_cpu/full_task_train.py", line 72, in main
    agent = ParsedRemoteAgent(opt, {'dictionary': dictionary})
  File "/home/chait/ParlAI/parlai/agents/remote_agent/agents.py", line 123, in __init__
    super().__init__(opt, shared)
  File "/home/chait/ParlAI/parlai/agents/remote_agent/agents.py", line 51, in __init__
    args=opt.get('remote_args', '')
  File "/home/chait/anaconda3/envs/parlai/lib/python3.5/subprocess.py", line 676, in __init__
    restore_signals, start_new_session)
  File "/home/chait/anaconda3/envs/parlai/lib/python3.5/subprocess.py", line 1282, in _execute_child
    raise child_exception_type(errno_num, err_msg)
PermissionError: [Errno 13] Permission denied

I saw this thread on Stack Overflow, but was unable to solve the problem myself.

alexholdenmiller commented 7 years ago

Hi, thanks for trying this out! Can you try it again without overriding the remote-cmd flag? The full_task_train.py has a default one set, which is basically luajit parlai/agents/memnn_luatorch_cpu/memnn_zmq_parsed.lua.

Basically the issue here is that the remote-command needs to include the full command to launch your agent, including the actual lua file which is being run.
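A quick way to catch this before launching is to check that the first token of the command is something the OS can actually execute. The sketch below is only an illustration: `check_remote_cmd` is a hypothetical helper, not part of ParlAI, and it assumes the command string is split on whitespace with the first token treated as the program.

```python
import os
import shlex
import shutil


def check_remote_cmd(remote_cmd):
    """Return (ok, message) for a --remote-cmd style string.

    Hypothetical pre-flight check, not a ParlAI API.
    """
    parts = shlex.split(remote_cmd)
    if not parts:
        return False, "empty command"
    program = os.path.expanduser(parts[0])
    # A directory such as ~/torch/ can never be executed.
    if os.path.isdir(program):
        return False, "first token is a directory, not a program: " + program
    # shutil.which returns None for names that are missing or not executable.
    if shutil.which(program) is None:
        return False, "first token is not an executable: " + program
    return True, "ok"
```

With this check, a value like `~/torch/` or a bare path to a `.lua` file is rejected, while `luajit path/to/agent.lua` passes as long as `luajit` is on the PATH.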

chaitjo commented 7 years ago

Hey! I did the override because the last time I ran it, it told me to either install luajit or manually set remote-cmd. After removing the flag, it's running but hasn't moved beyond a specific point for >5 mins. Here's what I see on my terminal:

/home/chait/torch/install/bin/luajit
luajit: ...downloads/memnnlib/KVmemnn/library/PositionalEncoder.lua:10: libmemnn.so: cannot open shared object file: No such file or directory
stack traceback:
    [C]: in function 'load'
    ...downloads/memnnlib/KVmemnn/library/PositionalEncoder.lua:10: in main chunk
    [C]: in function 'require'
    ...arlAI/downloads/memnnlib/KVmemnn/library/memnn_model.lua:15: in main chunk
    [C]: in function 'require'
    ...it/ParlAI/examples/memnn_luatorch_cpu/params_default.lua:16: in main chunk
    [C]: in function 'dofile'
    ...AI/parlai/agents/memnn_luatorch_cpu/memnn_zmq_parsed.lua:75: in main chunk
    [C]: at 0x00405d50
[dict_language:english]
[download_path:/home/chait/ParlAI/downloads]
[dict_unktoken:<UNK>]
[remote_args:/home/chait/ParlAI/examples/memnn_luatorch_cpu/params_default.lua]
[numthreads:8]
[port:5555]
[dict_minfreq:0]
[num_its:100]
[dict_nulltoken:<NULL>]
[datatype:train]
[datapath:/home/chait/ParlAI/data]
[dict_max_ngram_size:-1]
[task:dialog_babi:Task:1]
[num_examples:1000]
[parlai_home:/home/chait/ParlAI]
[batchsize:1]
[remote_cmd:luajit /home/chait/ParlAI/parlai/agents/memnn_luatorch_cpu/memnn_zmq_parsed.lua]
Setting up dictionary.
[creating task(s): dialog_babi:Task:1]
[DialogTeacher initializing.]
[loading fbdialog data:/home/chait/ParlAI/data/dialog-bAbI/dialog-bAbI-tasks/dialog-babi-task1-API-calls-trn.txt]
[creating task(s): dialog_babi:Task:1]
[DialogTeacher initializing.]
[loading fbdialog data:/home/chait/ParlAI/data/dialog-bAbI/dialog-bAbI-tasks/dialog-babi-task1-API-calls-dev.txt]
Dictionary: saving dictionary to /tmp/dict.txt.
Dictionary ready, moving on to training.
python thread connected to tcp://localhost:5555
[creating task(s): dialog_babi:Task:1]
[DialogTeacher initializing.]
[loading fbdialog data:/home/chait/ParlAI/data/dialog-bAbI/dialog-bAbI-tasks/dialog-babi-task1-API-calls-trn.txt]
[DialogTeacher initializing.]
python thread connected to tcp://localhost:5562
[DialogTeacher initializing.]
python thread connected to tcp://localhost:5563
[DialogTeacher initializing.]
python thread connected to tcp://localhost:5558
[DialogTeacher initializing.]
python thread connected to tcp://localhost:5559
[DialogTeacher initializing.]
python thread connected to tcp://localhost:5557
[DialogTeacher initializing.]
python thread connected to tcp://localhost:5556
[DialogTeacher initializing.]
python thread connected to tcp://localhost:5560
[DialogTeacher initializing.]
python thread connected to tcp://localhost:5561

Is there a way to know if the model is training? (None of the 8 python processes created seem to be using any CPU resources)

alexholdenmiller commented 7 years ago

Ah yes--the python threads are blocked because there was an issue launching the lua threads: at the top there, it had trouble opening a required library.

Try cd {ParlAI}/downloads/memnnlib/KVmemnn/; ./setup.sh

After running setup.sh, verify that everything is working correctly by running luajit -e "require 'library.PositionalEncoder'"--if that does not give an error, you're good.

Let me know if that works!
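To tell a blocked run from a slow one, it can help to check whether anything is listening on the ports the Python threads report. This is a rough sketch under two assumptions: that the Lua side is the one accepting connections, and that the ports are the nine consecutive ones seen in the log above (5555 to 5563). The function names are made up for illustration. Note that a plain TCP probe is not a ZMQ handshake, so it only shows whether a process is bound to the port.

```python
import socket


def is_listening(port, host="localhost", timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timeout, or unreachable host.
        return False


def silent_ports(first_port=5555, count=9, host="localhost"):
    """List the ports in the range that nobody is listening on."""
    ports = range(first_port, first_port + count)
    return [p for p in ports if not is_listening(p, host=host)]
```

If `silent_ports()` returns every port, the remote process most likely never started, which matches the failed library load at the top of the log.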

alexholdenmiller commented 7 years ago

There seems to be some weirdness with lua package paths sometimes (why I'm using luajit instead of th). The current setup works on our machines, but maybe it doesn't always.

Also make sure that which luajit points to ~/torch/install/bin/luajit.
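The same check can be scripted. The sketch below assumes torch lives under `~/torch` with the binary at `install/bin/luajit`, as in the path above. `luajit_in_torch` is a hypothetical helper, and the `which` parameter exists only so the lookup can be swapped out when testing.

```python
import os
import shutil


def luajit_in_torch(torch_dir="~/torch", which=shutil.which):
    """Return True if the luajit found on PATH is the one inside torch_dir."""
    found = which("luajit")
    if found is None:
        return False
    expected = os.path.join(
        os.path.expanduser(torch_dir), "install", "bin", "luajit"
    )
    # Compare resolved paths so symlinks do not cause a false mismatch.
    return os.path.realpath(found) == os.path.realpath(expected)
```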

chaitjo commented 7 years ago

I can't seem to run setup.sh. Here's what I get:

Installing https://raw.githubusercontent.com/torch/rocks/master/tds-scm-1.rockspec...
Using https://raw.githubusercontent.com/torch/rocks/master/tds-scm-1.rockspec... switching to 'build' mode
Cloning into 'tds'...
error: cannot expand target domain name in SRV RR
fatal: Unable to look up github.com

Error: Failed cloning git repository.
Error: resolve then run luarocks install tds

Is there a way to manually do this?

which luajit points to the correct path.

alexholdenmiller commented 7 years ago

Ah yes, can you try running luarocks install tds from the command line yourself? There seems to be an issue with that part of the setup process.

chaitjo commented 7 years ago

Same output as setup.sh :(

alexholdenmiller commented 7 years ago

I was able to run the command on my machine. It says "unable to look up github.com" for you--do you have any reason to believe you might have network or DNS issues right now? That "cannot expand target domain name in SRV RR" error brings up DNS issues on a google search (no idea how to resolve that though...).
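One way to separate a DNS problem from a luarocks problem is to try resolving the host directly. A minimal sketch, where `can_resolve` is a made-up name and the `resolver` parameter is there only to make the function testable without a network:

```python
import socket


def can_resolve(host, resolver=socket.getaddrinfo):
    """Return True if the host name resolves to at least one address."""
    try:
        return len(resolver(host, 443)) > 0
    except socket.gaierror:
        # Name resolution failed (for example, a DNS outage or bad resolver).
        return False
```

If `can_resolve("github.com")` is False on the affected machine, the clone step cannot succeed until name resolution is fixed, for instance by switching networks.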

chaitjo commented 7 years ago

I'll try again and let you know.

nirmal070125 commented 7 years ago

For me, installing lzmq via luarocks install lzmq seemed to fail. Could you please explain what steps are needed to get this example to work?

alexholdenmiller commented 7 years ago

Hi @nirmal070125, did you get the same error as above?

atav32 commented 7 years ago

I guess I'm in this bucket now too.

~/code/ParlAI> py3 examples/memnn_luatorch_cpu/full_task_train.py -t babi:task10k:1 -nt 8
Traceback (most recent call last):
  File "examples/memnn_luatorch_cpu/full_task_train.py", line 117, in <module>
    main()
  File "examples/memnn_luatorch_cpu/full_task_train.py", line 42, in main
    'or manually set --remote-cmd for this example.')
RuntimeError: Could not detect torch luajit installed: please install torch from http://torch.ch or manually set --remote-cmd for this example.

added --remote-cmd flag

~/code/ParlAI> py3 examples/memnn_luatorch_cpu/full_task_train.py -t babi:task10k:1 -nt 8 --remote-cmd parlai/agents/memnn_luatorch_cpu/memnn_zmq_parsed.lua
[no_images:False]
[remote_cmd:parlai/agents/memnn_luatorch_cpu/memnn_zmq_parsed.lua]
[batchsize:1]
[dict_unktoken:<UNK>]
[dict_max_exs:10000]
[datapath:/Users/brian/code/ParlAI/data]
[datatype:train]
[num_examples:1000]
[dict_language:english]
[numthreads:8]
[parlai_home:/Users/brian/code/ParlAI]
[dict_minfreq:0]
[remote_args:/Users/brian/code/ParlAI/examples/memnn_luatorch_cpu/params_default.lua]
[dict_max_ngram_size:-1]
[task:babi:task10k:1]
[download_path:/Users/brian/code/ParlAI/downloads]
[dict_nulltoken:<NULL>]
[port:5555]
[num_its:100]
Setting up dictionary.
[nltk_data] Downloading package punkt to /Users/brian/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[creating task(s): babi:task10k:1]
[DialogTeacher initializing.]
[loading fbdialog data:/Users/brian/code/ParlAI/data/bAbI/tasks_1-20_v1-2/en-valid-10k-nosf/qa1_train.txt]
Dictionary building on train:ordered data.
[creating task(s): babi:task10k:1]
[DialogTeacher initializing.]
[loading fbdialog data:/Users/brian/code/ParlAI/data/bAbI/tasks_1-20_v1-2/en-valid-10k-nosf/qa1_valid.txt]
Dictionary building on valid data.
Dictionary: saving dictionary to /tmp/dict.txt.
Dictionary ready, moving on to training.
Traceback (most recent call last):
  File "examples/memnn_luatorch_cpu/full_task_train.py", line 117, in <module>
    main()
  File "examples/memnn_luatorch_cpu/full_task_train.py", line 84, in main
    agent = ParsedRemoteAgent(opt, {'dictionary': dictionary})
  File "/Users/brian/code/ParlAI/parlai/agents/remote_agent/agents.py", line 126, in __init__
    super().__init__(opt, shared)
  File "/Users/brian/code/ParlAI/parlai/agents/remote_agent/agents.py", line 51, in __init__
    args=opt.get('remote_args', '')
  File "/Users/brian/.pyenv/versions/3.5.0/lib/python3.5/subprocess.py", line 950, in __init__
    restore_signals, start_new_session)
  File "/Users/brian/.pyenv/versions/3.5.0/lib/python3.5/subprocess.py", line 1540, in _execute_child
    raise child_exception_type(errno_num, err_msg)
PermissionError: [Errno 13] Permission denied

then tried sudoing it; same error.


Tested my LuaJIT installation and that showed an error - details.

alexholdenmiller commented 7 years ago

Thanks for the report @atav32!

1) Do you have torch installed and up to date? 2) Note that the remote cmd needs to include luajit (e.g. --remote-cmd "luajit parlai/agents/memnn_luatorch_cpu/memnn_zmq_parsed.lua"). The flag is only there in case you have a funky installation of torch/luajit and want to provide the path to your torch's luajit command.

alexholdenmiller commented 7 years ago

The "permission denied" error there is because the os tried to run: parlai/agents/memnn_luatorch_cpu/memnn_zmq_parsed.lua which doesn't have executable permissions (and shouldn't be run directly as a bash command, naturally)
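The error is easy to reproduce outside ParlAI. The sketch below, for POSIX systems only, tries to launch a script with no execute bit and then a directory. `try_launch` and the temporary `agent.lua` file are stand-ins invented for the illustration. That the launcher splits the command and executes the first token is an assumption here, not something checked against the ParlAI source.

```python
import os
import subprocess
import tempfile


def try_launch(argv):
    """Try to start argv; return the errno if the OS refuses, else None."""
    try:
        proc = subprocess.Popen(
            argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
    except OSError as err:
        return err.errno
    proc.wait()
    return None


with tempfile.TemporaryDirectory() as tmp:
    script = os.path.join(tmp, "agent.lua")  # stand-in for the real .lua file
    with open(script, "w") as f:
        f.write("print('hello')\n")
    os.chmod(script, 0o644)  # readable, but no execute bit

    errno_for_script = try_launch([script])  # like passing only the .lua path
    errno_for_dir = try_launch([tmp])        # like passing ~/torch/
```

Both attempts fail with errno 13 (EACCES), the same code as in the tracebacks above. Running under sudo does not help because the file has no execute bit for any user.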

atav32 commented 7 years ago

@alexholdenmiller Thanks for the advice!

  1. Torch is installed and up to date
  2. Of course! Brain fart on not using the luajit command in the --remote-cmd flag.

Running that got me a step further and told me ZMQ wasn't found (I forgot to copy down the error log). Then I installed ZMQ (brew install zmq) and LZMQ (luarocks install lzmq) and I think everything's working.


Does this look right?

~/code/ParlAI> py3 examples/memnn_luatorch_cpu/full_task_train.py -t babi:task10k:1
-nt 8 --remote-cmd "luajit parlai/agents/memnn_luatorch_cpu/memnn_zmq_parsed.lua"
[dict_max_ngram_size:-1]
[dict_nulltoken:<NULL>]
[download_path:/Users/brian/code/ParlAI/downloads]
[numthreads:8]
[batchsize:1]
[parlai_home:/Users/brian/code/ParlAI]
[datatype:train]
[port:5555]
[num_its:100]
[dict_unktoken:<UNK>]
[no_images:False]
[task:babi:task10k:1]
[dict_max_exs:10000]
[dict_language:english]
[datapath:/Users/brian/code/ParlAI/data]
[remote_cmd:luajit parlai/agents/memnn_luatorch_cpu/memnn_zmq_parsed.lua]
[dict_minfreq:0]
[remote_args:/Users/brian/code/ParlAI/examples/memnn_luatorch_cpu/params_default.lua]
[num_examples:1000]
Setting up dictionary.
[creating task(s): babi:task10k:1]
[DialogTeacher initializing.]
[loading fbdialog data:/Users/brian/code/ParlAI/data/bAbI/tasks_1-20_v1-2/en-valid-10k-nosf/qa1_train.txt]
Dictionary building on train:ordered data.
[creating task(s): babi:task10k:1]
[DialogTeacher initializing.]
[loading fbdialog data:/Users/brian/code/ParlAI/data/bAbI/tasks_1-20_v1-2/en-valid-10k-nosf/qa1_valid.txt]
Dictionary building on valid data.
Dictionary: saving dictionary to /tmp/dict.txt.
Dictionary ready, moving on to training.
python thread connected to tcp://localhost:5555
[creating task(s): babi:task10k:1]
[DialogTeacher initializing.]
[loading fbdialog data:/Users/brian/code/ParlAI/data/bAbI/tasks_1-20_v1-2/en-valid-10k-nosf/qa1_train.txt]
luajit: ...downloads/memnnlib/KVmemnn/library/PositionalEncoder.lua:10: dlopen(libmemnn.so, 5): image not found
stack traceback:
        [C]: in function 'load'
        ...downloads/memnnlib/KVmemnn/library/PositionalEncoder.lua:10: in main chunk
        [C]: in function 'require'
        ...arlAI/downloads/memnnlib/KVmemnn/library/memnn_model.lua:15: in main chunk
        [C]: in function 'require'
        ...de/ParlAI/examples/memnn_luatorch_cpu/params_default.lua:16: in main chunk
        [C]: in function 'dofile'
        parlai/agents/memnn_luatorch_cpu/memnn_zmq_parsed.lua:75: in main chunk
        [C]: at 0x01000016c0
[DialogTeacher initializing.]
[creating task(s): babi:task10k:1]
[DialogTeacher initializing.]
[DialogTeacher initializing.]
[DialogTeacher initializing.]
[DialogTeacher initializing.]
[DialogTeacher initializing.]
[DialogTeacher initializing.]
[DialogTeacher initializing.]
[loading fbdialog data:/Users/brian/code/ParlAI/data/bAbI/tasks_1-20_v1-2/en-valid-10k-nosf/qa1_valid.txt]
[DialogTeacher initializing.]
python thread connected to tcp://localhost:5556
[ training ]
python thread connected to tcp://localhost:5559
python thread connected to tcp://localhost:5557
python thread connected to tcp://localhost:5560
python thread connected to tcp://localhost:5558
python thread connected to tcp://localhost:5561
python thread connected to tcp://localhost:5563
python thread connected to tcp://localhost:5562

If so, what should I expect next?

nirmal070125 commented 7 years ago

@alexholdenmiller I got the error solved by installing ZMQ via brew. Now, I'm also in the same state as @atav32

alexholdenmiller commented 7 years ago

@atav32 the python side is all good and the main lua command ran, but it failed to access the memnn C library functions.

Can you try going to the downloads/memnnlib/KVmemnn/ folder and running ./setup.sh?

If you look at the file, you can see that part of it builds the libmemnn.so file and symlinks it into your torch lib directory. Can you check that the file it produced isn't empty? (You can use locate libmemnn.so to find where it was put, then cat the file to make sure you get a bunch of binary rather than an error.)
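Instead of eyeballing the output of cat, the check can be done by looking at the first few bytes of the file. This is a sketch, not part of setup.sh: `looks_like_shared_lib` is a hypothetical helper that follows the symlink and accepts the ELF magic used on Linux or the common Mach-O magics used on macOS.

```python
import os

# Leading bytes of common shared-library formats.
_MAGICS = (
    b"\x7fELF",            # ELF (Linux)
    b"\xcf\xfa\xed\xfe",   # Mach-O 64-bit, little-endian
    b"\xce\xfa\xed\xfe",   # Mach-O 32-bit, little-endian
    b"\xca\xfe\xba\xbe",   # Mach-O universal (fat) binary
)


def looks_like_shared_lib(path):
    """Return True if path resolves to a non-empty ELF or Mach-O file."""
    real = os.path.realpath(path)  # follow the symlink into the torch lib dir
    if not os.path.isfile(real) or os.path.getsize(real) == 0:
        return False
    with open(real, "rb") as f:
        return f.read(4) in _MAGICS
```

A False result for the path reported by locate libmemnn.so suggests the build step failed and left an empty file, an error message, or a dangling symlink.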

nirmal070125 commented 7 years ago

@alexholdenmiller awesome! that worked! thanks.

chaitjo commented 7 years ago

@alexholdenmiller I was able to run setup.sh after switching my network. It's training now.