Closed nikefd closed 7 years ago
Hi @nikefd, have a look at scripts/unkreplace.lua. You should run it on a single gen-b*.txt file. It also needs the original source language data and an alignment dictionary. You can create the dictionary with scripts/makealigndict.lua, which in turn requires an alignment file that's generated with scripts/build_sym_alignment.py.
I understand that this is a little involved -- if you have any questions along the way please don't hesitate to ask!
Ok, it is really helpful! Thanks!
Hi, Jonas. I built fast_align and mosesdecoder (I use the Moses RELEASE-2.1 packages).
Then I ran the following command:
python build_sym_alignment.py --fast_align_dir ~/download/fast_align/build/ --mosesdecoder_dir /opt/moses/ --source_file ../data/qqData/train.x --target_file ../data/qqData/train.y --output_dir dict
train.x and train.y each contain 1M sentences.
After running the command, I got four files: align.backward, aligned.grow-diag-final-and, align.forward and text.joined. But I think the file aligned.grow-diag-final-and is incomplete, and there is no aligned.sym_heuristic file.
I also got a segmentation fault while computing the grow alignment:
symal: computing grow alignment: diagonal (1) final (1)both-uncovered (1)
*** Segmentation fault
Register dump:
RAX: 0000000000000010 RBX: 00007ffee1b95b70 RCX: 00007ffee1b95b70
RDX: 0000000000000000 RSI: 00007ffee1b95b70 RDI: 00007ffee1b923c3
RBP: 00007ffee1b95b74 R8 : 00000000ffffffff R9 : 00007ffee1b95c78
R10: 00007f5231e3c940 R11: 0000000000000246 R12: 00007ffee1b923c3
R13: 00007ffee1b95b74 R14: 00007ffee1b92410 R15: 0000000000000397
RSP: 00007ffee1b92380
RIP: 00007f5231bb7f45 EFLAGS: 00010202
CS: 0033 FS: 0000 GS: 0000
Trap: 0000000e Error: 00000005 OldMask: 00000000 CR2: fffffff8
FPUCW: 0000037f FPUSW: 00000000 TAG: 00000000
RIP: 00000000 RDP: 00000000
ST(0) 0000 0000000000000000 ST(1) 0000 0000000000000000
ST(2) 0000 0000000000000000 ST(3) 0000 0000000000000000
ST(4) 0000 0000000000000000 ST(5) 0000 0000000000000000
ST(6) 0000 0000000000000000 ST(7) 0000 0000000000000000
mxcsr: 1f80
XMM0: 00000000000000000000000000000000 XMM1: 00000000000000000000000000000000
XMM2: 00000000000000000000000000000000 XMM3: 00000000000000000000000000000000
XMM4: 00000000000000000000000000000000 XMM5: 00000000000000000000000000000000
XMM6: 00000000000000000000000000000000 XMM7: 00000000000000000000000000000000
XMM8: 00000000000000000000000000000000 XMM9: 00000000000000000000000000000000
XMM10: 00000000000000000000000000000000 XMM11: 00000000000000000000000000000000
XMM12: 00000000000000000000000000000000 XMM13: 00000000000000000000000000000000
XMM14: 00000000000000000000000000000000 XMM15: 00000000000000000000000000000000
Backtrace:
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(_ZNSi6sentryC1ERSib+0x15)[0x7f5231bb7f45]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(_ZNSirsERi+0x1b)[0x7f5231bb823b]
/opt/moses/bin/symal[0x402c97]
/opt/moses/bin/symal[0x405cca]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0x7f523105cf45]
/opt/moses/bin/symal[0x402359]
Memory map:
00400000-00409000 r-xp 00000000 08:03 178625678 /opt/moses/bin/symal
00608000-00609000 rw-p 00008000 08:03 178625678 /opt/moses/bin/symal
00609000-0060c000 rw-p 00000000 00:00 0
018b5000-01979000 rw-p 00000000 00:00 0 [heap]
7f523103b000-7f52311f5000 r-xp 00000000 08:03 2173831 /lib/x86_64-linux-gnu/libc-2.19.so
7f52311f5000-7f52313f5000 ---p 001ba000 08:03 2173831 /lib/x86_64-linux-gnu/libc-2.19.so
7f52313f5000-7f52313f9000 r--p 001ba000 08:03 2173831 /lib/x86_64-linux-gnu/libc-2.19.so
7f52313f9000-7f52313fb000 rw-p 001be000 08:03 2173831 /lib/x86_64-linux-gnu/libc-2.19.so
7f52313fb000-7f5231400000 rw-p 00000000 00:00 0
7f5231400000-7f5231419000 r-xp 00000000 08:03 2173823 /lib/x86_64-linux-gnu/libpthread-2.19.so
7f5231419000-7f5231618000 ---p 00019000 08:03 2173823 /lib/x86_64-linux-gnu/libpthread-2.19.so
7f5231618000-7f5231619000 r--p 00018000 08:03 2173823 /lib/x86_64-linux-gnu/libpthread-2.19.so
7f5231619000-7f523161a000 rw-p 00019000 08:03 2173823 /lib/x86_64-linux-gnu/libpthread-2.19.so
7f523161a000-7f523161e000 rw-p 00000000 00:00 0
7f523161e000-7f5231634000 r-xp 00000000 08:03 2171329 /lib/x86_64-linux-gnu/libgcc_s.so.1
7f5231634000-7f5231833000 ---p 00016000 08:03 2171329 /lib/x86_64-linux-gnu/libgcc_s.so.1
7f5231833000-7f5231834000 rw-p 00015000 08:03 2171329 /lib/x86_64-linux-gnu/libgcc_s.so.1
7f5231834000-7f5231939000 r-xp 00000000 08:03 2173816 /lib/x86_64-linux-gnu/libm-2.19.so
7f5231939000-7f5231b38000 ---p 00105000 08:03 2173816 /lib/x86_64-linux-gnu/libm-2.19.so
7f5231b38000-7f5231b39000 r--p 00104000 08:03 2173816 /lib/x86_64-linux-gnu/libm-2.19.so
7f5231b39000-7f5231b3a000 rw-p 00105000 08:03 2173816 /lib/x86_64-linux-gnu/libm-2.19.so
7f5231b3a000-7f5231c20000 r-xp 00000000 08:03 223758615 /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.19
7f5231c20000-7f5231e1f000 ---p 000e6000 08:03 223758615 /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.19
7f5231e1f000-7f5231e27000 r--p 000e5000 08:03 223758615 /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.19
7f5231e27000-7f5231e29000 rw-p 000ed000 08:03 223758615 /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.19
7f5231e29000-7f5231e3e000 rw-p 00000000 00:00 0
7f5231e3e000-7f5231e45000 r-xp 00000000 08:03 2173830 /lib/x86_64-linux-gnu/librt-2.19.so
7f5231e45000-7f5232044000 ---p 00007000 08:03 2173830 /lib/x86_64-linux-gnu/librt-2.19.so
7f5232044000-7f5232045000 r--p 00006000 08:03 2173830 /lib/x86_64-linux-gnu/librt-2.19.so
7f5232045000-7f5232046000 rw-p 00007000 08:03 2173830 /lib/x86_64-linux-gnu/librt-2.19.so
7f5232046000-7f523204a000 r-xp 00000000 08:03 2173825 /lib/x86_64-linux-gnu/libSegFault.so
7f523204a000-7f5232249000 ---p 00004000 08:03 2173825 /lib/x86_64-linux-gnu/libSegFault.so
7f5232249000-7f523224a000 r--p 00003000 08:03 2173825 /lib/x86_64-linux-gnu/libSegFault.so
7f523224a000-7f523224b000 rw-p 00004000 08:03 2173825 /lib/x86_64-linux-gnu/libSegFault.so
7f523224b000-7f523224e000 r-xp 00000000 08:03 2173818 /lib/x86_64-linux-gnu/libdl-2.19.so
7f523224e000-7f523244d000 ---p 00003000 08:03 2173818 /lib/x86_64-linux-gnu/libdl-2.19.so
7f523244d000-7f523244e000 r--p 00002000 08:03 2173818 /lib/x86_64-linux-gnu/libdl-2.19.so
7f523244e000-7f523244f000 rw-p 00003000 08:03 2173818 /lib/x86_64-linux-gnu/libdl-2.19.so
7f523244f000-7f5232472000 r-xp 00000000 08:03 2173824 /lib/x86_64-linux-gnu/ld-2.19.so
7f523264e000-7f5232655000 rw-p 00000000 00:00 0
7f523266f000-7f5232671000 rw-p 00000000 00:00 0
7f5232671000-7f5232672000 r--p 00022000 08:03 2173824 /lib/x86_64-linux-gnu/ld-2.19.so
7f5232672000-7f5232673000 rw-p 00023000 08:03 2173824 /lib/x86_64-linux-gnu/ld-2.19.so
7f5232673000-7f5232674000 rw-p 00000000 00:00 0
7ffee1b76000-7ffee1b97000 rw-p 00000000 00:00 0 [stack]
7ffee1bf5000-7ffee1bf7000 r--p 00000000 00:00 0 [vvar]
7ffee1bf7000-7ffee1bf9000 r-xp 00000000 00:00 0 [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
sh: line 1: 33675 Segmentation fault (core dumped) /opt/moses/bin/symal -alignment="grow" -diagonal="yes" -final="yes" -both="yes" > dict/aligned.grow-diag-final-and
I searched on Google and couldn't find a way to deal with it. Maybe it is a version problem? So I tried Moses RELEASE-1.0 and got a different error:
sh: /opt/moses/scripts/ems/support/symmetrize-fast-align.perl: No such file or directory
Traceback (most recent call last):
  File "build_sym_alignment.py", line 110, in <module>
    main()
  File "build_sym_alignment.py", line 106, in main
    assert os.system(sym_cmd) == 0
AssertionError
I also tried building Moses from source, but got the following error:
symal: computing grow alignment: diagonal (1) final (1)both-uncovered (1)
sh: line 1: 5421 Segmentation fault (core dumped) /home/nikefd/download/mosesdecoder/bin/symal -alignment="grow" -diagonal="yes" -final="yes" -both="yes" > dict/aligned.grow-diag-final-and
So the BLEU score we get from fairseq train ... is not the final score, because it is computed before unk replacement?
Yes, that's true. If you're using BPE, the BLEU score reported will also be over BPE tokens and not over actual words.
Regarding your error, did you try the Moses version from Github? I haven't seen that error before (maybe @michaelauli has?). You should probably try asking on the Moses mailing list (see moses-smt/mosesdecoder#160).
Ok, I will ask this question on the Moses mailing list. Thanks!
Hi, Jonas. I found that if I limit the sentence length to 50 words, it runs successfully. So I have now built the alignment file. Thanks!
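The 50-word limit mentioned above can be applied to both sides of the corpus before running build_sym_alignment.py. A minimal sketch in Python, assuming whitespace-tokenized text with one sentence per line; the helper name filter_by_length is hypothetical and not part of fairseq or Moses:

```python
def filter_by_length(src_lines, tgt_lines, max_len=50):
    """Keep sentence pairs whose source and target both have 1..max_len tokens.

    Hypothetical helper: both sides are filtered together so that the
    two files stay line-aligned.
    """
    kept_src, kept_tgt = [], []
    for src, tgt in zip(src_lines, tgt_lines):
        n_src, n_tgt = len(src.split()), len(tgt.split())
        if 0 < n_src <= max_len and 0 < n_tgt <= max_len:
            kept_src.append(src)
            kept_tgt.append(tgt)
    return kept_src, kept_tgt
```

The two returned lists can then be written to new files and passed as --source_file and --target_file. Empty lines are dropped as well, which is a choice of this sketch.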
But I have another question.
How can I use makealigndict.lua and unkreplace.lua?
I tried fairseq makealigndict ... and it failed:
module 'fairseq.scripts.makealigndict' not found:No LuaRocks module found for fairseq.scripts.makealigndict
I guess that I should modify the file CMakeLists.txt.
# Scripts and main executable
FOREACH(SCRIPT preprocess train tofloat generate generate-lines score optimize-fconv help)
INSTALL(FILES "${SCRIPT}.lua" DESTINATION "${ROCKS_LUADIR}/fairseq/scripts")
ENDFOREACH(SCRIPT)
That is, add scripts/makealigndict and scripts/unkreplace here and build again:
# Scripts and main executable
FOREACH(SCRIPT preprocess train tofloat generate generate-lines score optimize-fconv help scripts/makealigndict scripts/unkreplace)
INSTALL(FILES "${SCRIPT}.lua" DESTINATION "${ROCKS_LUADIR}/fairseq/scripts")
ENDFOREACH(SCRIPT)
The files in scripts/ are helper scripts for specific setups and are thus not part of the main set of tools. Once fairseq has been installed, it's easy to run them via th. For example, if you use sub-word units like BPE codes then you'll likely never need those scripts. Hence, I'd like to keep them separate.
Ok, got it. But I am confused about how to run them via th?
You should be able to run th scripts/unkreplace.lua -help from your shell prompt.
Thanks. I ran into a problem when I ran th scripts/unkreplace.lua. The error is as follows:
argument 1 expected a 'string', got a 'nil'
I found that several sentences in gen-b*.txt have the same index! For example, index 52 appears more than once:
170649 S-52 the cast of big valley where are they now
170650 T-52 big valley cast members
170651 H-52 -0.325046 big valley where are they now
170652 A-52 4 5 6 7 8 9 10
...
170777 S-52 how much if i make a small bathroom <unk>
170778 T-52 how much does it cost to remodel a small bathroom
170779 H-52 -0.655210 how much do you make a small bathroom
170780 A-52 8 2 3 8 5 6 7 8 9
The latter's index should be 818.
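Duplicate indices like the ones above can be located with a short script. A minimal sketch in Python, assuming each source line of the generation output starts with S-<index> followed by whitespace, as in the excerpt; the function name find_duplicate_indices is hypothetical:

```python
import re
from collections import Counter

# Matches source lines such as "S-52 the cast of big valley ..."
SOURCE_LINE = re.compile(r"^S-(\d+)\s")


def find_duplicate_indices(lines):
    """Return {index: count} for every S-<index> that occurs more than once."""
    counts = Counter()
    for line in lines:
        match = SOURCE_LINE.match(line)
        if match:
            counts[int(match.group(1))] += 1
    return {idx: n for idx, n in counts.items() if n > 1}
```

Passing the lines of a single gen-b*.txt file to this function lists every index that was emitted more than once, which helps to check whether the duplication follows a pattern.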
Hi, Jonas. @jgehring Another question. I use th scripts/makealigndict.lua to build the alignment dictionary. With a dataset of 1M sentences it works well, but with a dataset of 40M it runs into the error below:
Processed 37575000 sentences
Processed 37600000 sentences
Processed 37625000 sentences
/home/myname/torch/install/bin/luajit: /home/myname/torch/install/share/lua/5.1/tds/hash.lua:76: hash index is nil
stack traceback:
[C]: in function 'assert'
/home/myname/torch/install/share/lua/5.1/tds/hash.lua:76: in function '__index'
scripts/makealigndict.lua:55: in main chunk
[C]: in function 'dofile'
.../torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
[C]: at 0x00406670
I tried to find the problem but failed. I looked at hash.lua:76:
function hash:__index(lkey)
    local lval
    assert(self)
    assert(lkey or type(lkey) == 'boolean', 'hash index is nil')
    elem.set(key__, lkey)
    if C.tds_hash_search(self, key__, val__) == 0 then
        lval = elem.get(val__)
    end
    return lval
end
The assert fails because lkey is nil and type(lkey) != 'boolean'. Then I looked at makealigndict.lua:
while true do
    local s = source:read()
    if s == nil then
        break
    end
    local t = target:read()
    local a = alignment:read()
    local stoks = tokenize(s)
    local ttoks = tokenize(t)
    local atoks = tokenize(a)
    for _, atok in ipairs(atoks) do
        local apair = tablex.map(tonumber, stringx.split(atok, '-'))
        local stok = stoks[apair[1] + 1]
        local ttok = ttoks[apair[2] + 1]
        if not dict[stok] then
            dict[stok] = tds.Hash()
        end
        if not dict[stok][ttok] then
            dict[stok][ttok] = 1
        else
            dict[stok][ttok] = dict[stok][ttok] + 1
        end
    end
    n = n + 1
    if n % 25000 == 0 then
        print(string.format('Processed %d sentences', n))
    end
end
print(string.format('Processed %d sentences', n))
So the error occurs because ttok is nil and type(ttok) != 'boolean'. I used sed to cut out the part that triggered the error:
sed -n '37625000,37650000p' dataset.y > looooook.txt
What should I do next?
Sorry for the delay! Regarding the duplicate indices: this is strange. The indices should correspond to the line numbers in the text file that was binarized. How did you run binarization?
Regarding the aligndict issue: thanks for tracking this down -- you're reading the code wrong though: the assert triggers if the key is nil (it checks for key, which is false for nil and for false, hence the additional check for boolean). Modify the function like this and process the part that threw the error:
while true do
    local s = source:read()
    if s == nil then
        break
    end
    local t = target:read()
    local a = alignment:read()
    local stoks = tokenize(s)
    local ttoks = tokenize(t)
    local atoks = tokenize(a)
    for _, atok in ipairs(atoks) do
        local apair = tablex.map(tonumber, stringx.split(atok, '-'))
        local stok = stoks[apair[1] + 1]
        local ttok = ttoks[apair[2] + 1]
        if not stok then
            error(string.format("Source token is nil: %s at %d in '%s'", stok, apair[1] + 1, s))
        end
        if not ttok then
            error(string.format("Target token is nil: %s at %d in '%s'", ttok, apair[2] + 1, t))
        end
        if not dict[stok] then
            dict[stok] = tds.Hash()
        end
        if not dict[stok][ttok] then
            dict[stok][ttok] = 1
        else
            dict[stok][ttok] = dict[stok][ttok] + 1
        end
    end
    n = n + 1
    if n % 25000 == 0 then
        print(string.format('Processed %d sentences', n))
    end
end
print(string.format('Processed %d sentences', n))
I would guess the alignment in question contains invalid indices (the code should just skip these but there doesn't seem to be any error checking).
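The "just skip these" behaviour mentioned above could look roughly like the following. This is a Python sketch of the same counting loop, not the Lua script itself; it assumes whitespace tokenization and 0-based i-j alignment pairs (the Lua code adds 1 because Lua tables are 1-based), and the function name build_align_dict is hypothetical:

```python
from collections import defaultdict


def build_align_dict(sources, targets, alignments):
    """Count aligned (source token, target token) pairs.

    Alignment pairs that point outside either sentence are skipped
    and counted instead of raising an error.
    """
    counts = defaultdict(lambda: defaultdict(int))
    skipped = 0
    for src, tgt, align in zip(sources, targets, alignments):
        stoks, ttoks = src.split(), tgt.split()
        for pair in align.split():
            i, j = (int(x) for x in pair.split("-"))
            if not (0 <= i < len(stoks) and 0 <= j < len(ttoks)):
                skipped += 1  # invalid index: ignore this pair
                continue
            counts[stoks[i]][ttoks[j]] += 1
    return counts, skipped
```

Reporting the number of skipped pairs keeps a badly misaligned file visible: a large count suggests that the alignment file does not match the text files.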
You are welcome! I located the error.
The source sentence is creeper rap ||| itunes boom boom boom boom boom boom boom and the target sentence is creeper rap video.
The alignment file line is 0-0 1-8.
It ran into the error because of the 8:
scripts/makealigndict.lua:70: Target token is nil: nil at 9 in 'creeper rap video'
I guess it is because of the ||| separator that you use in build_sym_alignment.py, line 74?
Thanks!
Oh, yeah, I read the code wrong. I have edited my comment so that it does not mislead others.
Another question.
I am confused about the attention index out of bound: message that I get when I run unkreplace.
if attn < 1 or attn > #stoks then
    io.stderr:write(string.format(
        'Sentence %d: attention index out of bound: %d\n',
        i, attn))
    htoks[j] = ''
Here is my understanding:
The attention scores returned by search.lua have (targetlength x sourcelength) entries for every sentence.
And in hooks.lua, you use torch.max to get the index of the maximum score for every target word.
local _, maxattns = torch.max(attns[hindex], 2)
So I think the indices in attns cannot exceed sourcelength, right?
I don't know what would cause attention index out of bound: to happen.
Thanks!
Ok, so it seems that the alignment is really off. The target sentence has a length smaller than 8. That '|||' marker shouldn't be in the file that you feed to makealigndict. Did you run it with the aligned.grow-diag-final-and file?
I don't know what will cause the attention index out of bound: happen.
This is just a sanity check since the unkreplace tool reads external data.
Yes, I didn't remove the '|||' in the file. I will do that, thanks!
This is just a sanity check since the unkreplace tool reads external data.
I am sorry, I did not express myself clearly.
I have added unk replacement to generate-lines.lua, similar to what you do in unkreplace.lua. But I get some attention index out of bound: messages when I use fairseq generate-lines ....
For example, I input hello and get the result below:
A 1 2
H: hello
A 1 2 1
Sentence 2: attention index out of bound: 2
H: hello <unk>
But I can't figure out why this happens. Thanks.
Just to double-check, the ||| markers are in the aligned.grow-diag-final-and file?
What do you mean? The file aligned.grow-diag-final-and contains only numbers.
I used train.x and train.y to make aligned.grow-diag-final-and, and train.x contains the sentence creeper rap ||| itunes boom boom boom boom boom boom boom.
I am running two trainings at the same time: one on the 40M dataset, and one on a 1M subset that I cut from the 40M dataset with head for testing.
With the 1M subset it works well, because train.x and train.y don't contain the '|||' marker. I got aligndict.th7 and can run unkreplace now, but it can't replace all unks because of the attention index out of bound: error.
With the 40M dataset, building the aligndict fails because train.x contains the '|||' marker. I will remove it and run it again.
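Since fast_align reads its input as source ||| target on one line, a literal ||| inside a sentence shifts the token positions in the resulting alignment. One way to clean the corpus is to drop such pairs before joining the files. A minimal sketch in Python; the function name drop_pairs_with_separator is hypothetical, and dropping the whole pair (rather than rewriting the token) is a choice of this sketch:

```python
def drop_pairs_with_separator(src_lines, tgt_lines, sep="|||"):
    """Remove sentence pairs in which either side contains the separator.

    Both sides are dropped together so the files stay line-aligned.
    Returns the kept source lines, kept target lines and the number
    of dropped pairs.
    """
    kept_src, kept_tgt, dropped = [], [], 0
    for src, tgt in zip(src_lines, tgt_lines):
        if sep in src or sep in tgt:
            dropped += 1
            continue
        kept_src.append(src)
        kept_tgt.append(tgt)
    return kept_src, kept_tgt, dropped
```

The cleaned lists can be written back to new train.x and train.y files before build_sym_alignment.py is run again.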
Hi, Jonas. @jgehring
I think I found why this happens. In model.lua, you do maxlen + 1 steps to give the model a chance to predict EOS.
So if the source is hello and the target is hello <unk>, the size of the attention entry is 3 * 2.
When you want to replace the unk, the index is 2, which means EOS, but the real source length is 1. That is why attention index out of bound happens.
Is that right?
Thanks!
Yes that's entirely possible, good catch!
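The case discussed above can be illustrated with a small replacement routine that treats any position beyond the source length (including the EOS column) as "nothing to copy". This is a Python sketch and not the code from unkreplace.lua; it assumes 1-based attention positions with one entry per hypothesis token, and the function name replace_unks and the dictionary layout are hypothetical:

```python
def replace_unks(src_tokens, hyp_tokens, attn, align_dict=None, unk="<unk>"):
    """Replace <unk> tokens using the most-attended source position.

    attn[k] is the 1-based source position for hyp_tokens[k]. A position
    of len(src_tokens) + 1 would point at EOS, so it is treated as out
    of range and the <unk> is left unchanged.
    """
    align_dict = align_dict or {}
    out = []
    for tok, pos in zip(hyp_tokens, attn):
        if tok != unk:
            out.append(tok)
        elif 1 <= pos <= len(src_tokens):
            src_tok = src_tokens[pos - 1]
            # Use the dictionary translation if present, else copy the source token.
            out.append(align_dict.get(src_tok, src_tok))
        else:
            out.append(tok)  # attention points at EOS or is invalid
    return out
```

For the example from this thread, replace_unks(["hello"], ["hello", "<unk>"], [1, 2]) leaves the <unk> in place, because position 2 lies past the single source token. The excerpt quoted earlier sets the token to an empty string instead; keeping <unk> is only this sketch's choice.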
Ok, Thanks!
I found that gen-b**.txt contains many unks. I want to replace them with your replacement method.
In your paper, you said
I think it must be in hooks.lua->hooks.runGeneration, but I can't understand what happens there. I found build_sym_alignment.py, but I can't find where it is used.