facebookresearch / fairseq-lua

Facebook AI Research Sequence-to-Sequence Toolkit
3.74k stars, 616 forks

question about unk replace #77

Closed. nikefd closed this issue 7 years ago.

nikefd commented 7 years ago

I found that gen-b*.txt contains many unks. I want to replace them with your replacement method.

In your paper, you said

For word-based models, we perform unknown word replacement based on attention scores after generation (Jean et al., 2015). Unknown words are replaced by looking up the source word with the maximum attention score in a precomputed dictionary. If the dictionary contains no translation, then we simply copy the source word.

I think it must be in hooks.lua -> hooks.runGeneration, but I can't understand what happens there.

I found build_sym_alignment.py, but I can't find where it is used.

jgehring commented 7 years ago

Hi @nikefd, have a look at scripts/unkreplace.lua. You should run it on a single gen-b*.txt file. It also needs the original source language data and an alignment dictionary. You can create the dictionary with scripts/makealigndict.lua, which in turn requires an alignment file that's generated with scripts/build_sym_alignment.py.

I understand that this is a little involved -- if you have any questions along the way please don't hesitate to ask!
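For orientation, the replacement rule the paper describes can be sketched in Python. This is a hypothetical minimal version, not the actual unkreplace.lua; the function name and data layout are assumptions, with 1-based attention indices as they appear on the A-lines of gen-b*.txt:

```python
def replace_unks(hyp_toks, src_toks, attn_idx, align_dict, unk="<unk>"):
    # attn_idx: one 1-based source position per hypothesis token,
    # as printed on the A-lines of gen-b*.txt.
    out = []
    for tok, a in zip(hyp_toks, attn_idx):
        if tok == unk and 1 <= a <= len(src_toks):
            src_word = src_toks[a - 1]
            # Use the dictionary translation if available, else copy the source word.
            out.append(align_dict.get(src_word, src_word))
        else:
            out.append(tok)
    return out
```

Here `align_dict` maps a source word to its most likely target-side translation, which is what the alignment dictionary built by scripts/makealigndict.lua provides.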

nikefd commented 7 years ago

OK, that's really helpful! Thanks!

nikefd commented 7 years ago

Hi, Jonas. I built fast_align and mosesdecoder (I used the Moses RELEASE-2.1 package), and then ran the following command: python build_sym_alignment.py --fast_align_dir ~/download/fast_align/build/ --mosesdecoder_dir /opt/moses/ --source_file ../data/qqData/train.x --target_file ../data/qqData/train.y --output_dir dict

train.x and train.y each contain 1M sentences.

After running the command, I got four files: align.backward, aligned.grow-diag-final-and, align.forward, and text.joined. But I think aligned.grow-diag-final-and is incomplete, and there is no aligned.sym_heuristic file.

I also got a segmentation fault while computing the grow alignment:

symal: computing grow alignment: diagonal (1) final (1)both-uncovered (1)
*** Segmentation fault
Register dump:

 RAX: 0000000000000010   RBX: 00007ffee1b95b70   RCX: 00007ffee1b95b70
 RDX: 0000000000000000   RSI: 00007ffee1b95b70   RDI: 00007ffee1b923c3
 RBP: 00007ffee1b95b74   R8 : 00000000ffffffff   R9 : 00007ffee1b95c78
 R10: 00007f5231e3c940   R11: 0000000000000246   R12: 00007ffee1b923c3
 R13: 00007ffee1b95b74   R14: 00007ffee1b92410   R15: 0000000000000397
 RSP: 00007ffee1b92380

 RIP: 00007f5231bb7f45   EFLAGS: 00010202

 CS: 0033   FS: 0000   GS: 0000

 Trap: 0000000e   Error: 00000005   OldMask: 00000000   CR2: fffffff8

 FPUCW: 0000037f   FPUSW: 00000000   TAG: 00000000
 RIP: 00000000   RDP: 00000000

 ST(0) 0000 0000000000000000   ST(1) 0000 0000000000000000
 ST(2) 0000 0000000000000000   ST(3) 0000 0000000000000000
 ST(4) 0000 0000000000000000   ST(5) 0000 0000000000000000
 ST(6) 0000 0000000000000000   ST(7) 0000 0000000000000000
 mxcsr: 1f80
 XMM0:  00000000000000000000000000000000 XMM1:  00000000000000000000000000000000
 XMM2:  00000000000000000000000000000000 XMM3:  00000000000000000000000000000000
 XMM4:  00000000000000000000000000000000 XMM5:  00000000000000000000000000000000
 XMM6:  00000000000000000000000000000000 XMM7:  00000000000000000000000000000000
 XMM8:  00000000000000000000000000000000 XMM9:  00000000000000000000000000000000
 XMM10: 00000000000000000000000000000000 XMM11: 00000000000000000000000000000000
 XMM12: 00000000000000000000000000000000 XMM13: 00000000000000000000000000000000
 XMM14: 00000000000000000000000000000000 XMM15: 00000000000000000000000000000000

Backtrace:
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(_ZNSi6sentryC1ERSib+0x15)[0x7f5231bb7f45]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(_ZNSirsERi+0x1b)[0x7f5231bb823b]
/opt/moses/bin/symal[0x402c97]
/opt/moses/bin/symal[0x405cca]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0x7f523105cf45]
/opt/moses/bin/symal[0x402359]

Memory map:

00400000-00409000 r-xp 00000000 08:03 178625678                          /opt/moses/bin/symal
00608000-00609000 rw-p 00008000 08:03 178625678                          /opt/moses/bin/symal
00609000-0060c000 rw-p 00000000 00:00 0 
018b5000-01979000 rw-p 00000000 00:00 0                                  [heap]
7f523103b000-7f52311f5000 r-xp 00000000 08:03 2173831                    /lib/x86_64-linux-gnu/libc-2.19.so
7f52311f5000-7f52313f5000 ---p 001ba000 08:03 2173831                    /lib/x86_64-linux-gnu/libc-2.19.so
7f52313f5000-7f52313f9000 r--p 001ba000 08:03 2173831                    /lib/x86_64-linux-gnu/libc-2.19.so
7f52313f9000-7f52313fb000 rw-p 001be000 08:03 2173831                    /lib/x86_64-linux-gnu/libc-2.19.so
7f52313fb000-7f5231400000 rw-p 00000000 00:00 0 
7f5231400000-7f5231419000 r-xp 00000000 08:03 2173823                    /lib/x86_64-linux-gnu/libpthread-2.19.so
7f5231419000-7f5231618000 ---p 00019000 08:03 2173823                    /lib/x86_64-linux-gnu/libpthread-2.19.so
7f5231618000-7f5231619000 r--p 00018000 08:03 2173823                    /lib/x86_64-linux-gnu/libpthread-2.19.so
7f5231619000-7f523161a000 rw-p 00019000 08:03 2173823                    /lib/x86_64-linux-gnu/libpthread-2.19.so
7f523161a000-7f523161e000 rw-p 00000000 00:00 0 
7f523161e000-7f5231634000 r-xp 00000000 08:03 2171329                    /lib/x86_64-linux-gnu/libgcc_s.so.1
7f5231634000-7f5231833000 ---p 00016000 08:03 2171329                    /lib/x86_64-linux-gnu/libgcc_s.so.1
7f5231833000-7f5231834000 rw-p 00015000 08:03 2171329                    /lib/x86_64-linux-gnu/libgcc_s.so.1
7f5231834000-7f5231939000 r-xp 00000000 08:03 2173816                    /lib/x86_64-linux-gnu/libm-2.19.so
7f5231939000-7f5231b38000 ---p 00105000 08:03 2173816                    /lib/x86_64-linux-gnu/libm-2.19.so
7f5231b38000-7f5231b39000 r--p 00104000 08:03 2173816                    /lib/x86_64-linux-gnu/libm-2.19.so
7f5231b39000-7f5231b3a000 rw-p 00105000 08:03 2173816                    /lib/x86_64-linux-gnu/libm-2.19.so
7f5231b3a000-7f5231c20000 r-xp 00000000 08:03 223758615                  /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.19
7f5231c20000-7f5231e1f000 ---p 000e6000 08:03 223758615                  /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.19
7f5231e1f000-7f5231e27000 r--p 000e5000 08:03 223758615                  /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.19
7f5231e27000-7f5231e29000 rw-p 000ed000 08:03 223758615                  /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.19
7f5231e29000-7f5231e3e000 rw-p 00000000 00:00 0 
7f5231e3e000-7f5231e45000 r-xp 00000000 08:03 2173830                    /lib/x86_64-linux-gnu/librt-2.19.so
7f5231e45000-7f5232044000 ---p 00007000 08:03 2173830                    /lib/x86_64-linux-gnu/librt-2.19.so
7f5232044000-7f5232045000 r--p 00006000 08:03 2173830                    /lib/x86_64-linux-gnu/librt-2.19.so
7f5232045000-7f5232046000 rw-p 00007000 08:03 2173830                    /lib/x86_64-linux-gnu/librt-2.19.so
7f5232046000-7f523204a000 r-xp 00000000 08:03 2173825                    /lib/x86_64-linux-gnu/libSegFault.so
7f523204a000-7f5232249000 ---p 00004000 08:03 2173825                    /lib/x86_64-linux-gnu/libSegFault.so
7f5232249000-7f523224a000 r--p 00003000 08:03 2173825                    /lib/x86_64-linux-gnu/libSegFault.so
7f523224a000-7f523224b000 rw-p 00004000 08:03 2173825                    /lib/x86_64-linux-gnu/libSegFault.so
7f523224b000-7f523224e000 r-xp 00000000 08:03 2173818                    /lib/x86_64-linux-gnu/libdl-2.19.so
7f523224e000-7f523244d000 ---p 00003000 08:03 2173818                    /lib/x86_64-linux-gnu/libdl-2.19.so
7f523244d000-7f523244e000 r--p 00002000 08:03 2173818                    /lib/x86_64-linux-gnu/libdl-2.19.so
7f523244e000-7f523244f000 rw-p 00003000 08:03 2173818                    /lib/x86_64-linux-gnu/libdl-2.19.so
7f523244f000-7f5232472000 r-xp 00000000 08:03 2173824                    /lib/x86_64-linux-gnu/ld-2.19.so
7f523264e000-7f5232655000 rw-p 00000000 00:00 0 
7f523266f000-7f5232671000 rw-p 00000000 00:00 0 
7f5232671000-7f5232672000 r--p 00022000 08:03 2173824                    /lib/x86_64-linux-gnu/ld-2.19.so
7f5232672000-7f5232673000 rw-p 00023000 08:03 2173824                    /lib/x86_64-linux-gnu/ld-2.19.so
7f5232673000-7f5232674000 rw-p 00000000 00:00 0 
7ffee1b76000-7ffee1b97000 rw-p 00000000 00:00 0                          [stack]
7ffee1bf5000-7ffee1bf7000 r--p 00000000 00:00 0                          [vvar]
7ffee1bf7000-7ffee1bf9000 r-xp 00000000 00:00 0                          [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]
sh: line 1: 33675 Segmentation fault      (core dumped) /opt/moses/bin/symal -alignment="grow" -diagonal="yes" -final="yes" -both="yes" > dict/aligned.grow-diag-final-and

I searched on Google and couldn't find a way to deal with it. Maybe it's a version problem? So I tried Moses RELEASE-1.0 and got a different error.

sh: /opt/moses/scripts/ems/support/symmetrize-fast-align.perl: No such file or directory
Traceback (most recent call last):
  File "build_sym_alignment.py", line 110, in <module>
    main()
  File "build_sym_alignment.py", line 106, in main
    assert os.system(sym_cmd) == 0
AssertionError

I also tried building Moses from source, but got the following error as well.

symal: computing grow alignment: diagonal (1) final (1)both-uncovered (1)
sh: line 1:  5421 Segmentation fault      (core dumped) /home/nikefd/download/mosesdecoder/bin/symal -alignment="grow" -diagonal="yes" -final="yes" -both="yes" > dict/aligned.grow-diag-final-and
nikefd commented 7 years ago

So the BLEU score we get from fairseq train ... is not the final score, because it is computed before unk replacement?

jgehring commented 7 years ago

So the BLEU score we get from fairseq train ... is not the final score, because it is computed before unk replacement?

Yes, that's true. If you're using BPE, the BLEU score reported will also be over BPE tokens and not over actual words.

Regarding your error, did you try the Moses version from Github? I haven't seen that error before (maybe @michaelauli has?). You should probably try asking on the Moses mailing list (see moses-smt/mosesdecoder#160).

nikefd commented 7 years ago

OK, I will ask this question on the Moses mailing list. Thanks!

nikefd commented 7 years ago

Hi, Jonas. I found that if I limit the sentence length to 50 words, it runs successfully, so I have now built the alignment file. Thanks!
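The length filter I applied before alignment can be sketched like this (a hypothetical helper, not part of fairseq or Moses):

```python
def filter_long_pairs(src_lines, tgt_lines, max_len=50):
    # Keep only pairs where both sides have at most max_len tokens;
    # symal segfaulted on very long sentences in my data.
    kept = []
    for s, t in zip(src_lines, tgt_lines):
        if len(s.split()) <= max_len and len(t.split()) <= max_len:
            kept.append((s, t))
    return kept
```

Note that dropping pairs changes line numbers, so the filtered files must be the ones fed to build_sym_alignment.py.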

But I have another question: how can I use makealigndict.lua and unkreplace.lua? I tried fairseq makealigndict ... and it failed:

module 'fairseq.scripts.makealigndict' not found:No LuaRocks module found for fairseq.scripts.makealigndict
nikefd commented 7 years ago

I guess I should modify CMakeLists.txt.

# Scripts and main executable
FOREACH(SCRIPT preprocess train tofloat generate generate-lines score optimize-fconv help)
  INSTALL(FILES "${SCRIPT}.lua" DESTINATION "${ROCKS_LUADIR}/fairseq/scripts")
ENDFOREACH(SCRIPT)

I added scripts/makealigndict and scripts/unkreplace here and rebuilt:

# Scripts and main executable
FOREACH(SCRIPT preprocess train tofloat generate generate-lines score optimize-fconv help scripts/makealigndict scripts/unkreplace)
  INSTALL(FILES "${SCRIPT}.lua" DESTINATION "${ROCKS_LUADIR}/fairseq/scripts")
ENDFOREACH(SCRIPT)
jgehring commented 7 years ago

The files in scripts/ are helper scripts for specific setups and are thus not part of the main set of tools. Once fairseq has been installed, it's easy to run them via th. For example, if you use sub-word units like BPE codes then you'll likely never need those scripts. Hence, I'd like to keep them separate.

nikefd commented 7 years ago

OK, got it. But I'm confused: how do I run them via th?

jgehring commented 7 years ago

You should be able to run th scripts/unkreplace.lua -help from your shell prompt.

nikefd commented 7 years ago

Thanks! I hit a problem when I ran th scripts/unkreplace.lua; the error is: argument 1 expected a 'string', got a 'nil'. I found that the indices of some sentences in gen-b*.txt are the same! I have several duplicate indices:

170649 S-52    the cast of big valley where are they now
170650 T-52    big valley cast members
170651 H-52    -0.325046       big valley where are they now
170652 A-52    4 5 6 7 8 9 10
...
170777 S-52    how much if i make a small bathroom <unk>
170778 T-52    how much does it cost to remodel a small bathroom
170779 H-52    -0.655210       how much do you make a small bathroom
170780 A-52    8 2 3 8 5 6 7 8 9

The latter's index should be 818.

nikefd commented 7 years ago

Hi, Jonas. @jgehring Another question: in the makealigndict step, I use th scripts/makealigndict.lua to build the aligndict. With a 1M-sentence dataset it works well.

With a 40M-sentence dataset, it runs into the error below:

Processed 37575000 sentences    
Processed 37600000 sentences    
Processed 37625000 sentences        
/home/myname/torch/install/bin/luajit: /home/myname/torch/install/share/lua/5.1/tds/hash.lua:76: hash index is nil
stack traceback:
    [C]: in function 'assert'
    /home/myname/torch/install/share/lua/5.1/tds/hash.lua:76: in function '__index'
    scripts/makealigndict.lua:55: in main chunk
    [C]: in function 'dofile'
    .../torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
    [C]: at 0x00406670

I try to find the problem but failed. I look into hash.lua:76

function hash:__index(lkey)
   local lval
   assert(self)
   assert(lkey or type(lkey) == 'boolean', 'hash index is nil')
   elem.set(key__, lkey)
   if C.tds_hash_search(self, key__, val__) == 0 then
      lval = elem.get(val__)
   end
   return lval
end

This happens because lkey is nil and type(lkey) ~= 'boolean'. Then I looked into makealigndict.lua:

while true do
    local s = source:read()
    if s == nil then
        break
    end
    local t = target:read()
    local a = alignment:read()

    local stoks = tokenize(s)
    local ttoks = tokenize(t)
    local atoks = tokenize(a)
    for _, atok in ipairs(atoks) do
        local apair = tablex.map(tonumber, stringx.split(atok, '-'))
        local stok = stoks[apair[1] + 1]
        local ttok = ttoks[apair[2] + 1]
        if not dict[stok] then
            dict[stok] = tds.Hash()
        end
        if not dict[stok][ttok] then
            dict[stok][ttok] = 1
        else
            dict[stok][ttok] = dict[stok][ttok] + 1
        end
    end

    n = n + 1
    if n % 25000 == 0 then
        print(string.format('Processed %d sentences', n))
    end
end
print(string.format('Processed %d sentences', n))

So it is because ttok is nil (and type(ttok) ~= 'boolean'). I used sed to cut out the part that ran into the error: sed -n '37625000,37650000p' dataset.y > looooook.txt. What should I do next?

jgehring commented 7 years ago

Sorry for the delay! Regarding the duplicate indices: this is strange. The indices should correspond to the line numbers in the text file that was binarized. How did you run binarization?

Regarding the aligndict issue: thanks for tracking this down -- you're reading the code wrong, though: the assert triggers when the key is nil (it checks the key itself, which is falsy both for nil and for false, hence the additional check for boolean). Modify the function like this and process the part that threw the error:

while true do
    local s = source:read()
    if s == nil then
        break
    end
    local t = target:read()
    local a = alignment:read()

    local stoks = tokenize(s)
    local ttoks = tokenize(t)
    local atoks = tokenize(a)
    for _, atok in ipairs(atoks) do
        local apair = tablex.map(tonumber, stringx.split(atok, '-'))
        local stok = stoks[apair[1] + 1]
        local ttok = ttoks[apair[2] + 1]

        if not stok then
            error(string.format("Source token is nil: %s at %d in '%s'", stok,  apair[1] + 1, s))
        end
        if not ttok then
            error(string.format("Target token is nil: %s at %d in '%s'", ttok,  apair[2] + 1, t))
        end

        if not dict[stok] then
            dict[stok] = tds.Hash()
        end
        if not dict[stok][ttok] then
            dict[stok][ttok] = 1
        else
            dict[stok][ttok] = dict[stok][ttok] + 1
        end
    end

    n = n + 1
    if n % 25000 == 0 then
        print(string.format('Processed %d sentences', n))
    end
end
print(string.format('Processed %d sentences', n))

I would guess the alignment in question contains invalid indices (the code should just skip these, but there doesn't seem to be any error checking).
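For reference, the skipping behavior could look like this as a standalone Python sketch (not the actual makealigndict.lua; the most-frequent-translation reduction at the end is what unk replacement ultimately needs):

```python
from collections import defaultdict

def build_align_dict(src_lines, tgt_lines, align_lines):
    # Count source-word -> target-word links from Pharaoh-format
    # alignments such as "0-0 1-2", skipping out-of-range indices.
    counts = defaultdict(lambda: defaultdict(int))
    for s, t, a in zip(src_lines, tgt_lines, align_lines):
        stoks, ttoks = s.split(), t.split()
        for pair in a.split():
            i, j = (int(x) for x in pair.split("-"))
            if i < len(stoks) and j < len(ttoks):  # skip invalid pairs
                counts[stoks[i]][ttoks[j]] += 1
    # Keep only the most frequent target word per source word.
    return {src: max(tgt, key=tgt.get) for src, tgt in counts.items()}
```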

nikefd commented 7 years ago

You are welcome! I located the error. The source sentence is creeper rap ||| itunes boom boom boom boom boom boom boom and the target sentence is creeper rap video. The alignment line is 0-0 1-8, and it fails because of the 8: scripts/makealigndict.lua:70: Target token is nil: nil at 9 in 'creeper rap video'. I guess it is because of the |||, which you use in build_sym_alignment.py line 74? Thanks! Oh, and yes, I read the code wrong; I've edited my comment to avoid misleading others.

nikefd commented 7 years ago

Another question: I was confused by the attention index out of bound: error when I ran unkreplace.

if attn < 1 or attn > #stoks then
    io.stderr:write(string.format(
        'Sentence %d: attention index out of bound: %d\n',
        i, attn))
    htoks[j] = ''
end

My understanding is this: the attention scores that search.lua returns are (targetlength x sourcelength) entries for every sentence, and in hooks.lua you use torch.max to get the index of the maximum score for every target word: local _, maxattns = torch.max(attns[hindex], 2). So the index into attns should be at most sourcelength, right?

I don't know what would cause attention index out of bound: to happen. Thanks!

jgehring commented 7 years ago

Ok, so it seems that the alignment is really off. The target sentence has a length smaller than 8. That '|||' marker shouldn't be in the file that you feed to makealigndict. Did you run it with the aligned.grow-diag-final-and file?

I don't know what will cause the attention index out of bound: happen.

This is just a sanity check since the unkreplace tool reads external data.

nikefd commented 7 years ago

Yes, I didn't remove the '|||' in the file. I will do that, thanks!

This is just a sanity check since the unkreplace tool reads external data.

Sorry, I didn't express myself clearly. I added unk replacement to generate-lines.lua, following what you do in unkreplace.lua. But I get some attention index out of bound: errors when I use fairseq generate-lines .... For example, I input hello and get the result below:

A   1 2
H:  hello
A   1 2 1
Sentence 2: attention index out of bound: 2
H:  hello <unk>

But I can't figure out why this happens. Thanks.

jgehring commented 7 years ago

Just to double-check, the ||| markers are in the aligned.grow-diag-final-and file?

nikefd commented 7 years ago

What do you mean? The aligned.grow-diag-final-and file contains only numbers. I used train.x and train.y to make aligned.grow-diag-final-and, and train.x contains the sentence creeper rap ||| itunes boom boom boom boom boom boom boom.

nikefd commented 7 years ago

I am running two trainings at the same time: one on 40M sentences, and one on 1M sentences that I cut from the 40M set with head for testing.

With the 1M set it works well, because train.x and train.y don't contain the '|||' marker. I got aligndict.th7 and can run unkreplace now, but it can't replace every unk; I still get the attention index out of bound: error.

With the 40M set, it ran into an error while building the aligndict because train.x contains the '|||' marker. I will remove it and run it again.

nikefd commented 7 years ago

Hi, Jonas. @jgehring I think I found why this happens. In model.lua, you do maxlen + 1 steps to give the model a chance to predict EOS. So if the source is hello and the target is hello <unk>, the attention entry has size 3 x 2. When you want to replace the unk, the index is 2, which means EOS, but the real source length is 1. So attention index out of bound happens. Is that right? Thanks!
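To make the off-by-one concrete, here is a tiny hypothetical helper (Python, not fairseq code) that flags exactly these indices:

```python
def out_of_bound_attn(attn_idx, src_len):
    # With maxlen + 1 generation steps, an attention index can equal
    # src_len + 1, i.e. point at the source EOS rather than a real word.
    return [a for a in attn_idx if not (1 <= a <= src_len)]
```

For the hello example above, the A-line 1 2 1 against a one-word source flags index 2, matching the warning unkreplace prints.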

jgehring commented 7 years ago

Yes that's entirely possible, good catch!

nikefd commented 7 years ago

OK, thanks!