SteveKommrusch / OpenNMT-py-ggnn-example

Training, validation, and test files for GGNN encoder option in OpenNMT-py.

build_vocab #3

Open wulidongdong opened 3 years ago

wulidongdong commented 3 years ago

Hi Steve,

I found that if I use the build_vocab script in the current OpenNMT-py version (2.0.0), the output vocab file is not compatible with the GGNN encoder. Training raises the following error:

Traceback (most recent call last):
  File "/home/cike/.local/bin/onmt_train", line 10, in <module>
    sys.exit(main())
  File "/home/cike/.local/lib/python3.6/site-packages/onmt/bin/train.py", line 169, in main
    train(opt)
  File "/home/cike/.local/lib/python3.6/site-packages/onmt/bin/train.py", line 154, in train
    train_process(opt, device_id=0)
  File "/home/cike/.local/lib/python3.6/site-packages/onmt/train_single.py", line 107, in main
    valid_steps=opt.valid_steps)
  File "/home/cike/.local/lib/python3.6/site-packages/onmt/trainer.py", line 244, in train
    report_stats)
  File "/home/cike/.local/lib/python3.6/site-packages/onmt/trainer.py", line 368, in _gradient_accumulation
    with_align=self.with_align)
  File "/home/cike/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/cike/.local/lib/python3.6/site-packages/onmt/models/model.py", line 45, in forward
    enc_state, memory_bank, lengths = self.encoder(src, lengths)
  File "/home/cike/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/cike/.local/lib/python3.6/site-packages/onmt/encoders/ggnn_encoder.py", line 182, in forward
    prop_state[i][j][token] = 1
IndexError: index 64 is out of bounds for axis 0 with size 64

But it works fine when I use the srcvocab.txt which is provided in this repo. Do you have any idea how to solve this problem? Thanks for your time.

Xin Wu

SteveKommrusch commented 3 years ago

Xin Wu,

Yes, the current implementation hard-codes a small vocabulary into the RNN size (the vocab can't be larger than the GNN size). I'm working to fix that and have been testing an embedding layer. I'll try to have something testable by Friday.

Regards, Steve
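
To make the failure mode concrete, here is a minimal sketch (not the actual ggnn_encoder.py code) of why a token id of 64 cannot be written into a state vector of size 64, and how an embedding lookup removes that limit. The names `one_hot_state`, `make_embedding_table`, and `embedded_state` are hypothetical, and `STATE_DIM = 64` is assumed from the "size 64" in the traceback.

```python
import random

STATE_DIM = 64  # assumed GGNN state size, taken from the "size 64" in the traceback


def one_hot_state(token_id, state_dim=STATE_DIM):
    """Initialize a node state by setting position token_id to 1."""
    state = [0.0] * state_dim
    if token_id >= state_dim:
        # Same failure mode as the traceback: the token id does not fit.
        raise IndexError(
            f"index {token_id} is out of bounds for axis 0 with size {state_dim}"
        )
    state[token_id] = 1.0
    return state


def make_embedding_table(vocab_size, state_dim=STATE_DIM, seed=0):
    """Build a vocab_size x state_dim table of small random vectors."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(state_dim)]
            for _ in range(vocab_size)]


def embedded_state(token_id, table):
    """Initialize a node state by looking the token up in the table."""
    return list(table[token_id])


table = make_embedding_table(vocab_size=1000)
print(len(one_hot_state(63)))            # ids 0..63 fit in a 64-wide state
print(len(embedded_state(999, table)))   # any id below vocab_size works
```

With the one-hot scheme the vocabulary size is capped by the state size; with a lookup table the two are independent, which is the idea behind the embedding layer mentioned above.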

On Sun, Jan 17, 2021 at 5:40 AM wulidongdong notifications@github.com wrote:

wulidongdong commented 3 years ago

Thank you Steve, that would be great and helpful! I am wondering whether I can use the old OpenNMT preprocess script to generate the vocab files. If so, which version should I use?

SteveKommrusch commented 3 years ago

Xin Wu,

I have my embedding code passing tests but I'm working through the checkers now for a clean pull request. The new pull request will allow for larger vocabularies and handle the old and new vocab formats, but the vocab file must include , ',', and the numbers up to the node count (so that edge information can be supplied).

To learn a bit more about the setup, you can look at my example GitHub file here: https://github.com/SteveKommrusch/OpenNMT-py-ggnn-example/blob/master/src/setupgraph2seq.sh. That file includes a Perl line that processes a raw vocab file to add the extra tokens.

Regards, Steve
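
As a rough illustration of the vocab adjustment described above, the sketch below appends the extra tokens to a raw vocab list. It is not the Perl line from setupgraph2seq.sh. Several things here are assumptions: the function name `add_ggnn_tokens` is hypothetical, the delimiter token name `<EOT>` is a guess for the token missing from the text above, the "token<TAB>count" line format is assumed, the placeholder count of 1 is arbitrary, and "numbers up to the node count" is read as 0 through node_count - 1.

```python
def add_ggnn_tokens(vocab_lines, node_count, delimiter="<EOT>"):
    """Return vocab lines with delimiter, ',', and node numbers appended if missing.

    vocab_lines: list of "token<TAB>count" strings (assumed format).
    node_count:  number of graph nodes; numbers 0..node_count-1 are added (assumed range).
    delimiter:   assumed name of the section delimiter token.
    """
    # Collect the tokens already present so nothing is added twice.
    present = {line.split("\t")[0] for line in vocab_lines if line.strip()}
    required = [delimiter, ","] + [str(n) for n in range(node_count)]
    # New entries get a placeholder count of 1.
    extra = [f"{tok}\t1" for tok in required if tok not in present]
    return list(vocab_lines) + extra


raw = ["a\t120", "+\t80", "0\t75"]
for line in add_ggnn_tokens(raw, node_count=3):
    print(line)
```

Existing entries keep their position and counts; only missing tokens are appended at the end.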

On Mon, Jan 18, 2021 at 12:17 AM wulidongdong notifications@github.com wrote:

SteveKommrusch commented 3 years ago

Xin Wu,

I have created a pull request to OpenNMT here: https://github.com/OpenNMT/OpenNMT-py/pull/1998

The changes to ggnn_encoder.py add an embedding layer, which allows an arbitrarily large vocab to be used. I also updated my example code (which relies on the new GGNN code) here: https://github.com/SteveKommrusch/OpenNMT-py-ggnn-example#graph-input-processing-end-to-end-example. With the new ggnn_encoder.py, you can now use the current onmt_build_vocab to create a vocab file that can be easily adjusted for GGNN usage. That end-to-end example also has a script that can help format textual trees like (a + (b * c) ) into the tree structures used by the GGNN.

Let me know if I can help more. If you can't download from my pull request, I could send you the ggnn_encoder.py directly.

Regards, Steve
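
For readers curious what turning a textual tree like (a + (b * c) ) into a graph can look like, here is a small sketch. It is not the script from the end-to-end example; the function names and the node/edge layout (a list of labels plus parent-to-child index pairs) are hypothetical, and it only handles fully parenthesized binary expressions.

```python
def tokenize(text):
    """Split a parenthesized expression into tokens."""
    return text.replace("(", " ( ").replace(")", " ) ").split()


def parse(tokens, pos=0):
    """Parse an atom or a '(left op right)' expression.

    Returns (tree, next_pos); tree is a str leaf or an (op, left, right) tuple.
    """
    tok = tokens[pos]
    if tok != "(":
        return tok, pos + 1
    left, pos = parse(tokens, pos + 1)
    op = tokens[pos]
    right, pos = parse(tokens, pos + 1)
    if tokens[pos] != ")":
        raise ValueError("expected ')'")
    return (op, left, right), pos + 1


def to_graph(tree):
    """Flatten a tree into node labels and (parent, child) index pairs."""
    nodes, edges = [], []

    def visit(t):
        idx = len(nodes)
        if isinstance(t, str):
            nodes.append(t)
            return idx
        op, left, right = t
        nodes.append(op)  # the operator is the parent node
        for child in (left, right):
            edges.append((idx, visit(child)))
        return idx

    visit(tree)
    return nodes, edges


tree, _ = parse(tokenize("(a + (b * c) )"))
nodes, edges = to_graph(tree)
print(nodes)
print(edges)
```

Node indices are assigned in pre-order, so the root operator is always node 0 and each edge refers to nodes by their position in the label list.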

SteveKommrusch commented 3 years ago

The pull request has been accepted, so GGNN now supports an embedding layer in the main OpenNMT-py branch.