Amazing-J / structural-transformer

Code corresponding to our paper "Modeling Graph Structure in Transformer for Better AMR-to-Text Generation" (EMNLP-IJCNLP 2019).

Data Preprocessing #1

Open · Cartus opened 5 years ago

Cartus commented 5 years ago

Hi, thanks for the great work!

I am trying to run the code, but I don't know how to preprocess the AMR corpus. May I ask how the data preprocessing is done?

Amazing-J commented 5 years ago

Our baseline input is the same linearized AMR graph as in Konstas et al. Only the concept nodes are retained as input to the transformer model.

-train_src # concept node sequence
-train_structure1 # first token of each x_i-to-x_j path
-train_structure2 # second token of each x_i-to-x_j path
...
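If it helps to picture what the structure files hold, here is a minimal sketch, assuming each structure sequence is the label path between concept pairs in the AMR graph. This is an illustration only, not the repo's actual preprocessing: the toy graph, the use of networkx, the undirected shortest path (the paper also marks edge direction), and the padding token are all my own assumptions.

```python
import networkx as nx

# Toy AMR fragment: (w / want-01 :ARG0 (b / boy) :ARG1 (g / go-02))
graph = nx.Graph()
graph.add_edge("want-01", "boy", label="ARG0")
graph.add_edge("want-01", "go-02", label="ARG1")

concepts = ["want-01", "boy", "go-02"]

def label_path(src, dst):
    """Edge labels along the shortest path from src to dst."""
    nodes = nx.shortest_path(graph, src, dst)
    return [graph.edges[a, b]["label"] for a, b in zip(nodes, nodes[1:])]

def structure_file(k, pad="None"):
    """One line per concept x_i: the k-th path element toward every x_j."""
    lines = []
    for xi in concepts:
        row = []
        for xj in concepts:
            path = [] if xi == xj else label_path(xi, xj)
            row.append(path[k] if k < len(path) else pad)
        lines.append(" ".join(row))
    return lines

print(structure_file(0))  # roughly what train_structure1 would hold
print(structure_file(1))  # roughly what train_structure2 would hold
```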

Cartus commented 5 years ago

Hi @Amazing-J ,

Thank you for your prompt reply!

For the concept node sequence, I can use NeuralAmr (https://github.com/sinantie/NeuralAmr) to get the linearized sequence.

I also have two questions. First, how do I construct the structural sequences? Second, since the model segments the input into sub-word units with BPE, how should the concept node sequence be generated under this setting?

dungtn commented 5 years ago

Hi @Amazing-J,

Thank you for releasing the code! As @Cartus pointed out, could you provide the code for applying BPE to the source side, i.e. the linearized AMRs?

Best!

dungtn commented 5 years ago

Assuming that I've done the right thing for BPE by running

```
subword-nmt learn-bpe -s 10000 < ...LDC2015E86/training_source > codes.bpe
subword-nmt apply-bpe -c codes.bpe < ...LDC2015E86/dev_source > dev_source_bpe
```

then I still got this error:

FileNotFoundError: [Errno 2] No such file or directory: ...LDC2015E86/data_vocab.pt

How can I generate this file?

dungtn commented 5 years ago

Alright, I found out that I also have to run preprocess.sh. Thanks!
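So, to summarize for anyone hitting the same error: learn and apply BPE to the source files first (as above), then run preprocess.sh, which is what generates data_vocab.pt, before training.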