Retrieving source and target vocabs

freesunshine0316 / semantic-nmt

Code corresponding to our paper "Semantic Neural Machine Translation using AMR"


Retrieving source and target vocabs #1

Open: Fije opened this issue 5 years ago

Fije commented 5 years ago

Hi, I was wondering how exactly the source and target vocabs for the Dual2Seq experiment are built. Are you using one of your get_vocab scripts? Do you simply concatenate the sentence words with the AMR node and edge labels for the source side?

Many thanks, Fije

freesunshine0316 commented 5 years ago

Hi @Fije, our vocabularies are included in the released data (see the repo homepage). To extract vocabularies for new data, we merge the vocabulary of the source-side AMRs with the vocabulary of the source-side BPE sequences to form the source-side vocabulary. The target-side vocabulary comes from the target-side BPE sequences alone.

In summary, (source AMRs, source BPE sequences) ==> source vocab, (target BPE sequences) ==> target vocab.
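For readers preparing new data, the mapping above can be sketched roughly as follows. This is a minimal illustration and not the repository's own script: the `build_vocab` helper, the toy sentences, and the assumption that AMR tokens and BPE tokens are available as whitespace-separated lines are all hypothetical.

```python
from collections import Counter


def build_vocab(lines):
    """Count whitespace-separated tokens over an iterable of lines (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


# Toy data standing in for real files; the actual data format may differ.
src_bpe = ["the bo@@ y wants to go", "the girl wants to sle@@ ep"]
src_amr = ["want-01 boy go-02", "want-01 girl sleep-01"]
tgt_bpe = ["der Jun@@ ge will gehen", "das Kind will schlafen"]

# (source AMRs, source BPE sequences) ==> source vocab
source_vocab = build_vocab(src_amr) + build_vocab(src_bpe)

# (target BPE sequences) ==> target vocab
target_vocab = build_vocab(tgt_bpe)

print(sorted(source_vocab))
print(sorted(target_vocab))
```

Adding two `Counter` objects sums the counts of tokens that occur on both sides, so a token shared by the AMR and BPE vocabularies appears only once in the merged source vocabulary.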

Fije commented 5 years ago

Clear! One more question about this: I take it that the AMR edges and nodes are extracted automatically (line 156 in G2S_trainer.py), so I don't need to add them manually to my source file "word_vec_src_path". Is that correct?

freesunshine0316 commented 5 years ago

Hi, that is only the case for edge labels. We treat the AMR concepts as "words", so they should be in the word_vec_src_path. By the way, if too many AMR concepts are extracted, you may need to prune some of them before concatenating with the source BPE vocabulary.
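The thread does not say how the pruning is done. One common option is a frequency cutoff, sketched below under that assumption. The `prune_concepts` function, its `min_count` and `max_size` parameters, and the toy counts are hypothetical and not taken from the repository.

```python
from collections import Counter


def prune_concepts(concept_counts, min_count=2, max_size=None):
    """Keep AMR concepts seen at least min_count times, optionally capped at max_size.

    Hypothetical helper: the thresholds are illustrative, not the authors' settings.
    """
    kept = [(tok, c) for tok, c in concept_counts.most_common() if c >= min_count]
    if max_size is not None:
        kept = kept[:max_size]
    return Counter(dict(kept))


# Toy counts standing in for concepts extracted from real AMRs and BPE text.
amr_counts = Counter({"want-01": 5, "boy": 3, "rare-concept-99": 1})
bpe_counts = Counter({"the": 10, "bo@@": 3, "y": 3})

# Prune the AMR concepts first, then merge with the source BPE vocabulary.
pruned = prune_concepts(amr_counts, min_count=2)
source_vocab = bpe_counts + pruned

print(sorted(source_vocab))
```

Concepts dropped here would fall back to whatever unknown-word handling the model uses, so the cutoff trades vocabulary size against coverage of rare concepts.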