Chen-Wang-CUHK / ExHiRD-DKG

The source code for the ACL 2020 paper "Exclusive Hierarchical Decoding for Deep Keyphrase Generation"
https://arxiv.org/pdf/2004.08511.pdf

how to run CatSeq and CatSeqD by your code? #7

Closed · johncs999 closed 4 years ago

johncs999 commented 4 years ago

Hi Wang, thanks for your code! How can I run CatSeq and CatSeqD with it?

Chen-Wang-CUHK commented 4 years ago

> Hi Wang, thanks for your code! How can I run CatSeq and CatSeqD with it?

Hi, thank you for your interest. You can download the .sh files for CatSeq and CatSeqD via the following link: https://drive.google.com/file/d/1nDscn10W8Dajwvse0XHwkp6HI583OQ1C/view?usp=sharing. The official implementations of these two methods are also public at https://github.com/memray/OpenNMT-kpg-release.

johncs999 commented 4 years ago

Thanks for your detailed reply. BTW, what's the difference between the provided vocab.pt file and the vocab.pt file generated by preprocess.py?

Chen-Wang-CUHK commented 4 years ago

> Thanks for your detailed reply. BTW, what's the difference between the provided vocab.pt file and the vocab.pt file generated by preprocess.py?

The given "vocab.pt" is the vocab before "RmKeysAllUnk". We perform "RmKeysAllUnk" based on the given vocab and choose the given vocab as the final vocab. If we generate the vocab after "RmKeysAllUnk", the "RmKeysAllUnk" may be meaningless since some keyphrases may become all unks based on the generated new vocab.

johncs999 commented 4 years ago

Thanks for your reply. It seems that the preprocessing method here is not the same as that of OpenNMT-kpg-release. I am curious why UNK keyphrases should be removed, since there is a copy mechanism, and if some keyphrases are removed from the test set, the results do not seem comparable with those in OpenNMT-kpg-release. Can you share the preprocessing file? Thanks.

Chen-Wang-CUHK commented 4 years ago

> Thanks for your reply. It seems that the preprocessing method here is not the same as that of OpenNMT-kpg-release. I am curious why UNK keyphrases should be removed, since there is a copy mechanism, and if some keyphrases are removed from the test set, the results do not seem comparable with those in OpenNMT-kpg-release.

Yes, the preprocessing is different from OpenNMT-kpg-release. "RmKeysAllUnk" is not performed on the test set; it is only performed on the absent keyphrases of the training and validation sets. The purpose is to remove invalid absent keyphrases from training, since they encourage the model to produce UNK tokens.
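To illustrate where the filter applies, a hypothetical sketch; the dict-based data layout and split names here are assumptions, not the repo's actual preprocessing format:

```python
# Hypothetical sketch: the filter touches only the absent keyphrases of
# the train/valid splits; the test set is left untouched.

def filter_absent_all_unk(examples, vocab):
    """Drop absent keyphrases that are entirely out-of-vocabulary."""
    for ex in examples:
        ex["absent_keyphrases"] = [
            kp for kp in ex["absent_keyphrases"]
            if any(tok in vocab for tok in kp)  # keep if any token is in-vocab
        ]
    return examples

vocab = {"graph", "neural", "network"}
train = [{"absent_keyphrases": [["graph", "mining"], ["foo", "bar"]]}]
valid = [{"absent_keyphrases": [["neural", "net"]]}]

train = filter_absent_all_unk(train, vocab)
valid = filter_absent_all_unk(valid, vocab)
print(train)  # -> [{'absent_keyphrases': [['graph', 'mining']]}]
```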

It may take a long time to get the preprocessing files, since they are stored on a PC in the lab and unfortunately that PC is broken. Besides, I am not in Hong Kong because of the serious COVID-19 situation. However, I will try to get them and share them with you.

johncs999 commented 4 years ago

Thanks for your kind reply. I found that the numbers of keyphrases do not match the ones in OpenNMT-kpg-release on the SemEval test set:

|  | total | absent | present |
| --- | --- | --- | --- |
| OpenNMT-kpg-release | 1507 | 836 | 671 |
| ExHiRD | 1440 | 812 | 628 |

1. Is that because you removed some duplicated keyphrases after stemming?
2. It seems you use stemmed keyphrases and unstemmed context in training. Do stemmed words and unstemmed words share the same embedding? If not, would there be a semantic gap?
Chen-Wang-CUHK commented 4 years ago

> Thanks for your kind reply. I found that the numbers of keyphrases do not match the ones in OpenNMT-kpg-release on the SemEval test set:
>
> |  | total | absent | present |
> | --- | --- | --- | --- |
> | OpenNMT-kpg-release | 1507 | 836 | 671 |
> | ExHiRD | 1440 | 812 | 628 |
>
> 1. Is that because you removed some duplicated keyphrases after stemming?
> 2. It seems you use stemmed keyphrases and unstemmed context in training. Do stemmed words and unstemmed words share the same embedding? If not, would there be a semantic gap?
1. Yes, removing duplicates after stemming affects the statistics (see the sketch below).
2. During training, both the keyphrases and the context are unstemmed.
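For point 1, here is a minimal sketch of stem-level deduplication using NLTK's PorterStemmer; the exact procedure behind the reported statistics may differ:

```python
# Minimal sketch of removing duplicate keyphrases after stemming.
# Requires: pip install nltk
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def dedup_after_stemming(keyphrases):
    """Keep one keyphrase per stemmed form, preserving first occurrence."""
    seen, kept = set(), []
    for kp in keyphrases:
        key = tuple(stemmer.stem(tok) for tok in kp)
        if key not in seen:
            seen.add(key)
            kept.append(kp)
    return kept

kps = [["neural", "networks"], ["neural", "network"], ["graph", "mining"]]
print(dedup_after_stemming(kps))
# -> [['neural', 'networks'], ['graph', 'mining']]  (stem duplicates merged)
```
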
johncs999 commented 4 years ago

Thanks a lot!

kgarg8 commented 3 years ago

Hi @Chen-Wang-CUHK @johncs999,

Can you share the preprocessing file used to remove the all-UNK absent keyphrases, i.e., the "RmKeysAllUnk" version?

Is it the one shared in the sh/preprocess folder?