cookielee77 / DAST

Domain Adaptive Text Style Transfer, EMNLP 2019

an error occurs when running the code #1

Closed wh-yu closed 5 years ago

wh-yu commented 5 years ago

Hi @cookielee77, thanks for your awesome work and code. I ran the command below, but the following error occurred.

command:

export TARGET_DATASET=yelp
export SOURCE_DATASET=filter_imdb
export DA_NETWORK=DAST
export TARGET_DATASET_PORTION=1
CUDA_VISIBLE_DEVICES=0 python train_domain_adapt.py --domain_adapt --dataset ${TARGET_DATASET} --source_dataset ${SOURCE_DATASET} --network ${DA_NETWORK} --training_portion ${TARGET_DATASET_PORTION}

error:

Traceback (most recent call last):
  File "train_domain_adapt.py", line 172, in <module>
    target_vocab = Vocabulary(args.target_vocab)
  File "/home/yuweihao/code/DAST/vocab.py", line 19, in __init__
    with open(vocab_file, 'rb') as f:
FileNotFoundError: [Errno 2] No such file or directory: 'data/yelp/vocab'

Thank you very much~

wh-yu commented 5 years ago

Hi @cookielee77, I have solved the problem. I found that I should first run python train_classifier.py --dataset ${TARGET_DATASET}; the vocab file is then generated.
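For anyone else hitting this, a minimal pre-flight check along these lines makes the ordering explicit (hypothetical helper, not part of the repo; the path data/<dataset>/vocab simply mirrors the one in the traceback):

import os
import sys

def check_vocab(dataset):
    # Hypothetical helper: fail early with a hint if the vocab file that
    # train_classifier.py generates is missing.
    vocab_path = os.path.join("data", dataset, "vocab")
    if not os.path.exists(vocab_path):
        sys.exit("%s not found. Run `python train_classifier.py --dataset %s` first."
                 % (vocab_path, dataset))

check_vocab("yelp")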

wh-yu commented 5 years ago

Hi @cookielee77, I am running the code, but it seems very slow. I am wondering how long one epoch takes for you. Thank you very much.

cookielee77 commented 5 years ago

Ah, I don't remember the exact running time per epoch. Usually, one task can be finished in half a day.

wh-yu commented 5 years ago

Hi @cookielee77, thank you very much for the information. I checked my environment, and the code now runs fast. (^_^) Thank you!

wh-yu commented 5 years ago

Hi @cookielee77, the accuracy of my trained yelp style classifier is 97.62%, and the accuracy of the domain classifier is 95.91%.

I then ran the task with the setting shown in the README:

export TARGET_DATASET=yelp
export SOURCE_DATASET=filter_imdb
export DA_NETWORK=DAST
export TARGET_DATASET_PORTION=1
CUDA_VISIBLE_DEVICES=0 python train_domain_adapt.py --domain_adapt --dataset ${TARGET_DATASET} --source_dataset ${SOURCE_DATASET} --network ${DA_NETWORK} --training_portion ${TARGET_DATASET_PORTION}

The results are shown in the following:

--------------------epoch 20--------------------
learning_rate: 0.0005  gamma: 0.1000
step 80000, time 29773s, td_loss_rec 1.03, td_loss_g 0.28, td_loss_d 0.00, sd_loss_rec 3.03, sd_loss_g 0.78, sd_loss_d 0.00, domain_loss 9.26
---evaluating target domain:
valid td_loss_rec 0.26, td_loss_g 0.45, td_loss_d 0.09
domain acc: 0.9057
transfer acc: 0.9099
Bleu score: 0.6103
step 81000, time 30204s, td_loss_rec 1.00, td_loss_g 0.29, td_loss_d 0.00, sd_loss_rec 3.17, sd_loss_g 0.73, sd_loss_d 0.00, domain_loss 9.25
---evaluating target domain:
valid td_loss_rec 0.27, td_loss_g 0.36, td_loss_d 0.09
domain acc: 0.9080
transfer acc: 0.9249
Bleu score: 0.5923
step 82000, time 30649s, td_loss_rec 0.98, td_loss_g 0.29, td_loss_d 0.00, sd_loss_rec 3.60, sd_loss_g 0.73, sd_loss_d 0.00, domain_loss 9.34
---evaluating target domain:
valid td_loss_rec 0.25, td_loss_g 0.36, td_loss_d 0.09
domain acc: 0.9080
transfer acc: 0.9243
Bleu score: 0.5859
step 83000, time 31094s, td_loss_rec 0.97, td_loss_g 0.30, td_loss_d 0.00, sd_loss_rec 3.00, sd_loss_g 0.69, sd_loss_d 0.00, domain_loss 9.39
---evaluating target domain:
valid td_loss_rec 0.24, td_loss_g 0.41, td_loss_d 0.09
domain acc: 0.8873
transfer acc: 0.9179
Bleu score: 0.6007
---testing target domain:
test td_loss_rec 0.30, td_loss_g 0.38, td_loss_d 0.10
domain acc: 0.8963
transfer acc: 0.9200
Bleu score: 0.5898

As shown, the domain acc is 89.6% and the transfer acc is 92.0%, while the paper reports domain acc 95.8% and transfer acc 92.3%. There is a gap in domain acc. What should I pay attention to when running the code to reach a comparable domain acc?

Thank you very much for your help.

cookielee77 commented 5 years ago

You should use the online-test results, NOT the final testing results (the final testing is only used for datasets that don't have human references). The best online-test results are chosen based on a trade-off between style accuracy and BLEU. The online-test results are also printed in the log. Sometimes the results vary somewhat depending on the random seed or similar factors.

Please check this part: https://github.com/cookielee77/DAST/blob/master/train_domain_adapt.py#L220

Additionally, we provide our test samples in the repo; you can use those directly for comparison. Usually, you should set training_portion <= 0.1, which is where our models perform best.
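Roughly, the selection looks like this (illustrative sketch only: the exact criterion is in the linked code, the product of transfer accuracy and BLEU is just an assumed trade-off, and the numbers are copied from the validation log above purely for illustration):

def pick_best(results):
    # results: list of dicts with 'step', 'transfer_acc', 'bleu'
    # Assumed criterion: maximize transfer_acc * bleu.
    return max(results, key=lambda r: r["transfer_acc"] * r["bleu"])

online_log = [
    {"step": 80000, "transfer_acc": 0.9099, "bleu": 0.6103},
    {"step": 81000, "transfer_acc": 0.9249, "bleu": 0.5923},
    {"step": 82000, "transfer_acc": 0.9243, "bleu": 0.5859},
]
print(pick_best(online_log))  # the step-80000 entry wins under this criterion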

wh-yu commented 5 years ago

Hi @cookielee77, thanks a lot for the information. I continued running the code with the setting mentioned above (target dataset portion still 1, since I want to replicate the results of this setting first) and checked the online-test results. I ran the code five times with five different random seeds. The best results I observed are:

2019-11-05 15:54:47,035 online-test td_loss_rec 0.44, td_loss_g 0.45, td_loss_d 0.09
2019-11-05 15:54:47,036 domain acc: 0.9380
2019-11-05 15:54:47,036 transfer acc: 0.9150
2019-11-05 15:54:47,176 Bleu score: 0.2629 

There is still a gap between the above results and those reported in the paper. I am wondering which other hyperparameters I should pay attention to, such as the number of training epochs (I use the defaults --pretrain_epochs=10 and --max_epochs=20).

Thank you for your kind help again.

cookielee77 commented 5 years ago

I am not very sure about this. I remember that the results I got on my server before releasing the code were comparable, using the same parameters as in the repo (I only ran it twice, and the test samples are provided in the samples folder). Your transfer acc is comparable, but the domain acc is somewhat lower. I don't think the difference is very significant, though, since you get similar results. In my experiments, style transfer tasks evaluated on the test dataset released by the previous paper also show some variance.

Another thing is that the D-acc and S-acc depend on the pre-trained classifiers; that may be another source of the difference on your side. Although you got similar test accuracy, that doesn't mean the classifiers have the same accuracy on the online-test dataset (the online-test and test datasets are different). The online-test dataset is pretty small (1,000 sentences) and suffers more variance in accuracy. I also ran into this when comparing with previous models, because the online-test dataset is not very stable. However, if you use the same style and domain classifiers to evaluate all models fairly, the trends should be similar to those reported in the paper.
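To give a rough sense of that noise, here is a back-of-the-envelope sketch, assuming each of the 1,000 online-test predictions is an independent Bernoulli trial (a simplification):

import math

n = 1000
for p in (0.90, 0.94):
    se = math.sqrt(p * (1 - p) / n)  # standard error of a proportion
    print("acc=%.2f: +/- %.3f at ~95%% confidence" % (p, 1.96 * se))
# acc=0.90: +/- 0.019
# acc=0.94: +/- 0.015

So a swing of a couple of points in domain acc on a 1,000-sentence set is within ordinary sampling noise.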

wh-yu commented 5 years ago

Thanks a lot for your detailed reply! (^_^)