Hi @NLPpupil,
As I stated in the README file's Discussion section, the reported acceleration (4~7x) is against an un-cached Transformer. The new THUMT has already been engineered with a good caching strategy, and in that case the acceleration only reaches 20%~26% on the NIST Chinese-English translation task (see #1 for more details).
I am not sure about your exact experimental settings. For better efficiency, you can disable the FFN in the AAN sub-layer via the use_ffn option, i.e. use_ffn=False.
Thank you. Anyway, I found that decoding is much faster than Tensor2Tensor (10+ times), but the training cost is also much greater than T2T (about 10 times).
Hi, it is unreasonable for the training-speed difference between Tensor2Tensor and our model to reach 10 times.
Did you use only 1 GPU card for our model during training? In the README file I provided the training settings I used, where only 1 GPU card with 8 sequential updates is employed; however, that was only due to our device limitation. To enable multi-GPU training, you should set batch_size=3125, device_list=[0,1,2,3,4,5,6,7], update_cycle=1.
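For clarity, a quick back-of-the-envelope check of how these options combine, assuming batch_size counts tokens per device and update_cycle accumulates gradients over that many steps before one update (my reading of the options, not an authoritative description of THUMT):

```python
# Rough arithmetic only; the single-GPU line also assumes the README run uses
# batch_size=3125 per device.
def effective_batch(batch_size, num_devices, update_cycle):
    return batch_size * num_devices * update_cycle

# Single-GPU README-style setting: 1 device, 8 sequential updates.
print(effective_batch(3125, 1, 8))  # 25000 tokens per update
# Suggested 8-GPU setting: no gradient accumulation, same effective batch.
print(effective_batch(3125, 8, 1))  # 25000 tokens per update
```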
I didn't use the parameters in the README because I wanted to restrict the run to a single GPU, although we have multiple GPUs. What's more, I also set constant_batch_size=True to make THUMT and Tensor2Tensor comparable.
ok, two points:
update_cycle=1.
Alright, I did set update_cycle=1 when using a single GPU.
I have implemented the AAN in Tensor2Tensor and found that the training speed and learning curve are similar to what the paper demonstrates. In decoding, however, whether I set num_layers to 1 or 6, the speed is almost the same. I think this is perhaps because my dataset is chit-chat conversation, where the source and target sentences are short (usually < 20 words) and the target is always shorter than the source. Also, as I understand it, Tensor2Tensor caches the Transformer decoder by saving the k/v vectors of all previous target inputs, so each step only has to compute the current q/k/v and the dot-product attention. Instead, AAN just accumulates the target inputs, which reduces computation, but you have to add an extra FFN and gating layer to preserve representation power. So it is hard to say that AAN is more efficient than the original, especially with caching. Looking forward to your reply >.<
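To make the comparison concrete, here is a minimal NumPy sketch (illustrative only, not T2T or THUMT code) of the per-step work of the two schemes during decoding: cached self-attention projects the new token and attends over a growing k/v cache, while AAN only updates a running sum and takes the average; the FFN and gating that follow in the real AAN layer are left out here:

```python
import numpy as np

d = 8  # model dimension (illustrative)
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

def cached_attention_step(y_t, k_cache, v_cache):
    """One decode step with a k/v cache: project the new token, append its
    k/v to the cache, then attend over all cached positions (O(t) work)."""
    q, k, v = y_t @ Wq, y_t @ Wk, y_t @ Wv
    k_cache = np.concatenate([k_cache, k[None]], axis=0)
    v_cache = np.concatenate([v_cache, v[None]], axis=0)
    scores = k_cache @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ v_cache, k_cache, v_cache

def aan_step(y_t, running_sum, t):
    """One AAN decode step: update a running sum of the target inputs and
    return the cumulative average (O(1) work, no attention over the past)."""
    running_sum = running_sum + y_t
    return running_sum / t, running_sum

# Drive both for a few decoding steps.
k_cache, v_cache, s = np.zeros((0, d)), np.zeros((0, d)), np.zeros(d)
for t in range(1, 5):
    y_t = rng.standard_normal(d)
    ctx_attn, k_cache, v_cache = cached_attention_step(y_t, k_cache, v_cache)
    ctx_aan, s = aan_step(y_t, s, t)
```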
Hi @Qznan,
Firstly, for AAN, the longer the target sequence is compared to its source counterpart, the better the acceleration that can be reached. What matters most is not the absolute length of the source or target sequence, but their length ratio. So, if your target sequences tend to be shorter than their source counterparts, the acceleration can be degraded.
Secondly, we also conducted experiments against the cached Transformer. You are right that part of AAN's advantage over the original disappears, but AAN is still more efficient: we can still get an acceleration of roughly 10%, and without the FFN, as we suggested, it reaches at least 25%.
In addition, with excellent engineering you can get a good acceleration of ~100%. Hope this helps.
Thank you for your reply. I will try AAN without the FFN. Anyway, it's excellent work!
By the way, I have one more question. You said the acceleration is related to the length of the source sequence, but AAN only optimizes the decoder self-attention layer; it does not affect the cross-attention layer between the source and target sentences, let alone the original FFN layer above them. In other words, I think the cross-attention layer costs the same time in both the original Transformer and AAN. So shouldn't the acceleration depend only on the target sequence length?
Actually, the acceleration depends on the length ratio between the target sentence and the source sentence, i.e. (n_tgt + n_src) / n_src. You can find an informal explanation at https://github.com/bzhangXMU/transformer-aan#discussions
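For illustration only, a quick computation of that ratio for some made-up sentence lengths (not measurements), showing how a relatively longer target pushes the ratio up:

```python
# The sentence lengths below are hypothetical examples.
def ratio(n_src, n_tgt):
    return (n_tgt + n_src) / n_src

print(ratio(n_src=20, n_tgt=25))  # 2.25: target longer than source
print(ratio(n_src=20, n_tgt=10))  # 1.50: target shorter, smaller ratio
```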
I compared AAN with THUMT's base Transformer, both trained on a 1-million-sentence Chinese-English corpus. However, there is no obvious improvement in translation time: both spent 300-400 seconds to translate 10,000 test sentences. So which parameters should I change, or what else should I do, to see a difference in decoding cost?