baidu / Dialogue


Number of attention layers exploration #18

Closed szho42 closed 5 years ago

szho42 commented 5 years ago

[attached images: attention_performance_multipe_layers, r128, 3-10-performance_dam]

In our projects, we tested accuracy with different numbers of attention layers. We evaluated the model in two settings: (1) retrieving the true response from 128 candidates (random negatives, similar to the validation data in the Ubuntu dataset, which has 9 false responses per example); (2) retrieving the true response from a large canned list (roughly 300 responses).

As shown in the results, the number of attention layers does not significantly affect accuracy in our case.

Also, in your experience, is a top-3 accuracy of about 50% (out of 128 candidates) a reasonable number?
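
(For context, the top-k numbers above are computed along these lines; a minimal NumPy sketch, where `scores` stands in for the model's matching scores and the true response sits at a known index:)

```python
import numpy as np

def top_k_accuracy(scores, true_index, k=3):
    """Fraction of queries whose true response ranks in the top k.

    scores: (num_queries, num_candidates) matching scores, higher = better.
    true_index: position of the true response in each candidate list.
    """
    # Indices of the k highest-scoring candidates per query.
    top_k = np.argsort(-scores, axis=1)[:, :k]
    hits = (top_k == true_index).any(axis=1)
    return hits.mean()

# Example: 128 candidates per query, true response at index 0.
scores = np.random.rand(1000, 128)
print(top_k_accuracy(scores, true_index=0, k=3))  # ~0.023 for random scores
```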

Also, for production, we tested the throughput. On a high-end GPU, one request with a batch size of 128 takes about 100 ms; however, it is extremely slow on CPU systems.
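
(Roughly how we timed it; a minimal sketch, where `score_fn` is a hypothetical stand-in for the actual DAM forward pass and is assumed to block until results are ready:)

```python
import time

def average_latency(score_fn, context, candidates, warmup=5, runs=50):
    """Average wall-clock seconds per request (one context vs. a candidate batch).

    score_fn: hypothetical stand-in that scores `context` against `candidates`
    and returns the results, so any asynchronous work is forced to finish.
    """
    for _ in range(warmup):            # warm up caches / GPU kernels
        score_fn(context, candidates)
    start = time.perf_counter()
    for _ in range(runs):
        score_fn(context, candidates)
    return (time.perf_counter() - start) / runs

# e.g. one context scored against a batch of 128 candidate responses:
# latency = average_latency(model.score, context, candidate_batch)
```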

Have you tried deploying the DAM model on CPUs?

Cheers

szho42 commented 5 years ago

By the way, all the results came from our own dataset. On the same dataset, our LSTM and BiLSTM models perform much better than the DAM model.

szho42 commented 5 years ago

[attached image] By the way, that is what we got from our basic LSTM matching model.

xyzhou-puck commented 5 years ago

Hi,

1) Yes, we tried deploying on CPUs, and it turned out that you need to do a lot of optimization to speed up the computation; in practice, we use GPUs to support our business.

2) In my experience, the parameter settings and the way you train a deep neural network can significantly impact the model's performance, so you could try different settings, initializations, or training paradigms. For a real-world application, a top-3 accuracy of about 50% is not reasonable, and there are many ways to improve performance besides changing the model, such as enlarging the training data or getting a better initialization. Maybe you can give these a try.
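
To make the "different settings" suggestion concrete, here is a minimal sketch of the kind of sweep I mean; `train_and_evaluate` is a hypothetical stand-in for your own training/evaluation pipeline:

```python
import itertools

def sweep(train_and_evaluate):
    """Grid-search a few settings that often move the needle.

    train_and_evaluate: hypothetical stand-in for your own pipeline;
    it should train a model and return, e.g., top-3 accuracy.
    """
    learning_rates = [1e-3, 5e-4, 1e-4]
    init_scales = [0.01, 0.1]   # weight-initializer scale
    stack_depths = [3, 5, 7]    # number of stacked attention layers
    best = None
    for lr, scale, depth in itertools.product(learning_rates, init_scales, stack_depths):
        acc = train_and_evaluate(lr=lr, init_scale=scale, num_layers=depth)
        if best is None or acc > best[0]:
            best = (acc, {"lr": lr, "init_scale": scale, "num_layers": depth})
    return best

# best_acc, best_config = sweep(my_train_and_evaluate)
```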

Thanks, Xiangyang

szho42 commented 5 years ago

Thanks for the comments.

In our case, the 50% top-3 accuracy comes from retrieving the response from a list of ~300 candidates. Do you think it is still not a reasonable number? Thanks.

xyzhou-puck commented 5 years ago

I think that depends on your purpose. If you want to know whether the model is working, then 50% may be reasonable, as it is much better than a random guess. But if you want to ship the model in a real-world application, then it is not good enough, in my opinion.

szho42 commented 5 years ago

Thanks for your thoughts. In our real-life application, we achieve 87% top-3 accuracy (out of 300 responses) on our own internal dataset. Yes, I totally agree that a 50% chance of suggesting the right utterance in the top 3 (out of 300) might not be ideal, but it is still acceptable in our business use cases. Cheers.