BrambleXu / knowledge-graph-learning


arXiv-2019/05-Are Sixteen Heads Really Better than One? #252

Open BrambleXu opened 5 years ago

BrambleXu commented 5 years ago

Summary:

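This paper was inspired by the one in #235. Whereas #235 only experimented on the Transformer, this paper also tries pruning on BERT. The method differs from #235: following a paper on CNN pruning, it introduces a mask variable (from 0 to 1) that is used to mask attention heads. This is still a post-processing step applied after fine-tuning. The mask of each head is set to 0 in turn, and the evaluation result is computed for each of the 144 heads (12 layers × 12 heads in BERT-base) to gauge how important each head is.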
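The ablation loop described above can be sketched as follows. This is a minimal illustration, not the paper's code: `ablate_heads` and `toy_evaluate` are hypothetical names, and `toy_evaluate` is a made-up stand-in for running a fine-tuned model on an evaluation set with the given head mask applied.

```python
from typing import Callable, List

Mask = List[List[float]]  # mask[layer][head]: 1.0 keeps the head, 0.0 masks it


def ablate_heads(evaluate: Callable[[Mask], float],
                 n_layers: int, n_heads: int) -> List[List[float]]:
    """Mask one head at a time and record the drop in the evaluation score."""
    full = [[1.0] * n_heads for _ in range(n_layers)]
    baseline = evaluate(full)  # score with every head active
    drops = []
    for layer in range(n_layers):
        row = []
        for head in range(n_heads):
            mask = [r[:] for r in full]  # fresh copy of the all-ones mask
            mask[layer][head] = 0.0      # switch off exactly one head
            row.append(baseline - evaluate(mask))
        drops.append(row)
    return drops


def toy_evaluate(mask: Mask) -> float:
    """Assumed toy metric: each head contributes a fixed amount to the score."""
    contribution = [[0.5, 0.0], [0.25, 0.25]]
    return sum(c * m
               for c_row, m_row in zip(contribution, mask)
               for c, m in zip(c_row, m_row))


drops = ablate_heads(toy_evaluate, n_layers=2, n_heads=2)
print(drops)
```

With 12 layers and 12 heads the same loop calls `evaluate` 144 times, once per masked head, plus once for the baseline. A larger drop suggests a more important head; a drop near zero suggests the head could be pruned with little loss.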

Resource:

Paper information:

Notes:

This paper also targets the NMT task, with analysis carried out on both the Transformer and BERT.

Model Graph:

Result:

Thoughts:

Next Reading: