choosewhatulike / sparse-sharing

Codes for "Learning Sparse Sharing Architectures for Multiple Tasks"
MIT License

Some questions about the applicable areas of sparse sharing #3

Closed SicongLiang closed 3 years ago

SicongLiang commented 3 years ago

Hi, I have read your paper, which is very illuminating.

But I have some questions about it: have you guys tried to apply this method in other areas such as CV? If so, does it still work? I'm trying to adapt your method to MTL CV problems, so I'm curious about your thoughts :)

BTW, an early Happy New Year to you guys!!!

txsun1997 commented 3 years ago

Thanks for your appreciation! I think applying sparse sharing in CV is a very promising direction, but there are some challenges: CNNs are more heterogeneous than RNNs, so I'm not sure in which part of the network we should apply pruning. But I think that if pruning works in CNNs, sparse sharing should also work for MTL in CV.
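To make the idea concrete for your adaptation, here is a minimal sketch (illustrative PyTorch, not our actual implementation): every task owns a fixed binary mask over the shared parameters, and the subnetwork used for a task is the element-wise product of the shared weights and that task's mask. The `SparseSharedLinear` class and its shapes are just placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseSharedLinear(nn.Module):
    """A shared linear layer gated by per-task binary masks (sketch only)."""

    def __init__(self, in_dim, out_dim, num_tasks):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)  # shared parameters
        # one fixed 0/1 mask per task; in the paper the masks come from
        # iterative magnitude pruning, here they simply start as all-ones
        self.register_buffer("masks", torch.ones(num_tasks, out_dim, in_dim))

    def forward(self, x, task_id):
        # subnetwork for this task = shared weights * its binary mask
        masked_weight = self.linear.weight * self.masks[task_id]
        return F.linear(x, masked_weight, self.linear.bias)
```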

Here is another application in machine translation: https://arxiv.org/abs/2012.10586. Hope that helps.

Happy New Year!

SicongLiang commented 3 years ago

Thanks for replying to my issue so quickly! Based on your explanation and code, it seems that the generated masks are applied at the parameter level. Now I'm curious why you think "CNNs are more heterogeneous than RNNs, so I'm not sure in which part of the network we should apply pruning", since I'm not very familiar with NLP tasks compared with CV tasks. Are there any fundamental differences in the applicable regions, in your opinion?

Thanks again for your time!

txsun1997 commented 3 years ago

RNNs are composed of several homogeneous MLPs that control the gates and process the features, while CNNs contain convolutions, pooling, and MLPs. So in RNNs we can learn to mask parameters without special consideration of where the parameters come from, but maybe we cannot do this in CNNs. For a convolution kernel, I think it's a bit weird to have it partially shared between multiple tasks. Instead, I would suggest doing sparse sharing at the structural level with structural pruning.
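To illustrate the difference (assumed shapes, purely illustrative code): an element-wise mask is natural for an RNN weight matrix, whereas for a CNN a structural mask would keep or drop whole filters.

```python
import torch

# RNN-style weight matrix: an element-wise 0/1 mask is fine because all
# entries play the same role (rows of a gate's affine transform).
w_rnn = torch.randn(4 * 256, 256)            # e.g. LSTM input-to-hidden weights
m_rnn = (torch.rand_like(w_rnn) > 0.5).float()
w_rnn_task = w_rnn * m_rnn                   # partially shared at parameter level

# Conv kernel: masking single entries inside a 3x3 kernel is weird; a
# structural mask instead keeps or drops entire output filters.
w_cnn = torch.randn(64, 32, 3, 3)            # (out_channels, in_channels, kH, kW)
filter_mask = (torch.rand(64) > 0.5).float()
w_cnn_task = w_cnn * filter_mask.view(-1, 1, 1, 1)   # whole filters shared or not
```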

SicongLiang commented 3 years ago

Sorry to bother you guys again. I'm curious about "# Params" in Table 3 of your paper. As far as I can tell, this approach has to learn multiple masks for different tasks, and only the backbone parameters that are masked out by all of the masks can be truly deleted, right? Besides, referring to Table 7, there isn't much overlap among the parameters masked by all the tasks. In addition, this approach also has to store the parameters of all the task-specific masks, so I'd like to know how you get "396k" and "662k" for CoNLL-2003 and OntoNotes 5.0, respectively.

Thanks a lot!

txsun1997 commented 3 years ago

"# Params" for sparse sharing is exactly the number of parameters covered by the union of all tasks' masks (the binary masks themselves are not counted). Table 7 only shows the overlap ratio, not the union.
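A rough sketch of that counting, assuming one flat 0/1 mask per task (the mask shapes and names are illustrative, not our actual code):

```python
import torch

def count_shared_params(masks):
    """# Params = size of the union of all tasks' subnetworks."""
    union = torch.zeros_like(masks[0], dtype=torch.bool)
    for m in masks:
        union |= m.bool()        # a weight counts if ANY task keeps it
    return int(union.sum())

# e.g. two tasks over a 1M-parameter backbone
masks = [(torch.rand(1_000_000) > 0.6).float() for _ in range(2)]
print(count_shared_params(masks))
```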