Hey,
Thanks for your great work! I have a question about the Breadcrumbs sparsification implementation in https://github.com/arcee-ai/mergekit/blob/57e7d14e2a732f532970e2c9dada00e2d8f15a7a/mergekit/sparsify.py#L61-L100
From the Model Breadcrumbs paper, the top-beta and bottom-gamma pruning appears to be applied to each layer of a task vector independently. In your toolkit's implementation, however, the pruning thresholds seem to be computed globally across all layers of the task vector. Wouldn't this deviate from what the paper describes, and potentially prune incorrectly, if the weight-magnitude statistics differ substantially across layers?
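For concreteness, here is a minimal sketch of the per-layer behavior I understand the paper to describe (the function name, and treating beta/gamma as simple fractions of entries to drop, are my own assumptions for illustration; the paper's exact mask definition may differ in detail):

```python
import torch

def breadcrumbs_mask_per_layer(delta: torch.Tensor, beta: float, gamma: float) -> torch.Tensor:
    """Illustrative per-layer Breadcrumbs-style mask: drop the top-beta fraction
    (large-magnitude outliers) and the bottom-gamma fraction (near-zero noise)
    of entries, ranked by absolute magnitude within THIS layer only."""
    flat = delta.abs().flatten()
    n = flat.numel()
    k_top = int(n * beta)      # entries to drop from the large-magnitude end
    k_bottom = int(n * gamma)  # entries to drop from the small-magnitude end
    ranks = flat.argsort()     # indices sorted ascending by magnitude
    mask = torch.ones(n, dtype=torch.bool)
    if k_bottom > 0:
        mask[ranks[:k_bottom]] = False   # prune smallest entries
    if k_top > 0:
        mask[ranks[n - k_top:]] = False  # prune largest entries
    return mask.reshape(delta.shape)
```

A global variant would instead rank the concatenation of all layers' deltas and apply the same cutoffs once, so a layer whose entries are uniformly small could fall entirely below the global bottom-gamma cutoff and be zeroed out, which is the scenario I'm worried about.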
Please correct me if I am misunderstanding something. Thanks