Great job! I'm curious whether there are any comparative experiments on the window_block_indexes and out_feature_indexes settings. Why is windowed attention applied specifically at layers 0, 1, 3, 6, 7, and 9?
For example, how would increasing or decreasing the number of window_block_indexes affect the metrics? Thanks.