Open micky-git opened 2 years ago
另外static const unsigned int lsb_64table(32位系统) 和static const unsigned long long lsb_64table(64位系统)也有10%的性能提升,应该跟符号位和位数对齐提高cache命中有关
https://digitalsystemdesign.in/booths-multiplication-algorithm/ 根据booth 4 乘法器的规则 ACF 需要 4次位移 7次求补码 12次加法 D4F 需要 3次位移 5次求补码 12次加法 B23 需要 5次位移 8次求补码 13次加法 D9B 需要 7次位移 7次求补码 13次加法
应该可以解释为什么0x782C8D4F在测试中有最佳性能(微小的提升)
在inline uint64_t bsf_folded (uint64_t bb)中 使用了magicnum 0x78291ACF
另外三个magicnum 和对应的table为: 0x78291D9B lsb_64_table_0x78291D9B[64] = { 63,30,3,32,50,14,11,33, 23,51,58,9,54,27,20,34, 61,29,2,52,59,22,41,26, 19,55,1,43,46,18,0,35, 62,31,49,4,5,57,53,6, 60,15,12,40,7,42,45,24, 16,48,56,13,10,39,8,44, 47,28,38,21,25,37,36,17 };
0x782C8D4F lsb_64_table_0x782C8D4F[64] = { 63,30,3,32,59,15,12,33, 60,24,51,22,20,43,9,34, 61,29,2,41,11,52,54,19, 56,28,1,44,47,27,0,35, 62,31,58,4,5,50,42,6, 16,40,13,53,55,7,46,17, 25,57,49,14,39,23,21,45, 8,48,38,10,18,37,36,26 };
0x783A9B23 lsb_64_table_0x783A9B23= { 63,30,3,32,25,41,22,33, 15,50,42,13,11,53,19,34, 61,29,2,51,21,43,45,10, 18,47,1,54,9,57,0,35, 62,31,40,4,49,5,52,26, 60,6,23,44,46,27,56,16, 7,39,48,24,59,14,12,55, 38,28,58,20,37,17,36,8 }; 测试中发现 0x782C8D4F 1111000000011001000110101001111 这个值在大量搜索边沿时候似乎有更好的稳定性和效率(10%),可能跟乘法器中间位稀疏有关
理论上四个魔法数应该具有相同的性能 他们都会被编译为 imul eax, eax, magicnum