[TIR][DLight] Enable SimdGroup op for Metal

This PR adds simdgroup intrinsics, along with tensorization support for them in the DLight rules, for the Metal backend.
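For readers unfamiliar with the term, tensorization here means rewriting the inner loops of a matmul so that each small tile is handled by one hardware matrix operation. The sketch below emulates that idea in plain Python and is not the code from this PR. It assumes 8x8 tiles (the simdgroup matrix shape Metal exposes), and `tile_mma`, `load_tile`, and `tiled_matmul` are hypothetical names that are not part of TVM.

```python
TILE = 8  # assumption: simdgroup matrices are 8x8 tiles


def tile_mma(a_tile, b_tile, c_tile):
    """Accumulate a_tile @ b_tile into c_tile in place.

    Stands in for a single simdgroup multiply-accumulate on one 8x8 tile.
    """
    for i in range(TILE):
        for j in range(TILE):
            acc = c_tile[i][j]
            for k in range(TILE):
                acc += a_tile[i][k] * b_tile[k][j]
            c_tile[i][j] = acc


def load_tile(matrix, row, col):
    """Copy the 8x8 tile whose top-left corner is (row, col)."""
    return [matrix[row + i][col:col + TILE] for i in range(TILE)]


def tiled_matmul(a, b):
    """Multiply a (N x K) by b (K x M), one 8x8 tile at a time."""
    n, k_dim, m_dim = len(a), len(b), len(b[0])
    # Hypothetical restriction for the sketch: all dims are multiples of 8.
    assert n % TILE == 0 and k_dim % TILE == 0 and m_dim % TILE == 0
    c = [[0.0] * m_dim for _ in range(n)]
    for i0 in range(0, n, TILE):
        for j0 in range(0, m_dim, TILE):
            # Accumulator tile, zero-filled before the reduction loop.
            c_tile = [[0.0] * TILE for _ in range(TILE)]
            for k0 in range(0, k_dim, TILE):
                tile_mma(load_tile(a, i0, k0), load_tile(b, k0, j0), c_tile)
            # Store the finished tile back into the output matrix.
            for i in range(TILE):
                c[i0 + i][j0:j0 + TILE] = c_tile[i]
    return c
```

On the GPU, the three loops inside `tile_mma` collapse into one cooperative instruction executed by a simdgroup, which is where the gain would come from.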

Performance comparison on a kernel from Llama-3 8B:

Before

batch_size: 16
Time (ms): 3.42    Total Bytes (MB): 225.00    Memory (GB/s): 64.17    GFLOP/s: 1022.11
batch_size: 32
Time (ms): 4.69    Total Bytes (MB): 226.00    Memory (GB/s): 47.07    GFLOP/s: 1492.82
batch_size: 64
Time (ms): 8.92    Total Bytes (MB): 228.00    Memory (GB/s): 24.96    GFLOP/s: 1569.52

After

batch_size: 16
Time (ms): 1.69    Total Bytes (MB): 225.00    Memory (GB/s): 130.40    GFLOP/s: 2077.10
batch_size: 32
Time (ms): 2.05    Total Bytes (MB): 226.00    Memory (GB/s): 107.78    GFLOP/s: 3418.39
batch_size: 64
Time (ms): 3.96    Total Bytes (MB): 228.00    Memory (GB/s): 56.17    GFLOP/s: 3531.63
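
To make the comparison easier to read, the speedup per batch size can be derived from the reported times. The snippet below is a small sketch that uses only the "Time (ms)" values listed above; the `speedups` helper is a hypothetical name introduced for illustration.

```python
# "Time (ms)" values copied from the Before and After runs above,
# keyed by batch size.
before_ms = {16: 3.42, 32: 4.69, 64: 8.92}
after_ms = {16: 1.69, 32: 2.05, 64: 3.96}


def speedups(before, after):
    """Return before/after time ratios, keyed by batch size."""
    return {bs: before[bs] / after[bs] for bs in before}


for bs, ratio in speedups(before_ms, after_ms).items():
    print(f"batch_size {bs}: {ratio:.2f}x")
# prints:
# batch_size 16: 2.02x
# batch_size 32: 2.29x
# batch_size 64: 2.25x
```

By these numbers the kernel runs roughly 2.0x to 2.3x faster across the three batch sizes.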

apache/tvm#17112