IST-DASLab / marlin

FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens.
Apache License 2.0

a_sh_rd_delta_o #22

Open Lenan22 opened 5 months ago

Lenan22 commented 5 months ago

constexpr int a_sh_rd_delta_o = 2 * ((threads / 32) / (thread_n_blocks / 4));

  1. Does the 32 here refer to a warp?
  2. What does the 4 here mean?
  3. What does the 2 here mean?
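To make the arithmetic in the expression concrete, here is a minimal sketch that evaluates it at compile time for a few sample configurations. The helper function and the sample values (`threads = 256`, `thread_n_blocks` of 16, 8, and 4) are assumptions chosen for illustration, not values confirmed in this thread. The only fact relied on is that a CUDA warp has 32 threads, so `threads / 32` gives the number of warps in the block. The sketch does not settle what the constants 4 and 2 represent.

```cpp
// Hypothetical helper that mirrors the expression from the question.
// The argument values checked below are illustrative assumptions.
constexpr int a_sh_rd_delta_o(int threads, int thread_n_blocks) {
  // threads / 32 is the number of 32-thread warps in the thread block.
  // Every division here is an integer division, as in the original line.
  return 2 * ((threads / 32) / (thread_n_blocks / 4));
}

// 256 threads = 8 warps.
static_assert(a_sh_rd_delta_o(256, 16) == 4,  "8 / (16 / 4) = 2, times 2");
static_assert(a_sh_rd_delta_o(256, 8)  == 8,  "8 / (8 / 4)  = 4, times 2");
static_assert(a_sh_rd_delta_o(256, 4)  == 16, "8 / (4 / 4)  = 8, times 2");
```

Because the divisions truncate, a `thread_n_blocks` below 4 would make the inner divisor zero, so the expression only makes sense when `thread_n_blocks` is at least 4.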