FMInference / FlexGen

Running large language models on a single GPU for throughput-oriented scenarios.
Apache License 2.0

How do I match the results of profiling with the parameters of the cost model? #131

Open xvanQ opened 8 months ago

xvanQ commented 8 months ago

The output of profile bandwidth is as follows:

size: 0.25 MB, gpu-to-cpu bandwidth: 5.505 GB/s
size: 32.00 MB, gpu-to-cpu bandwidth: 13.220 GB/s
size: 128.00 MB, gpu-to-cpu bandwidth: 13.324 GB/s
size: 0.25 MB, cpu-to-gpu bandwidth: 4.556 GB/s
size: 32.00 MB, cpu-to-gpu bandwidth: 12.285 GB/s
size: 128.00 MB, cpu-to-gpu bandwidth: 12.251 GB/s

Which of these corresponds to ctog_bdw, which to gtoc_bdw_cache, and which to gtoc_bdw_hidden?
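For context, measurements like the ones above can be produced by timing repeated copies between device memory and pinned host memory. This is a minimal sketch in PyTorch, not FlexGen's actual profiling script; the function names and trial counts are illustrative assumptions.

```python
# Sketch of gpu<->cpu bandwidth profiling (NOT FlexGen's actual script;
# names and parameters are illustrative).
import time

def mb_to_num_floats(size_mb: float) -> int:
    """Number of float32 elements in a buffer of size_mb megabytes."""
    return int(size_mb * (1 << 20)) // 4

def profile_gpu_cpu_bandwidth(size_mb: float, trials: int = 10):
    """Return (gpu-to-cpu, cpu-to-gpu) bandwidth in GB/s. Requires CUDA."""
    import torch
    n = mb_to_num_floats(size_mb)
    gpu_buf = torch.empty(n, dtype=torch.float32, device="cuda")
    cpu_buf = torch.empty(n, dtype=torch.float32, pin_memory=True)

    # gpu-to-cpu: copy the device buffer into pinned host memory
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(trials):
        cpu_buf.copy_(gpu_buf)
    torch.cuda.synchronize()
    gtoc = size_mb / 1024 * trials / (time.perf_counter() - start)

    # cpu-to-gpu: copy the pinned host buffer onto the device
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(trials):
        gpu_buf.copy_(cpu_buf)
    torch.cuda.synchronize()
    ctog = size_mb / 1024 * trials / (time.perf_counter() - start)
    return gtoc, ctog

if __name__ == "__main__":
    try:
        import torch
        has_cuda = torch.cuda.is_available()
    except ImportError:
        has_cuda = False
    if has_cuda:
        for size in (0.25, 32.0, 128.0):
            gtoc, ctog = profile_gpu_cpu_bandwidth(size)
            print(f"size: {size:.2f} MB, gpu-to-cpu: {gtoc:.3f} GB/s, "
                  f"cpu-to-gpu: {ctog:.3f} GB/s")
```

Small transfers (0.25 MB) show much lower effective bandwidth than large ones because the per-copy launch overhead dominates, which matches the numbers in the output above.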

The output of profile matmul is as follows:

device: cuda, N: 1024, latency: 0.06 ms, TFLOPS: 68.186
device: cuda, N: 2048, latency: 0.20 ms, TFLOPS: 97.026
device: cpu, N: 1024, latency: 0.89 ms, TFLOPS: 3.488
device: cpu, N: 2048, latency: 8.44 ms, TFLOPS: 2.924

Which of these corresponds to mm_flops_p, mm_flops_g, bmm_flops_p, bmm_flops_g, and cpu_flops? Thanks.
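For context, the TFLOPS figures in the output above follow from the standard operation count of a dense matmul: multiplying two N x N matrices takes 2*N^3 floating-point operations, so TFLOPS = 2*N^3 / latency / 1e12. A minimal CPU-side sketch with NumPy (not FlexGen's actual profiler; names are illustrative):

```python
# Sketch of the matmul TFLOPS measurement (NOT FlexGen's actual profiler).
import time
import numpy as np

def matmul_flops(n: int) -> int:
    """Floating-point operations in an n x n by n x n matmul."""
    return 2 * n ** 3

def profile_matmul_cpu(n: int, trials: int = 3):
    """Return (latency in ms, TFLOPS) for an n x n float32 matmul."""
    a = np.random.rand(n, n).astype(np.float32)
    b = np.random.rand(n, n).astype(np.float32)
    a @ b  # warm-up so the first timed run is representative
    start = time.perf_counter()
    for _ in range(trials):
        a @ b
    latency = (time.perf_counter() - start) / trials
    tflops = matmul_flops(n) / latency / 1e12
    return latency * 1e3, tflops

if __name__ == "__main__":
    for n in (1024, 2048):
        ms, tflops = profile_matmul_cpu(n)
        print(f"device: cpu, N: {n}, latency: {ms:.2f} ms, TFLOPS: {tflops:.3f}")
```

Note that the cpu TFLOPS drops from N=1024 to N=2048 in the output above, which is why a cost model may fit separate throughput constants per device rather than a single number.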

nustart0720 commented 5 months ago

Have you figured this out? I have the same question.