IST-DASLab / marlin
FP16xINT4 LLM inference kernel that achieves near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens.
Apache License 2.0 · 575 stars · 45 forks
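The description above can be made concrete with a small sketch of what an FP16xINT4 GEMM computes: 4-bit integer weights are unpacked and dequantized with per-column scales, then multiplied against half-precision activations. This is a conceptual numpy illustration only; the packing layout (two signed 4-bit values per byte, low nibble first) and the function names here are assumptions for demonstration, not Marlin's actual weight format, and the real kernel fuses dequantization into a CUDA GEMM rather than materializing the FP16 weight matrix.

```python
import numpy as np

def unpack_int4(packed):
    """Unpack two signed 4-bit values from each uint8 (low nibble first).

    Illustrative layout only; Marlin uses its own packed format.
    """
    lo = (packed & 0x0F).astype(np.int8)
    hi = (packed >> 4).astype(np.int8)
    # Sign-extend the 4-bit nibbles from [0, 15] to [-8, 7].
    lo = np.where(lo >= 8, lo - 16, lo)
    hi = np.where(hi >= 8, hi - 16, hi)
    # Interleave low/high nibbles back into a (k, n) int matrix.
    return np.stack([lo, hi], axis=-1).reshape(packed.shape[0], -1)

def fp16xint4_matmul(a_fp16, b_packed, scales):
    """FP16 activations times INT4 weights: dequantize, then matmul.

    a_fp16:   (m, k) float16 activations
    b_packed: (k, n // 2) uint8, two int4 weights per byte
    scales:   (n,) float16 per-output-column dequantization scales
    """
    b = unpack_int4(b_packed).astype(np.float16) * scales
    return a_fp16 @ b
```

For example, a byte `0xE1` unpacks to the int4 pair `(1, -2)`; with unit activations and scales `[0.5, 1.0]` the dequantized weights are multiplied column-wise before the matmul.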
Issues
#36 · Marlin vs gguf · by blap · opened 2 weeks ago · 1 comment
#35 · comparison with FastTransformer for BatchSize=1 · by matrix97317 · opened 3 weeks ago · 0 comments
#34 · How to move the pointer of B and s in cuda kernel · by Eutenacity · opened 1 month ago · 0 comments
#33 · [SYCL] Add Marlin Kernel for SYCL runtime · by abhilash1910 · opened 1 month ago · 0 comments
#32 · comparison with SmoothQuant · by lxww302 · opened 2 months ago · 1 comment
#31 · Support w4a8 marlin gemm · by HandH1998 · opened 2 months ago · 0 comments
#30 · RuntimeError: CUDA error: an illegal instruction was encountered when running test.py · by MekkCyber · opened 3 months ago · 2 comments
#29 · Trying to understand the kernel · by MekkCyber · opened 3 months ago · 0 comments
#28 · Why must the linear input for Layer.pack "be of type `torch.half`"? · by Azure-Tang · opened 3 months ago · 4 comments
#27 · slight nondeterminism · by MichoChan · opened 3 months ago · 1 comment
#26 · performance · by MichoChan · opened 5 months ago · 0 comments
#25 · can't build marlin · by tulipdu955 · closed 5 months ago · 1 comment
#24 · Server, TGI and/or vLLM Support · by RonanKMcGovern · closed 5 months ago · 7 comments
#23 · Issues generating tokens after "get_llama_marlin" · by HaoWuSR · opened 5 months ago · 0 comments
#22 · a_sh_rd_delta_o · by Lenan22 · opened 5 months ago · 0 comments
#21 · Marlin slower than fp16 on larger batches · by mobicham · opened 5 months ago · 2 comments
#20 · Questions about matrix A's layout in shared memory · by HandH1998 · opened 5 months ago · 0 comments
#19 · questions about slice_col_par · by Lenan22 · opened 5 months ago · 2 comments
#18 · [QST] Weight Format & GEMM · by jeromeku · opened 6 months ago · 2 comments
#17 · groupsize=64 is not supported · by jameswu2014 · opened 6 months ago · 1 comment
#16 · Do you have any plan to support MoE GEMM? · by LinHR000 · opened 6 months ago · 0 comments
#15 · Small typo in the shape description (Fixing Issue #14) · by saurabhdash · opened 7 months ago · 0 comments
#14 · Small typo in the shape description · by saurabhdash · opened 7 months ago · 0 comments
#13 · Fix shape check · by fxmarty · closed 7 months ago · 0 comments
#12 · Packing order (`_perm` and `_scale_perm`) · by fxmarty · closed 7 months ago · 5 comments
#11 · can this support lower bit quant? · by vince62s · opened 8 months ago · 3 comments
#10 · Gemm optimizations · by efrantar · closed 8 months ago · 0 comments
#9 · Fix a small typo in annotation · by MARD1NO · closed 8 months ago · 0 comments
#8 · Where does the code use "immediate eviction" and "fetched from L2 cache"? · by ziyuhuang123 · opened 8 months ago · 2 comments
#7 · Support for Hopper H100 · by rosario-purple · opened 8 months ago · 3 comments
#6 · [Bug] H800 run UT failed · by Ageliss · opened 8 months ago · 3 comments
#5 · Does Marlin support zero-point quantization? · by casper-hansen · opened 8 months ago · 7 comments
#4 · Turing support · by Dampfinchen · opened 8 months ago · 1 comment
#3 · Update README.md · by eltociear · closed 8 months ago · 1 comment
#2 · Open: optimize for GEMM regime · by fxmarty · closed 7 months ago · 7 comments
#1 · added conversion script and example · by robertgshaw2-neuralmagic · opened 8 months ago · 2 comments