IST-DASLab / marlin
FP16xINT4 LLM inference kernel that achieves near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens.
Apache License 2.0 · 575 stars · 45 forks
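The description above can be made concrete with a small sketch of what an FP16xINT4 GEMM computes: 4-bit integer weights are unpacked and dequantized with per-column scales, then multiplied against half-precision activations. This is a conceptual numpy illustration only; the packing layout (two signed 4-bit values per byte, low nibble first) and the function names here are assumptions for demonstration, not Marlin's actual weight format, and the real kernel fuses dequantization into a CUDA GEMM rather than materializing the FP16 weight matrix.

```python
import numpy as np

def unpack_int4(packed):
    """Unpack two signed 4-bit values from each uint8 (low nibble first).

    Illustrative layout only; Marlin uses its own packed format.
    """
    lo = (packed & 0x0F).astype(np.int8)
    hi = (packed >> 4).astype(np.int8)
    # Sign-extend the 4-bit nibbles from [0, 15] to [-8, 7].
    lo = np.where(lo >= 8, lo - 16, lo)
    hi = np.where(hi >= 8, hi - 16, hi)
    # Interleave low/high nibbles back into a (k, n) int matrix.
    return np.stack([lo, hi], axis=-1).reshape(packed.shape[0], -1)

def fp16xint4_matmul(a_fp16, b_packed, scales):
    """FP16 activations times INT4 weights: dequantize, then matmul.

    a_fp16:   (m, k) float16 activations
    b_packed: (k, n // 2) uint8, two int4 weights per byte
    scales:   (n,) float16 per-output-column dequantization scales
    """
    b = unpack_int4(b_packed).astype(np.float16) * scales
    return a_fp16 @ b
```

For example, a byte `0xE1` unpacks to the int4 pair `(1, -2)`; with unit activations and scales `[0.5, 1.0]` the dequantized weights are multiplied column-wise before the matmul.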
Issues
#36 · Marlin vs gguf · by blap · opened 2 weeks ago · 1 comment
#35 · comparison with FastTransformer for BatchSize=1 · by matrix97317 · opened 3 weeks ago · 0 comments
#34 · How to move the pointer of B and s in cuda kernel · by Eutenacity · opened 1 month ago · 0 comments
#33 · [SYCL] Add Marlin Kernel for SYCL runtime · by abhilash1910 · opened 1 month ago · 0 comments
#32 · comparison with SmoothQuant · by lxww302 · opened 2 months ago · 1 comment
#31 · Support w4a8 marlin gemm · by HandH1998 · opened 2 months ago · 0 comments
#30 · RuntimeError: CUDA error: an illegal instruction was encountered when running test.py · by MekkCyber · opened 3 months ago · 2 comments
#29 · Trying to understand the kernel · by MekkCyber · opened 3 months ago · 0 comments
#28 · Why must the linear input for Layer.pack "be of type `torch.half`"? · by Azure-Tang · opened 3 months ago · 4 comments
#27 · slight nondeterminism · by MichoChan · opened 3 months ago · 1 comment
#26 · performance · by MichoChan · opened 5 months ago · 0 comments
#25 · can't build marlin · by tulipdu955 · closed 5 months ago · 1 comment
#24 · Server, TGI and/or vLLM Support · by RonanKMcGovern · closed 5 months ago · 7 comments
#23 · Issues generating tokens after "get_llama_marlin" · by HaoWuSR · opened 5 months ago · 0 comments
#22 · a_sh_rd_delta_o · by Lenan22 · opened 5 months ago · 0 comments
#21 · Marlin slower than fp16 on larger batches · by mobicham · opened 5 months ago · 2 comments
#20 · Questions about matrix A's layout in shared memory · by HandH1998 · opened 5 months ago · 0 comments
#19 · questions about slice_col_par · by Lenan22 · opened 5 months ago · 2 comments
#18 · [QST] Weight Format & GEMM · by jeromeku · opened 6 months ago · 2 comments
#17 · groupsize=64 is not supported · by jameswu2014 · opened 6 months ago · 1 comment
#16 · Do you have any plan to support MoE GEMM? · by LinHR000 · opened 6 months ago · 0 comments
#15 · Small typo in the shape description (Fixing Issue #14) · by saurabhdash · opened 7 months ago · 0 comments
#14 · Small typo in the shape description · by saurabhdash · opened 7 months ago · 0 comments
#13 · Fix shape check · by fxmarty · closed 7 months ago · 0 comments
#12 · Packing order (`_perm` and `_scale_perm`) · by fxmarty · closed 7 months ago · 5 comments
#11 · can this support lower bit quant? · by vince62s · opened 8 months ago · 3 comments
#10 · Gemm optimizations · by efrantar · closed 8 months ago · 0 comments
#9 · Fix a small typo in annotation · by MARD1NO · closed 8 months ago · 0 comments
#8 · Where does the code use "immediate eviction" and "fetched from L2 cache"? · by ziyuhuang123 · opened 8 months ago · 2 comments
#7 · Support for Hopper H100 · by rosario-purple · opened 8 months ago · 3 comments
#6 · [Bug] H800 run UT failed · by Ageliss · opened 8 months ago · 3 comments
#5 · Does Marlin support zero-point quantization? · by casper-hansen · opened 8 months ago · 7 comments
#4 · Turing support · by Dampfinchen · opened 8 months ago · 1 comment
#3 · Update README.md · by eltociear · closed 8 months ago · 1 comment
#2 · Open: optimize for GEMM regime · by fxmarty · closed 7 months ago · 7 comments
#1 · added conversion script and example · by robertgshaw2-neuralmagic · opened 8 months ago · 2 comments