-
Recently an issue was raised in the SAFE Dojo repo which ultimately turned out to be the result of an error message being a bit buried.
Link: https://github.com/CompositionalIT/SAFE-Dojo/issues/185
The er…
-
## Motivation
There is significant interest in vLLM supporting encoder/decoder models. Issues #187 and #180, for example, request encoder/decoder model support. As a result, encoder/decoder supp…
-
**Describe the bug**
When attempting to shard a `gemma_2b_en` model across two (consumer-grade) GPUs, I get:
```
ValueError: One of device_put args was given the sharding of NamedSharding(mesh=…
```
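For context, one common way to express this kind of two-GPU sharding is the Keras 3 distribution API on the JAX backend. The sketch below is an assumption about the setup (the reporter's actual script is not shown); the mesh shape, the `token_embedding/embeddings` layout key, and the `gemma_2b_en` preset call are illustrative.

```python
# Sketch only: assumes a recent Keras 3 release on the JAX backend plus keras_hub.
# This is not the reporter's script; names and shapes are illustrative.
import keras
import keras_hub

# Build a 1x2 mesh over the two local GPUs: replicate over "batch", shard over "model".
devices = keras.distribution.list_devices("gpu")  # expects the two GPUs here
mesh = keras.distribution.DeviceMesh(
    shape=(1, 2), axis_names=("batch", "model"), devices=devices
)

# Shard the large embedding table along the "model" axis; unlisted weights stay replicated.
layout_map = keras.distribution.LayoutMap(mesh)
layout_map["token_embedding/embeddings"] = (None, "model")

# Older Keras 3 releases took a positional device_mesh argument here instead.
keras.distribution.set_distribution(
    keras.distribution.ModelParallel(layout_map=layout_map)
)

model = keras_hub.models.GemmaCausalLM.from_preset("gemma_2b_en")
print(model.generate("Hello", max_length=32))
```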
-
Config:
Windows 10 with an RTX 4090
All requirements installed, incl. the flash-attn build
Server:
```
(venv) D:\PythonProjects\hertz-dev>python inference_server.py
Using device: cuda
Loaded tokeniz…
-
Hi @JWFanggit,
This is excellent work, and I am currently using it. I would like to ask how the driver attention map in the dataset was generated in advance. Which open-source model was used?
-
It's trying to load and never completes:
```
Removing download task for Shard(model_id='llama-3.2-1b', start_layer=0, end_layer=15, n_layers=16): True
0%| …
-
### 🐛 Describe the bug
I'm trying to add a micro-benchmark for flex attention, which is implemented as a HOP. I use `torch.utils.flop_counter.FlopCounterMode`, but it doesn't support capturing FLOP f…
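As a point of reference, here is a minimal sketch of the kind of measurement being attempted, assuming a recent PyTorch (2.5+) where `flex_attention` is importable; the shapes are illustrative:

```python
import torch
from torch.nn.attention.flex_attention import flex_attention
from torch.utils.flop_counter import FlopCounterMode

device = "cuda" if torch.cuda.is_available() else "cpu"

# Illustrative shapes: (batch, heads, seq_len, head_dim).
q, k, v = (torch.randn(1, 8, 128, 64, device=device) for _ in range(3))

# FlopCounterMode attributes FLOPs via per-op formulas (matmul, SDPA, ...);
# flex_attention dispatches through a higher-order op, which is what the
# report says is not covered by the counter.
with FlopCounterMode(display=True) as flop_counter:
    flex_attention(q, k, v)

print(flop_counter.get_total_flops())
```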
-
I don't know if this affects anything. When I generate, I get this:
clip missing: ['clip_l.logit_scale', 'clip_l.transformer.text_projection.weight']
Loading 1 new model
C:\Users\heruv\ComfyUI\comfy\ld…
-
### Is your feature request related to a problem? Please describe.
The current implementation causes issues when loading old model checkpoints during inference as it is not clear whether flash attent…
-
It's normal for people to make and release custom builds for projects that don't provide any pre-built binaries, or don't provide pre-built Windows binaries (i.e., only the source code or ELF binaries are provided,…