-
DeepSpeed Chat uses tensor parallelism via the hybrid engine to generate sequences in stage 3 training.
I wonder whether just using ZeRO-3 inference for generation would be OK, so that we don't need to transform model pa…
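The two options can be contrasted with a minimal sketch of the actor's DeepSpeed config. This assumes DeepSpeed's JSON config keys `zero_optimization.stage` and `hybrid_engine.enabled` / `hybrid_engine.inference_tp_size`. The helper `actor_ds_config` and all values are hypothetical illustrations, not DeepSpeed Chat's own code.

```python
# Hedged sketch: ZeRO-3 actor config with or without the hybrid engine.
# Key names follow DeepSpeed's JSON config; the values are illustrative assumptions.

def actor_ds_config(use_hybrid_engine: bool, inference_tp_size: int = 1) -> dict:
    """Build a minimal ZeRO-3 config (hypothetical helper)."""
    config = {
        "train_micro_batch_size_per_gpu": 1,
        "zero_optimization": {"stage": 3},
        "bf16": {"enabled": True},
    }
    if use_hybrid_engine:
        # Hybrid engine: generation switches to inference kernels,
        # optionally tensor-parallel across inference_tp_size GPUs.
        config["hybrid_engine"] = {
            "enabled": True,
            "inference_tp_size": inference_tp_size,
        }
    else:
        # Plain ZeRO-3: generation runs through the training module, which
        # all-gathers each layer's partitioned parameters on every forward
        # pass (i.e. for every generated token), with no TP re-layout.
        config["hybrid_engine"] = {"enabled": False}
    return config
```

With DeepSpeed installed, either dict would be passed as `deepspeed.initialize(model=..., config=cfg)`. The trade-off is that plain ZeRO-3 skips the parameter re-layout but pays a per-layer all-gather at every decoding step.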
-
### Question Validation
- [X] I have searched both the documentation and discord for an answer.
### Question
Hi there!
**Background:**
```python
from llama_index.core import VectorStoreInde…
```
-
Currently, the GBS blows up to thousands if the MBS is more than 1, which is counterproductive for training. And as clusters become larger and training needs to happen faster, this is becoming more and …
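The arithmetic behind this can be sketched as follows. It assumes the usual relation GBS = MBS × gradient-accumulation steps × data-parallel size, where with tensor or pipeline parallelism only the data-parallel dimension multiplies. The function names and the 1024-replica cluster are hypothetical.

```python
def global_batch_size(micro_batch_size: int, grad_accum_steps: int, dp_size: int) -> int:
    """Samples per optimizer step: each data-parallel rank processes
    micro_batch_size samples for each of grad_accum_steps micro-steps."""
    return micro_batch_size * grad_accum_steps * dp_size


def min_global_batch_size(micro_batch_size: int, dp_size: int) -> int:
    """Smallest reachable GBS, i.e. with no gradient accumulation at all."""
    return global_batch_size(micro_batch_size, 1, dp_size)


if __name__ == "__main__":
    # Hypothetical cluster with 1024 data-parallel replicas.
    for mbs in (1, 2, 4):
        print(mbs, min_global_batch_size(mbs, dp_size=1024))
```

Even without accumulation, going from MBS 1 to MBS 2 on 1024 replicas doubles the floor from 1024 to 2048 samples per step.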
-
Hi, I'm training a transformer model with Hybrid Sharded Data Parallelism. This setup is similar to FSDP/ZeRO-3, where params are all-gathered for each layer's forward/backward pass and dropped afterwards. …
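The gather-then-drop pattern can be illustrated with a toy, single-process simulation of one sharding group. In PyTorch FSDP's `HYBRID_SHARD` strategy, for example, parameters are sharded within a group (typically a node) and replicated across groups. The helpers `shard`, `all_gather` and `sharded_forward` are hypothetical, and the "layer" is a placeholder multiply rather than real communication or compute.

```python
from typing import List, Tuple


def shard(params: List[float], group_size: int) -> List[List[float]]:
    """Split one layer's flat parameters into equal per-rank shards, zero-padded."""
    per_rank = -(-len(params) // group_size)  # ceiling division
    padded = params + [0.0] * (per_rank * group_size - len(params))
    return [padded[r * per_rank:(r + 1) * per_rank] for r in range(group_size)]


def all_gather(shards: List[List[float]], numel: int) -> List[float]:
    """Simulated all-gather: concatenate every rank's shard and strip padding."""
    return [x for s in shards for x in s][:numel]


def sharded_forward(x: float, layers: List[Tuple[List[List[float]], int]]) -> Tuple[float, int]:
    """Run layers one at a time, materializing only the current layer's full params.

    Returns the output and the peak count of unsharded parameters held at once.
    """
    peak = 0
    for layer_shards, numel in layers:
        full = all_gather(layer_shards, numel)  # gather just before use
        peak = max(peak, len(full))
        x = x * sum(full)  # stand-in for the layer's real computation
        del full  # drop the gathered copy; only the shards stay resident
    return x, peak


layer_params = [[1.0, 2.0, 3.0], [0.5, 0.5]]
layers = [(shard(p, 2), len(p)) for p in layer_params]
print(sharded_forward(1.0, layers))  # (6.0, 3)
```

The peak of 3 equals the largest single layer, not the 5 parameters of the whole model. This memory property is why per-layer gathering is used despite its communication cost.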
-
Plan/progress for SimpleSSD version 2.1 in our internal repo.
Version 2.1 is fully event-driven (v2.0 is a functional simulator, except for the HIL).
**SimpleSSD**
- [ ] Revise all source code
- [ ] Host…
-
ducttape-0.3 defaults to depth-first traversal of the realization graph in order to try different kinds of tasks quickly (and fail fast). But when the user elects to run multiple processes, this orderi…
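The difference between the two orders can be seen on a small hypothetical task graph. This is not ducttape's implementation. The graph is tree-shaped, so a preorder walk already respects dependencies, whereas a general DAG would also need every parent to run before a child.

```python
from collections import deque
from typing import Dict, List

# Hypothetical task graph: each task maps to the tasks that consume its output.
GRAPH: Dict[str, List[str]] = {
    "prep": ["train_a", "train_b"],
    "train_a": ["eval_a"],
    "train_b": ["eval_b"],
    "eval_a": [],
    "eval_b": [],
}


def depth_first(graph: Dict[str, List[str]], root: str) -> List[str]:
    """Preorder DFS: follow one branch to its end before starting the next."""
    order, stack, seen = [], [root], set()
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        order.append(node)
        stack.extend(reversed(graph[node]))  # so the first child is popped first
    return order


def breadth_first(graph: Dict[str, List[str]], root: str) -> List[str]:
    """BFS: finish each level of the graph before going deeper."""
    order, queue, seen = [], deque([root]), {root}
    while queue:
        node = queue.popleft()
        order.append(node)
        for child in graph[node]:
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return order


print(depth_first(GRAPH, "prep"))    # ['prep', 'train_a', 'eval_a', 'train_b', 'eval_b']
print(breadth_first(GRAPH, "prep"))  # ['prep', 'train_a', 'train_b', 'eval_a', 'eval_b']
```

Depth-first reaches a downstream task (`eval_a`) on the third step, so a broken evaluation stage surfaces early. Breadth-first instead lines up sibling tasks such as `train_a` and `train_b` next to each other.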
-
Hello,
I'm new to the cats project and to open source in general. cats is a great project, and I've learnt a lot from it :)
I've been exploring free monads recently and my understanding is that we can't expres…
-
Hi All,
I've pushed the Shared Memory Parallelism commits to master. This will allow radiation scattering, photosynthesis and thermodynamics of different patches to be split across different CPU …
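The idea of splitting independent per-patch work across workers that share memory can be sketched with a thread pool. The `Patch` fields, `update_patch` and its formulas are hypothetical placeholders, not the model's physics. In CPython the GIL prevents pure-Python arithmetic from actually running in parallel, so this shows only the work-splitting structure, which compiled code would typically express as a parallel loop over patches.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import List


@dataclass
class Patch:
    # Hypothetical per-patch state; a real model carries far more fields.
    leaf_area: float
    air_temp_c: float
    absorbed_rad: float = 0.0
    carbon_gain: float = 0.0


def update_patch(p: Patch) -> Patch:
    """Toy radiation -> photosynthesis -> thermodynamics sequence for one patch.
    The coefficients are placeholders, not real physics."""
    p.absorbed_rad = 0.5 * p.leaf_area       # stand-in for radiation scattering
    p.carbon_gain = 0.1 * p.absorbed_rad     # stand-in for photosynthesis
    p.air_temp_c += 0.01 * p.absorbed_rad    # stand-in for thermodynamics
    return p


def step_all(patches: List[Patch], workers: int) -> List[Patch]:
    """Patches are independent, so each can be handed to a different worker.
    Threads share the Patch objects in memory; each patch is written by exactly
    one task, so no locking is needed."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(update_patch, patches))


result = step_all([Patch(2.0, 20.0), Patch(4.0, 25.0)], workers=2)
print([round(p.absorbed_rad, 3) for p in result])  # [1.0, 2.0]
```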
-
### 🐛 Describe the bug
raise RuntimeError(
RuntimeError: Failed to replace input_layernorm of type LlamaRMSNorm with FusedRMSNorm with the exception: Please install apex from source (https://gith…
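The error indicates that a fused-norm replacement was requested while apex was not importable. The message's own fix is to install apex from source. Otherwise, the fused option can be gated on apex actually being present. Below is a minimal sketch of such a guard, with `apex_available` and `pick_norm` as hypothetical helpers, since the framework's real option name is not shown in the report. It also includes a pure-Python reference of the RMSNorm formula that both `LlamaRMSNorm` and a fused kernel compute.

```python
import importlib.util
import math
from typing import List


def apex_available() -> bool:
    """True if the `apex` package is importable (checked without importing it)."""
    return importlib.util.find_spec("apex") is not None


def pick_norm(want_fused: bool) -> str:
    """Hypothetical guard: only choose the fused kernel when apex is installed."""
    return "FusedRMSNorm" if want_fused and apex_available() else "LlamaRMSNorm"


def rms_norm(x: List[float], weight: List[float], eps: float = 1e-6) -> List[float]:
    """Reference RMSNorm: x / sqrt(mean(x^2) + eps), scaled elementwise by weight."""
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v * inv_rms for v, w in zip(x, weight)]


print(pick_norm(want_fused=True))
print(rms_norm([3.0, 4.0], [1.0, 1.0], eps=0.0))
```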
-
When we query Pinot with "with clause" subqueries (see attached query1), the "with clause" subqueries are not pushed down into Pinot, which is bad for performance. If we remove the with clause (see attached query2), it can…