-
# Describe the bug
I hit this problem when running an RPC program built on the PyTorch framework inside Occlum.
root@tee-node35:/home/llm/RpcLLM/occlum_instance# occlum exec /bin/python3 /src/infer.py --r…
-
Hi
Thanks for your hard work on this project. I just wanted to raise a couple of issues with the plugins on Linux.
The Linux LV2 is distributed as a bare .so file. LV2 plugins are bundles contai…
-
I have access to an HPC cluster with multiple nodes that each have two GPUs. As I want to do computations that require access to the memory of many GPUs, I was looking into the [multi-host](https://jax.re…
-
### System Info
```
- `Accelerate` version: 1.0.1
- Platform: Linux-6.10.4-linuxkit-aarch64-with-glibc2.35
- `accelerate` bash location: /usr/local/bin/accelerate
- Python version: 3.10.12
- N…
-
When trying to use the `ghcr.io/dask/dask-notebook:dev-py3.12` container with Coiled Notebooks it fails to start up. Am I doing something wrong? Do we need to fix something in the Dask container image…
-
Our customers use TigerGraph, not Neo4j. This is because TigerGraph is a distributed graph, and can support queries over multiple servers. We want Med-Graph-RAG to work on existing healthcare graphs …
-
Currently torchao QAT has two APIs, [tensor subclasses](https://github.com/pytorch/ao/blob/a4221df5e10ff8c33854f964fe6b4e00abfbe542/torchao/quantization/prototype/qat/api.py#L41) and [module swap](htt…
-
### 🐛 Describe the bug
Hello, I am working on a project where I need to use multiple consecutive instances of DistributedDataParallel (DDP) within the same torch.distributed environment. In my scen…
-
The unmanaged memory usage is high when I read files and process the data. After manually triggering the memory trimming function, the unmanaged memory usage decreases significantly.
```
import…
-
I'm using the torch.distributed.rpc package to work on a distributed training POC. Currently I'm seeing that the rpc package itself uses pickle, and pickle does not work well with some Python features like generator…
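
To make the pickle limitation concrete, here is a minimal sketch using only the standard library. It does not call torch.distributed.rpc at all; `count_up` and `try_pickle` are hypothetical helper names made up for this illustration. It only shows that a generator object cannot be pickled, while the materialized values can.

```python
import pickle


def count_up(n):
    """Hypothetical generator function used only for this illustration."""
    for i in range(n):
        yield i


def try_pickle(obj):
    """Return True if obj can be serialized with pickle, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (TypeError, pickle.PicklingError):
        return False


# A live generator object holds a suspended frame, which pickle rejects.
generator_ok = try_pickle(count_up(3))

# Materializing the values into a list gives pickle plain data to serialize.
list_ok = try_pickle(list(count_up(3)))

print("generator picklable:", generator_ok)
print("list picklable:", list_ok)
```

Materializing into a list is only one possible workaround, and it assumes the sequence is finite and small enough to hold in memory; whether it fits the RPC use case described here is not something the excerpt settles.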