-
### Version
incoming trace version - 2.11.0, outgoing trace version - unknown
### Steps to Reproduce
https://sentry.sentry.io/api/0/projects/sentry/sentry/events/6122ec07d3694b7ba1aa88db962afde5/js…
-
Hello team, I'm new to PyCylon and have an issue related to distributed sorting.
It seems that an empty dataframe in one process triggers an exception when I perform sort_values in a distributed way.…
-
Each node currently caches tracking data in memory.
Adding Redis would let us track "is this cached" at a higher level, across nodes.
At a basic level this would be a direct repla…
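
A minimal sketch of the idea above, assuming a shared key-value store that exposes `get` and `set`. The names `SharedStore`, `InMemorySharedStore`, and `NodeCache` are hypothetical and only illustrate the shape; the in-memory stand-in is there so the example runs without a Redis server, and a real Redis client would need its own handling of connection errors, expiry, and return types.

```python
from typing import Dict, Optional, Protocol


class SharedStore(Protocol):
    """Hypothetical interface for the cross-node store (e.g. backed by Redis)."""

    def get(self, key: str) -> Optional[str]: ...

    def set(self, key: str, value: str) -> None: ...


class InMemorySharedStore:
    """Stand-in for a shared store so the sketch runs without a server."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)

    def set(self, key: str, value: str) -> None:
        self._data[key] = value


class NodeCache:
    """Per-node in-memory cache that also records entries in a shared store."""

    def __init__(self, shared: SharedStore) -> None:
        self._local: Dict[str, str] = {}
        self._shared = shared

    def put(self, key: str, value: str) -> None:
        # Write locally and publish to the shared store so other nodes can see it.
        self._local[key] = value
        self._shared.set(key, value)

    def get(self, key: str) -> Optional[str]:
        # Check the local cache first, then fall back to the shared store.
        if key in self._local:
            return self._local[key]
        value = self._shared.get(key)
        if value is not None:
            self._local[key] = value  # remember it locally for next time
        return value

    def is_cached_anywhere(self, key: str) -> bool:
        # "Is this cached" answered across nodes, not just for this one.
        return key in self._local or self._shared.get(key) is not None


shared = InMemorySharedStore()
node_a = NodeCache(shared)
node_b = NodeCache(shared)
node_a.put("track:1", "seen")
print(node_b.is_cached_anywhere("track:1"))
```

Here `node_b` never stored the key itself, yet it can tell the entry is cached because `node_a` published it to the shared store.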
-
### 🐛 Describe the bug
Calling `.generate` on a HuggingFace model that has been FSDP-wrapped results in an error. I was able to work around this error by summoning full params without recurse, which …
-
### Before Creating the Bug Report
- [X] I found a bug, not just asking a question, which should be created in [GitHub Discussions](https://github.com/apache/rocketmq/discussions).
- [X] I have …
-
Hi, following up on my previous issue ([here](https://github.com/carla-simulator/scenario_runner/issues/1090#issue-2374650915)) and on the issue opened by another community member [here](https://gith…
-
I am trying to fine-tune the `lmsys/vicuna-7b-v1.3` model.
I have a server with 8 NVIDIA RTX A4500 GPUs (20 GB each), so about 160 GB of GPU memory in total.
When I try to train with `mem` I get an OOM error in the middle o…
-
Just wondering if you guys have ever considered adding support for distributed local computation (for Unreal Engine users) through Unreal Swarm, since the system is apparently work-agnostic.
-
#### Is your feature request related to a problem? Please describe.
cluster_name is used for validating the GEM license. Right now, the Release.Name value is used to set the cluster_name in the valu…
-
Thanks for your excellent work. If I launch with 'python -m torch.distributed.launch --nproc_per_node 2 --master_port 12345 main.py', how can I debug it step by step?