-
**Describe the bug**
When running GemmRS on two nodes, each with 4 A100 80GB GPUs connected via NVLink, I hit the following issue. Each node has 1 NIC connected to IB HDR200.
```
W0907 22:34:09.000000 22438061766464 torch/distributed/run.py…
```
-
I was looking through a [flaky test report](https://github.com/dask/distributed/runs/6215173657?check_suite_focus=true) and saw this:
```python-traceback
--------------------------- Subprocess s…
```
-
Hello, when running main_finetune.py to line 238:
```python
for param in fsdp_ignored_parameters:
    dist.broadcast(param.data, src=dist.get_global_rank(fs_init.get_data_parallel_group(), 0),
…
```
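For context, here is a minimal single-process sketch of the broadcast pattern above. It uses a one-rank `gloo` group as a stand-in for the real FSDP data-parallel group, and `broadcast_ignored_params` is a hypothetical helper, not a function from main_finetune.py:

```python
import os
import torch
import torch.distributed as dist

def broadcast_ignored_params(params, src_rank=0):
    # Broadcast each ignored parameter's data from the source rank
    # so every rank starts from identical values.
    for p in params:
        dist.broadcast(p.data, src=src_rank)

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29513")
# One-rank gloo group: the broadcast is effectively a no-op here,
# but it exercises the same call path as the multi-node run.
dist.init_process_group("gloo", rank=0, world_size=1)
params = [torch.nn.Parameter(torch.zeros(4))]
broadcast_ignored_params(params)
result = params[0].data.tolist()
dist.destroy_process_group()
```

In the real script the source rank comes from `dist.get_global_rank(...)` over the data-parallel group; the fixed `src_rank=0` here is just for the single-process sketch.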
-
## 🔨Work Item
**IMPORTANT:**
* This template is only for the dev team to track project progress. For feature requests or bug reports, please use the corresponding issue templates.
* DO NOT create a new…
-
Work out how to make debugging easier when the tests are distributed across hosts.
-
Hi authors!
This is nice work, and congratulations on the CVPR24 acceptance!
I managed to deploy it on my machine and tested it on some data from the test set, and the pretrained model gave me amaz…
-
### Describe the bug
The long-lived circuits of Blazor Server cause distributed tracing not to work as expected.
Since each circuit is effectively a long-lived request ... a lot of *activity* (pun i…
-
I updated Firefox to v132 b1 today and the sidebar doesn't expand on mouse hover.
I rolled back to v131, and it still works there.
-
Hello,
I have executed the following command for training purposes.
`python -m torch.distributed.run --nproc_per_node=1 --master_port=2333 tools/train.py projects/configs/VAD/VAD_tiny_stage_1.py --…
-
I ran into this situation when training AllSpark on 2 RTX 3090s. I have tried many approaches, such as increasing the `timeout` of init_process_group, increasing NCCL_BUFFSIZE, and setting NCCL_P2P_LEVEL=NVL, but all of th…
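For reference, a hedged sketch of the timeout mitigation mentioned above (the issue presumably used the `nccl` backend on 2 GPUs; a one-rank `gloo` group is used here only so the snippet runs on CPU):

```python
import os
from datetime import timedelta
import torch.distributed as dist

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29514")
# NCCL_BUFFSIZE and NCCL_P2P_LEVEL are environment variables and must be
# set before the process (or torchrun launcher) starts, not in code.
dist.init_process_group(
    "gloo",            # the real run would use "nccl"
    rank=0,
    world_size=1,
    timeout=timedelta(minutes=60),  # raise the collective-op timeout
)
initialized = dist.is_initialized()
dist.destroy_process_group()
```

A longer timeout only masks a stalled collective; if one rank never reaches the collective (e.g. due to uneven data sharding), the hang will persist regardless of the timeout value.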