-
### 🐛 Describe the bug
Hi, this is my code.
```
....
tensor = torch.rand(1, 1, 4096, dtype=torch.float16)
torch.distributed.all_reduce(tensor)
```
When I run this code, I get an error.
```
…
```
-
When I generate a variable font, the `public.fontInfo['familyName']` doesn't seem to be considered during compilation.
I see there has been some work done on the fontTools side. Is there still some w…
-
## What needs to be done
Ensure that the column widths in the eCR Library table are distributed more evenly when the window is wide. In particular, the Patient column should be wider.
## Why it needs t…
-
Hello author, thanks for your excellent work!
I have some questions about reproducing your results. I retrained for 16 epochs on a single 48GB GPU without using distributed training. The reproduc…
-
### Is there an existing issue / discussion for this?
- [X] I have searched the existing issues / discussions
### Is there an existing ans…
-
## 🐛 Bug
When training with `torch.distributed`, I cannot run an additional independent `subprocess`.
I am using `torch.distributed` to run DDP training. Every once in a while I want to start an …
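
One thing that may be worth checking in this situation is whether the child process inherits the rendezvous environment variables of the DDP job. The sketch below is only an illustration under that assumption: the report is truncated, so it is not known whether this is the cause of the failure. The helper names `clean_env` and `launch_independent` are hypothetical, and the variable list covers only those commonly set for `env://` initialisation.

```python
import os
import subprocess
import sys

# Environment variables commonly set for env:// initialisation of torch.distributed.
# Assumption: the reported failure is related to the child inheriting these.
DIST_ENV_VARS = ("RANK", "LOCAL_RANK", "WORLD_SIZE", "MASTER_ADDR", "MASTER_PORT")


def clean_env(base=None):
    """Return a copy of the environment without the distributed-launch variables."""
    env = dict(os.environ if base is None else base)
    for name in DIST_ENV_VARS:
        env.pop(name, None)
    return env


def launch_independent(cmd, base_env=None):
    """Start `cmd` as a child process that does not see the DDP rendezvous settings."""
    return subprocess.Popen(
        cmd,
        env=clean_env(base_env),
        stdout=subprocess.PIPE,
        text=True,
    )


if __name__ == "__main__":
    # Hypothetical usage: run a short, unrelated Python command from a training rank.
    child = launch_independent(
        [sys.executable, "-c", "import os; print(os.environ.get('RANK'))"]
    )
    output, _ = child.communicate()
    print("child saw RANK =", output.strip())
```

Passing an explicit `env` keeps the child separate from the parent's launch configuration without modifying `os.environ` in the training process itself.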
-
**Describe the bug**
During a PPO actor training run with TensorRT enabled, an error occurred during the validation checkpointing process. The training was conducted using the Tensor…
-
### Is there an existing issue for the same bug?
- [X] I have checked the existing issues.
### Branch Name
main
### Commit ID
newest
### Other Environment Information
```Markdown
- Hardware par…
```
-
I can't get interval indexing to work on the TrueFX demo. I can't quite figure out the issue, but here's the stacktrace:
```
MethodError: no method matching isless(::IntervalSets.ClosedInterval{Da…
```
-
Pretraining with ZeRO-3 produces errors, but LoRA fine-tuning with ZeRO-3 works.
The error output is:
python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3375, in reduce_scatter_tensor
…
zhww updated
3 months ago