-
### 🐛 Describe the bug
I first hit a segmentation fault when using Megatron-LM, and then reproduced it with simple torch code.
```
import torch
import os
os.environ['MASTER_ADDR']='{my_ip}'
os…
```
-
```
This is to log the UPC collectives 2.0 extensions.
For reference, the UPC Collectives 2.0 proposal can be found here:
http://upc.lbl.gov/publications/UPC-Collectives-PGAS11.pdf
```
Original is…
-
# Rule request
## Thesis
This code does not raise a single violation that `x` might not be defined:
```python
try:
    function_that_raises()
    x = 1
except:
    ...
print(x)
```
…
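To make the hazard concrete, here is a minimal sketch of what happens at runtime when the call inside `try` fails before the assignment. The body of `function_that_raises` and the `demo` wrapper are hypothetical stand-ins, not part of the original report; the only assumption is that the function raises before `x = 1` executes.

```python
def function_that_raises():
    # Hypothetical stand-in: always fails, so the assignment below is skipped.
    raise RuntimeError("boom")


def demo():
    try:
        function_that_raises()
        x = 1  # never reached because the call above raises
    except Exception:
        pass  # the error is swallowed, so execution continues
    return x  # x was never bound on this path


try:
    demo()
except NameError as exc:
    # UnboundLocalError is a subclass of NameError.
    outcome = type(exc).__name__
else:
    outcome = "no error"

print(outcome)
```

Inside a function the failure surfaces as `UnboundLocalError`; at module level the same pattern would raise a plain `NameError`. Either way, the read after the `try`/`except` is the line a possibly-undefined-variable rule would need to flag.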
-
I consistently get a MemoryError when writing out checkpoints, around the time 40 simulations have run (70k atoms, 8 rounds of 5 trajectories at 5 ns each) using MaxEnt. I am using g5xlarge nodes o…
-
It would be great if the analyzer could detect that in the following program, `a`, `b`, and `c` are doing nothing useful:
```dart
import 'dart:math';
void main() {
  int a = 0;
  a = a + 1;
  …
```
Hixie updated
6 years ago
-
Hi team, thanks for your great work! We are going to run Horovod on new A100 machines with multiple NICs.
Is it possible to speed up AllReduce by splitting a large tensor into a set of small tensors when…
-
This thread is to track the exact variables we want to isolate as we convert the restraints to the new [CustomCVForce](http://docs.openmm.org/development/api-python/generated/simtk.openmm.openmm.Custo…
-
Thank you for taking the time to submit an issue!
## Background information
### What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
OpenMPI v4.1.x b…
-
I was wondering what exactly is the appropriate way to launch multi-worker distributed training jobs with xmanager. Based on my current understanding, it seems that a `Job` must be created for each wo…
-
**Describe the bug**
I encountered an issue when using DeepSpeed 0.12.4 with the [OpenChat trainer](https://github.com/imoneoi/openchat), where checkpointing failed and raised an NCCL error. However,…