-
### Is there an existing issue for the same bug?
- [X] I have checked the existing issues.
### Branch Name
2.0-dev
### Commit ID
96a0524287e30a4892c6f5365541a2d221ed4c37
### Other Environment In…
-
## User Interface
- [x] Sail CLI (#245)
- [x] Sail configuration (#279)
## Core Functionalities
- [x] Distributed processing setup (#244)
- [x] Distributed job stages and shuffle (#265)
- [x] …
-
Input MSAs were truncated to a single entry (duplicates of the input sequences), because leaving `msa:` blank causes errors for some reason.
```
>101
MGDWSALGRLLDKVQAYSTAGGKVWLSVLFIFRILLLGTAVESA…
```
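For reference, a minimal sketch of that workaround (the file names and the plain single-sequence MSA format are assumptions), producing a one-entry MSA by duplicating the query sequence:
```
# Hypothetical workaround sketch: build a single-entry MSA that simply
# duplicates the query sequence, since leaving `msa:` blank errors out.
def write_single_entry_msa(input_fasta: str, output_msa: str) -> None:
    with open(input_fasta) as handle:
        lines = [line.strip() for line in handle if line.strip()]
    header, sequence = lines[0], "".join(lines[1:])
    with open(output_msa, "w") as handle:
        handle.write(f"{header}\n{sequence}\n")

# Example (paths are illustrative):
# write_single_entry_msa("101.fasta", "101_msa.fasta")
```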
-
Consecutive Blockwise layers are currently fused into a single layer. This reduces the number of tasks and the associated overhead, and is generally a good thing to do. Currently, the fused output does not gener…
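For context, a minimal sketch (assuming the Dask notion of Blockwise layers; the computation itself is illustrative) showing how fusing consecutive Blockwise layers shrinks the task graph:
```
import dask
import dask.array as da

# Two chained elementwise operations each produce a Blockwise layer.
x = da.ones((1000, 1000), chunks=(100, 100))
y = (x + 1) * 2

# Graph optimization fuses the consecutive Blockwise layers, so the
# optimized graph contains fewer tasks than the raw graph.
(y_opt,) = dask.optimize(y)
print(len(y.dask), len(y_opt.dask))
```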
-
Because, according to the code:
```
if distributed:
    sampler = DistributedSampler(dataset)  # shuffle does not seem to be set here; it looks like shuffle defaults to False
else:
    sampler = RandomSampler(dataset)
```
Also, thank you for your excellent work!
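A minimal sketch (the surrounding variables are illustrative, not from the repository) of passing `shuffle` explicitly so the distributed branch does not depend on the sampler's default:
```
import torch
import torch.distributed as dist
from torch.utils.data import DistributedSampler, RandomSampler, TensorDataset

dataset = TensorDataset(torch.arange(100))
distributed = dist.is_available() and dist.is_initialized()

# Make the shuffling behaviour explicit instead of relying on the default.
if distributed:
    sampler = DistributedSampler(dataset, shuffle=True)
else:
    sampler = RandomSampler(dataset)
```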
-
Hello,
It seems that the dataloader is not adapted to the distributed setting (line 881 of train.py).
The same data entries will be repeatedly loaded and trained on by different processes.
Maybe a sampler sho…
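For illustration, a minimal sketch (assuming a PyTorch setup where the process group is already initialized, e.g. via torchrun; names are illustrative) of sharding the data per rank with `DistributedSampler` and reshuffling it each epoch:
```
import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

# Assumes torch.distributed is already initialized (e.g. via torchrun).
dataset = TensorDataset(torch.arange(1000))
sampler = DistributedSampler(
    dataset,
    num_replicas=dist.get_world_size(),
    rank=dist.get_rank(),
    shuffle=True,
)
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

for epoch in range(3):
    # Reseed so each epoch uses a different shuffle order while every
    # rank still sees a disjoint shard of the dataset.
    sampler.set_epoch(epoch)
    for (batch,) in loader:
        pass
```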
-
This is the same issue as https://github.com/rapidsai/dask-cuda/issues/1408. Cross-posting here as it is more related to cuDF than to `dask-cuda`.
The following snippet works with `DASK_DATAFRAME_…
-
**Describe the bug**
When training a model that consumes more memory, I noticed that my training would stop after a constant number of epochs. Upon further investigation, I found that during training / v…
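For debugging this kind of behaviour, a minimal sketch (assuming a PyTorch-style training loop; the helper is hypothetical) of logging peak GPU memory per epoch to see whether usage keeps growing until training stops:
```
import torch

def log_epoch_memory(epoch: int) -> None:
    # Report the peak GPU memory seen during this epoch, then reset the
    # counter so growth across epochs becomes visible.
    if torch.cuda.is_available():
        peak_mib = torch.cuda.max_memory_allocated() / 2**20
        print(f"epoch {epoch}: peak GPU memory {peak_mib:.1f} MiB")
        torch.cuda.reset_peak_memory_stats()
```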
-
**The error message is as follows:**
```
Traceback (most recent call last):
  File "/data1/bert4rec/bert4rec-main/scripts/bole/loaddata_run_product.py", line 5, in <module>
    config, model, dataset, train_data, valid_data, test_…
```
-
**Describe the bug**
If the training data lives on node-specific storage rather than on NFS, the current logic in https://github.com/NVIDIA/Megatron-LM/blob/0bc3547702464501feefeb5523b7a17e591b21fa/m…