-
```
RETURNN starting up, version 1.20231230.164342+git.f353135e, date/time 2023-12-31-13-21-05 (UTC+0000), pid 2003528, cwd /work/asr4/zeyer/setups-data/combined/2021-05-31/work/i6_core/returnn/t…
```
-
During distributed training, I encountered the following problem when compiling Triton kernels:
```
Traceback (most recent call last):
......
File "/mnt/petrelfs/caoweihan/anaconda3…
```
-
Hello,
I am exploring the capabilities of the cuda-checkpoint utility and have a few questions regarding its support for distributed training scenarios:
1. Checkpoint and Resume During Allreduce: Do…
-
I'm not clear about the given procedure for distributed training. For the first experiment, I have partitioned the PPI dataset into ppi_data_0.dat and ppi_data_1.dat files and loaded them into HDFS.
…
-
Dear author,
Does SE(3)-Transformer support distributed training in Torch?
Thanks
-
Hi, I found that using DataParallel is really slow, so I'm looking at the DistributedDataParallel part of the code. However, I'm not clear on the default configuration needed to utilize distri…
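For reference, the usual way to move from DataParallel to DistributedDataParallel is to initialize a process group and wrap the model. A minimal single-process sketch (the toy model, port, and world size of 1 are illustrative assumptions, not taken from the project in question; a real run would use `torchrun` with one process per GPU):

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process "world" of size 1 on CPU, using the gloo backend;
# under torchrun, rank/world_size would come from the environment instead.
dist.init_process_group(
    backend="gloo",
    init_method="tcp://127.0.0.1:29531",
    rank=0,
    world_size=1,
)

model = torch.nn.Linear(8, 2)      # toy model for illustration
ddp_model = DDP(model)             # wraps the model; syncs gradients across ranks
out = ddp_model(torch.randn(4, 8))

dist.destroy_process_group()
```

With more than one rank, each process would also use a `DistributedSampler` so that every rank sees a disjoint shard of the data.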
-
Hello!
I have a huge dataset which cannot fit on a single machine, and the data has many more users than items. I'm now thinking about training LightFM on a cluster. How can I do it?
Can I train …
-
```
nohup: ignoring input
/root/miniconda3/lib/python3.10/site-packages/colossalai/pipeline/schedule/_utils.py:19: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.u…
```
-
Does Keras support distributed training? Can I use TensorFlow's distributed training tools?
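Keras does train under TensorFlow's `tf.distribute` strategies. A minimal sketch, assuming TensorFlow is installed, using `MirroredStrategy` pinned to the CPU so it runs anywhere (the toy model and random data are illustrative only):

```python
import numpy as np
import tensorflow as tf

# MirroredStrategy replicates the model across local devices;
# here we pin it to the CPU just so the sketch runs everywhere.
strategy = tf.distribute.MirroredStrategy(devices=["/cpu:0"])

# Building and compiling inside strategy.scope() makes the
# variables mirrored; fit() then handles the distribution.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(4, activation="relu"),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

# Tiny random data just to show one distributed fit() step.
x = np.random.rand(16, 8).astype("float32")
y = np.random.rand(16, 1).astype("float32")
history = model.fit(x, y, epochs=1, batch_size=8, verbose=0)
```

For multi-machine setups the same code works with `MultiWorkerMirroredStrategy` plus a `TF_CONFIG` cluster spec.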
-
Hi,
thanks to all the maintainers of this project; it's a great tool to streamline the building and tuning of a Faiss index.
I have a quick dumb question about the training of an index in distribute…