-
I have created new JSON files according to my requirements:
* `training.json`
* `test.json`
The model trains using `training.json` but gives an error while calculating `val_loss` using `test.json`.
I …
-
Just for https://github.com/tensorflow/ecosystem/tree/master/docker? I can see the Dockerfile, but no official Docker image. Could official images be provided? Thanks.
-
I tried to run `python train_vqvae.py --path '\home\lab\ffhq_dataset'` in the terminal, but got the error `module 'torch.distributed' has no attribute 'launch'`.
I read some other distributed train…
-
Hello, thank you very much for your excellent work. Based on your code, I noticed that even when training with the command:
`python -m torch.distributed.launch --nproc_per_node=4 --use_env tools/tra…
-
Passing parameters to the dataloader in the `TrainDataModule` may prevent the dataloader from shuffling the data. A fix is to explicitly pass `shuffle=True`. After some further investigation a…
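
The explicit `shuffle=True` fix mentioned in this report could look roughly like the sketch below. The helper name `with_explicit_shuffle` and the example arguments are hypothetical and not taken from the project; the sketch only assumes that the data module forwards a dict of keyword arguments to a PyTorch-style `DataLoader`, which rejects `shuffle=True` when a custom `sampler` or `batch_sampler` is supplied.

```python
def with_explicit_shuffle(loader_kwargs=None):
    """Return a copy of loader_kwargs with shuffle=True unless the caller chose otherwise.

    Hypothetical helper: the name and behaviour are illustrative assumptions.
    """
    kwargs = dict(loader_kwargs or {})
    # A PyTorch DataLoader treats shuffle=True as mutually exclusive with a
    # custom sampler or batch_sampler, so leave shuffle alone in that case.
    if kwargs.get("sampler") is None and kwargs.get("batch_sampler") is None:
        # setdefault keeps an explicit shuffle=False from the caller intact.
        kwargs.setdefault("shuffle", True)
    return kwargs


# Hypothetical user-supplied dataloader parameters.
train_kwargs = with_explicit_shuffle({"batch_size": 32, "num_workers": 4})
print(train_kwargs)
```

The resulting dict can then be unpacked into the training dataloader, so shuffling no longer depends on whether custom parameters were passed.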
-
Optimizing Task Placement and Online Scheduling for Distributed GNN Training Acceleration
AutoByte: Automatic Configuration for Optimal Communication Scheduling in DNN Training
-
### Description
Here is my use case:
I have 4 GPU nodes for training (including computing tensors) on AWS.
I want to save pre-computed tensors to deeplake (Dataset/database/vectorstore), aiming to …
-
Like the title suggests, I’ve managed to get a run going, but it crashes with the following traceback:
```
Traceback (most recent call last):
File "/home/greg/protein-frame-flow/experiments/train_s…
```
-
I'm not using distributed training; I changed the code slightly. The command I run in the terminal is: `python training/exp_runner.py --local_rank=2 --conf confs/dtu_mlp_3views.conf --scan_id 65`, and the…
-
The [LoRA](https://github.com/mlcommons/training/tree/master/llama2_70b_lora) reference implementation has a broken link to an Accelerate config file:
> where the Accelerate config file is [this on…