-
### Possible API
- `Keyspace::backup_to(path: P) -> crate::Result`
- `TxKeyspace::backup_to(path: P) -> crate::Result` (just needs to call inner)
(When we have https://github.com/fjall-rs/fjall…
-
RuntimeError: Expected to mark a variable ready only once. This error is caused by one of the following reasons: 1) Use of a module parameter outside the `forward` function. Please make sure model par…
-
`load()` takes an argument `step`. Expose it in the job configuration so we can choose which step to resume training from.
https://github.com/pytorch/torchtitan/blob/193ce981224f6c040f5a4…
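
A minimal sketch of how such an option could behave. The field name `load_step`, the `-1` sentinel for "latest", and the `step-<N>` folder layout are assumptions made for illustration and are not taken from the torchtitan code base:

```python
import re
from dataclasses import dataclass
from pathlib import Path
from typing import List, Optional


@dataclass
class CheckpointConfig:
    """Hypothetical config section; the names are illustrative only."""
    folder: str
    load_step: int = -1  # assumption: -1 means "use the latest available step"


def available_steps(folder: str) -> List[int]:
    """Return the sorted step numbers of sub-folders named `step-<N>` (assumed layout)."""
    steps = []
    for entry in Path(folder).glob("step-*"):
        match = re.fullmatch(r"step-(\d+)", entry.name)
        if match and entry.is_dir():
            steps.append(int(match.group(1)))
    return sorted(steps)


def resolve_load_step(config: CheckpointConfig) -> Optional[int]:
    """Pick the step to pass to `load()`, or None if no checkpoint exists."""
    steps = available_steps(config.folder)
    if not steps:
        return None  # nothing to resume from; start training from scratch
    if config.load_step == -1:
        return steps[-1]  # default: resume from the most recent checkpoint
    if config.load_step not in steps:
        raise ValueError(
            f"No checkpoint for step {config.load_step}; available: {steps}"
        )
    return config.load_step
```

The resolved value would then be handed to `load(step=...)`. Failing loudly on a missing step, rather than silently falling back to the latest one, avoids resuming from a checkpoint the user did not ask for.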
-
Does checkpoint and restore... work?
Is there a sample, a hello-world-style example, that shows creating a checkpoint and restoring it?
So far all of my attempts have failed.
A smoketest demo wou…
-
### System Info
A100
### Who can help?
_No response_
### Information
- [x] The official example scripts
- [ ] My own modified scripts
### Tasks
- [x] An officially supported task in the `exampl…
-
Thank you for your excellent work!
**Question 1:**
I noticed a mismatch between the downloaded checkpoints and the experiment settings. For example, benchmark_mw_DA.sh uses 24, but this checkpoint…
-
How do we change between them?
-
Yesterday, when I used SD3 for DreamBooth fine-tuning, the run kept failing with this error:
If your task is similar to the task the model of the checkpoint was trained on, you can already use T5EncoderModel for predictions without further training
A…
-
When checkpointing with `:approx-frequency` configured ~appropriately high (i.e. hours), it could additionally be useful to be able to trigger a checkpoint manually prior to / during the closing of th…
-
When using dlrover to save checkpoints, the following error always occurs:
```
[2024-11-15 12:30:37,876] [INFO] [engine.py:131:start_saver_process] Start a process to asynchronously save checkpo…