-
**Describe the bug**
I have trained a Llama-like model with NeMo using the model config below:
```
model:
mcore_gpt: True
micro_batch_size: 1
global_batch_size: 512
tensor_model_parallel_size…
```
-
Loading huggingface `transformers` models is done with the `from_pretrained()` method. For pytorch or safetensors checkpoints, this method expects a `pytorch_model.bin` or `model.safetensors` file fo…
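As a sketch of what that expectation looks like in practice, a local directory is only directly loadable by `from_pretrained()` if one of those single-file checkpoints is present (sharded checkpoints use an index JSON instead). The helper below is a hypothetical pre-check, not part of the `transformers` API:

```python
from pathlib import Path

def has_loadable_weights(checkpoint_dir: str) -> bool:
    """Return True if the directory contains a single-file weight
    checkpoint that from_pretrained() can pick up directly."""
    d = Path(checkpoint_dir)
    # from_pretrained() looks for either of these file names;
    # sharded models ship an *.index.json plus numbered shards instead.
    return any((d / name).is_file()
               for name in ("pytorch_model.bin", "model.safetensors"))
```

Running this on a checkpoint directory before calling `from_pretrained()` gives a quicker, clearer failure than the loader's own error message.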
-
The molgen.pkl file cannot be found; could you please share a link to it?
-
Hi
I'm new to DMTCP. When I restart an application from a checkpoint, the first time works fine: it continues the process and creates new checkpoints. The application stops (automatically or manually; …
-
For any config in which `checkpoint_files` is a list of more than 4 files, use the FormattedFiles utility to shrink the size of the file.
Example from [llama3/70B_lora](https://github.com/pytorch/torchtune/blob/…
-
Hi
I do not understand why you chose to continue training from `checkpoint_latest.pth` instead of `checkpoint_best.pth`.
`checkpoint_latest.pth` is saved every 50 epochs, so when we restart, we may…
-
Hello,
I tried to clone the repo, but I got the following error.
```
Downloading models/ms_ssim-2021cc-1/0_model.pt (164 MB)
Error downloading object: models/ms_ssim-2021cc-1/0_model.pt (43f4625): …
```
-
With the latest image update (https://github.com/containers/podman/pull/24227), checkpointing is broken in the container test:
```
→ Enter [It] podman checkpoint container with --pre-checkpoint - /v…
```
-
I am trying to convert a checkpoint produced by asynchronous torch_dist saving back to the original torch format, but using convert.py directly results in an error. Could there be an issue with my u…
-
Hi,
I am trying to run the demo; however, the model checkpoints are unavailable.
Could you kindly release the checkpoints?