facebookresearch / metaseq

Repo for external large-scale work
MIT License
6.45k stars, 723 forks

how to get sharded ckpt #653

Open laozhanghahaha opened 1 year ago

laozhanghahaha commented 1 year ago

Hey, I downloaded the 1.3B checkpoint from https://github.com/facebookresearch/metaseq/tree/main/projects/OPT

and I am trying to start fine-tuning with this command:

opt-baselines -n 2 -g 4 -p test_v0 --model-size 1.3b --restore-file 1.3b/reshard.pt --data data-bin/ --checkpoints-dir checkpoints/ --no-save-dir --no-wandb --azure --local

However, the log reports: No existing checkpoint found 1.3b/reshard-model_part-0-shard0.pt

I tried convert_to_singleton.py, but it only produces restored.pt. How can I get the *shard0.pt file?
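To make the filename mismatch concrete, here is a minimal sketch that derives the per-shard filename the trainer appears to look for and reports which files are missing. The suffix pattern is an assumption inferred only from the log line quoted in this issue (reshard.pt becomes reshard-model_part-0-shard0.pt); `expected_shard_path` and `missing_shards` are hypothetical helpers, not part of metaseq.

```python
from pathlib import Path


def expected_shard_path(restore_file, model_part=0, shard=0):
    """Return the per-shard filename inferred from the log in this issue.

    Assumption: "-model_part-{i}-shard{j}" is inserted before the extension,
    e.g. "1.3b/reshard.pt" -> "1.3b/reshard-model_part-0-shard0.pt".
    """
    path = Path(restore_file)
    suffix = f"-model_part-{model_part}-shard{shard}"
    return path.with_name(f"{path.stem}{suffix}{path.suffix}")


def missing_shards(restore_file, model_parts=1, shards=1):
    """List the expected shard files that do not exist on disk."""
    missing = []
    for part in range(model_parts):
        for shard in range(shards):
            candidate = expected_shard_path(restore_file, part, shard)
            if not candidate.exists():
                missing.append(candidate)
    return missing


if __name__ == "__main__":
    # Path taken from the --restore-file argument in the command above.
    for path in missing_shards("1.3b/reshard.pt"):
        print(f"missing: {path}")
```

Running this before launching training shows whether the downloaded files already use the suffixed naming or whether only a single unsuffixed file is present.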

Here is the log:

2023-02-17 07:04:55 | INFO | metaseq.utils | CUDA enviroments for all 4 workers
2023-02-17 07:04:55 | INFO | metaseq.cli.train | training on 4 devices (GPUs/TPUs)
2023-02-17 07:04:55 | INFO | metaseq.cli.train | max tokens per GPU = None and batch size per GPU = 32
2023-02-17 07:04:55 | WARNING | metaseq.checkpoint_utils | Proceeding without metaseq-internal installed! Please check if you need this!
2023-02-17 07:04:55 | WARNING | metaseq.checkpoint_utils | Proceeding without metaseq-internal installed! Please check if you need this!
2023-02-17 07:04:55 | WARNING | metaseq.checkpoint_utils | Proceeding without metaseq-internal installed! Please check if you need this!
2023-02-17 07:04:55 | INFO | metaseq.cli.train | nvidia-smi stats: {'gpu_0_mem_used_gb': 6.5791015625, 'gpu_1_mem_used_gb': 12.6201171875, 'gpu_2_mem_used_gb': 3.76953125, 'gpu_3_mem_used_gb': 12.6591796875, 'gpu_4_mem_used_gb': 9.486328125, 'gpu_5_mem_used_gb': 9.619140625, 'gpu_6_mem_used_gb': 9.728515625, 'gpu_7_mem_used_gb': 9.572265625}
2023-02-17 07:04:55 | WARNING | metaseq.checkpoint_utils | Proceeding without metaseq-internal installed! Please check if you need this!
2023-02-17 07:04:55 | INFO | metaseq.checkpoint_utils | attempting to load checkpoint from: 1.3b/reshard-model_part-0-shard0.pt
2023-02-17 07:04:55 | INFO | metaseq.trainer | No existing checkpoint found 1.3b/reshard-model_part-0-shard0.pt
2023-02-17 07:04:55 | INFO | metaseq.trainer | loading train data for epoch 1

wxthu commented 1 year ago

Regarding --data data-bin: where can I get data-bin?

laozhanghahaha commented 1 year ago

@wxthu Create the folder with mkdir, then put your data in it.

wxthu commented 1 year ago

Do you mean a dataset such as GLUE? I am new to NLP ...

laozhanghahaha commented 1 year ago

@wxthu Your dataset should look like the layout described here: https://github.com/facebookresearch/metaseq/blob/b47f8d115516b539ba0e5002aa3ab707ad10a792/metaseq/tasks/streaming_language_modeling.py#L287
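As a rough illustration of preparing such a folder, the sketch below writes raw text into JSON Lines files under a split/shard directory tree. The layout (`<root>/<split>/<shard>/<name>.jsonl`) and the `"text"` field are assumptions and should be checked against the docstring at the linked line; `write_jsonl_shard` is a hypothetical helper, not a metaseq function.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl_shard(root, split, shard, name, texts):
    """Write one JSON object per line, each with a "text" field.

    Assumed layout (verify against the linked file):
        <root>/<split>/<shard>/<name>.jsonl
    """
    out_dir = Path(root) / split / shard
    out_dir.mkdir(parents=True, exist_ok=True)
    out_file = out_dir / f"{name}.jsonl"
    with out_file.open("w", encoding="utf-8") as f:
        for text in texts:
            f.write(json.dumps({"text": text}) + "\n")
    return out_file


if __name__ == "__main__":
    # Use a temporary directory here; point root at data-bin/ for a real run.
    root = tempfile.mkdtemp()
    train = write_jsonl_shard(root, "train", "00", "example", ["first document", "second document"])
    valid = write_jsonl_shard(root, "valid", "00", "example", ["held-out document"])
    print(train)
    print(valid)
```

Any plain-text corpus can be converted this way; a labeled benchmark such as GLUE would first need its examples flattened into text strings.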