OptimalScale / LMFlow

An Extensible Toolkit for Finetuning and Inference of Large Foundation Models. Large Models for All.
https://optimalscale.github.io/LMFlow/
Apache License 2.0

If I use llama30b-lora-170k, how do I set --model_name_or_path? #183

Closed Marine98k closed 1 year ago

Marine98k commented 1 year ago

When I set --model_name_or_path llama33b-lora \ I get the error: --model_name_or_path: command not found

research4pan commented 1 year ago

Thanks for your interest in LMFlow! Could you please provide the command you are running? Looks like this option is for some examples/*.py. Thanks 😄

2003pro commented 1 year ago

For --model_name_or_path, you can first try the Hugging Face model name alias "pinkmanlove/llama-7b-hf". The model weights will then be downloaded automatically from the Hugging Face Hub, which may take several minutes to hours.

Also, if you already have the model weights on your machine, you can set it to a local directory path containing the following files:

(screenshot: a listing of the expected checkpoint files)
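
For concreteness, a minimal sketch of both options, assuming the evaluate.py entry point and flags that appear elsewhere in this thread (the dataset and answer-type values are only examples):

```bash
# Option 1: Hugging Face Hub alias -- weights are downloaded
# automatically on first use (this can take minutes to hours).
MODEL=pinkmanlove/llama-7b-hf

# Option 2: a local directory that already holds the full HF-format
# checkpoint (config.json, tokenizer files, and every
# pytorch_model-*.bin shard). Uncomment and adjust if you go this route.
# MODEL=/path/to/llama-7b-hf

deepspeed examples/evaluate.py \
    --answer_type medmcqa \
    --model_name_or_path "$MODEL" \
    --dataset_path data/MedQA-USMLE/validation \
    --deepspeed examples/ds_config.json
```
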
Marine98k commented 1 year ago

> Thanks for your interest in LMFlow! Could you please provide the command you are running? Looks like this option is for some examples/*.py. Thanks 😄

I am confused about what you mentioned in step 4.1 about obtaining the LLaMA license. Do I need to download the code from the linked GitHub repository? How can I obtain permission?

Looking forward to your answer.

Marine98k commented 1 year ago

> This may need several minutes

I have downloaded the tuned model parameters you provided. Do I still need to download LLaMA as described in step 4?

shizhediao commented 1 year ago

Hi, if you are using "pinkmanlove/llama-7b-hf", then you can skip step 4.
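
For reference, a trimmed sketch of finetuning against the Hub alias; the flags mirror the run_finetune_with_lora.sh invocation that appears later in this thread, so treat the exact values as illustrative rather than prescriptive:

```bash
# No local LLaMA download or conversion (step 4) is needed when the
# Hub alias is used as the base model.
deepspeed examples/finetune.py \
    --model_name_or_path pinkmanlove/llama-7b-hf \
    --dataset_path data/alpaca/train \
    --output_dir output_models/finetune_with_lora \
    --use_lora 1 \
    --lora_r 8 \
    --per_device_train_batch_size 1 \
    --block_size 512 \
    --bf16 \
    --deepspeed configs/ds_config_zero2.json \
    --do_train
```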

Marine98k commented 1 year ago

> For --model_name_or_path, you can first try the Hugging Face model name alias "pinkmanlove/llama-7b-hf". The model weights will then be downloaded automatically from the Hugging Face Hub, which may take several minutes to hours.

> Also, if you already have the model weights on your machine, you can set it to a local directory path containing the following files

I have the following command:

--model_name_or_path llama/pytorch_model-00033-of-00033.bin \
--lora_model_path output_models/instruction_ckpt/llama7b-lora \
--dataset_path data/alpaca/test \
--prompt_structure "Input: {input}" \
--deepspeed examples/ds_config.json

and get these errors:

TypeError: expected str, bytes or os.PathLike object, not NoneType
./scripts/run_evaluation_with_lora.sh: line 10: --model_name_or_path: command not found

(pytorch_model-00033-of-00033.bin is the llama-7b-hf model.)

When I run

python ./scripts/convert_llama_weights_to_hf.py --input_dir llama --model_size 7B --output_dir llama/llama-7b-hf

I get

FileNotFoundError: [Errno 2] No such file or directory: 'llama/7B/params.json'

I am confused about steps 4 and 5. Could you clarify them?

research4pan commented 1 year ago

Thanks! It looks like --model_name_or_path llama/pytorch_model-00033-of-00033.bin is misspecified here, since llama/pytorch_model-00033-of-00033.bin is only 1 of the 33 shards of the llama model. You may need to specify the model directory that contains this file. Hope that resolves your problem. Thanks!
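
In other words, the flag should name the checkpoint directory rather than a single shard inside it. A minimal sketch, reusing the paths from the command above (the directory layout is an assumption):

```bash
# Wrong: a single shard file (only 1 of the 33 shards)
#   --model_name_or_path llama/pytorch_model-00033-of-00033.bin
# Right: the directory that holds all shards plus config.json and the
#   tokenizer files.
deepspeed examples/evaluate.py \
    --model_name_or_path llama \
    --lora_model_path output_models/instruction_ckpt/llama7b-lora \
    --dataset_path data/alpaca/test \
    --prompt_structure "Input: {input}" \
    --deepspeed examples/ds_config.json
```

As for the separate FileNotFoundError: convert_llama_weights_to_hf.py expects the original Meta checkpoint layout under --input_dir (roughly 7B/params.json, 7B/consolidated.*.pth, and tokenizer.model); if your weights are already in HF format, that conversion step can be skipped.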

Marine98k commented 1 year ago

> Thanks! It looks like --model_name_or_path llama/pytorch_model-00033-of-00033.bin is misspecified here, since llama/pytorch_model-00033-of-00033.bin is only 1 of the 33 shards of the llama model. You may need to specify the model directory that contains this file. Hope that resolves your problem. Thanks!

Thank you for your reply. I have downloaded all the files, but something goes wrong:

llama does not support RAM optimized load. Automatically use original load instead.
KeyError: 'text'
[2023-04-10 19:43:26,644] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 30501
[2023-04-10 19:43:26,644] [ERROR] [launch.py:324:sigkill_handler] ['/home/dell/anaconda3/envs/lmflow/bin/python', '-u', 'examples/evaluate.py', '--local_rank=0', '--answer_type', 'medmcqa', '--model_name_or_path', '/home/dell/zxh/LMFlow-main/llama-7b-hf', '--dataset_path', 'data/MedQA-USMLE/validation', '--deepspeed', 'examples/ds_config.json'] exits with return code = 1

It seems llama cannot be used locally. Have you tested it?

research4pan commented 1 year ago

Thanks for providing more information! We've recently fixed a bug for evaluation; could you please check whether the code in the latest main branch works? Also, it would be nice if you could provide the whole error message and check whether finetuning with --model_name_or_path pinkmanlove/llama-7b-hf works. Thanks 🙏
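
For context, the KeyError: 'text' above comes from the perplexity evaluation path, which iterates data_dict["instances"] and expects every instance to carry a "text" field. A minimal sketch of a dataset file in that shape (the "type" field is an assumption based on LMFlow's documented data format; adjust the path and content for your own data):

```bash
mkdir -p data/my_eval
cat > data/my_eval/test.json <<'EOF'
{
  "type": "text_only",
  "instances": [
    { "text": "Question: ...\nAnswer: ..." }
  ]
}
EOF
```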

Marine98k commented 1 year ago

> We've recently fixed a bug for evaluation

(lmflow) dell@dell-Precision-7920-Tower:~/zxh/LMFlow-main$ ./scripts/run_evaluation.sh

[2023-04-11 11:12:41,225] [WARNING] [runner.py:186:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only. Detected CUDA_VISIBLE_DEVICES=1: setting --include=localhost:1 [2023-04-11 11:12:41,241] [INFO] [runner.py:550:main] cmd = /home/dell/anaconda3/envs/lmflow/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMV19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None examples/evaluate.py --answer_type medmcqa --model_name_or_path /home/dell/zxh/LMFlow-main/llama-7b-hf --dataset_path data/MedQA-USMLE/validation --deepspeed examples/ds_config.json [2023-04-11 11:12:42,684] [INFO] [launch.py:142:main] WORLD INFO DICT: {'localhost': [1]} [2023-04-11 11:12:42,684] [INFO] [launch.py:148:main] nnodes=1, num_local_procs=1, node_rank=0 [2023-04-11 11:12:42,684] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]}) [2023-04-11 11:12:42,684] [INFO] [launch.py:162:main] dist_world_size=1 [2023-04-11 11:12:42,684] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=1 llama does not support RAM optimized load. Automatically use original load instead. Loading checkpoint shards: 100%|██████████████████████████████████████████████████| 2/2 [00:07<00:00, 3.62s/it] [2023-04-11 11:13:32,738] [INFO] [comm.py:652:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl [2023-04-11 11:13:32,739] [INFO] [logging.py:93:log_dist] [Rank 0] DeepSpeed info: version=0.8.3, git-hash=unknown, git-branch=unknown [2023-04-11 11:13:38,502] [INFO] [logging.py:93:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False [2023-04-11 11:13:38,503] [INFO] [logging.py:93:log_dist] [Rank 0] Creating BF16 optimizer [2023-04-11 11:13:38,613] [INFO] [utils.py:829:see_memory_usage] begin bf16_optimizer [2023-04-11 11:13:38,613] [INFO] [utils.py:830:see_memory_usage] MA 12.58 GB Max_MA 12.58 GB CA 12.58 GB Max_CA 13 GB [2023-04-11 11:13:38,614] [INFO] [utils.py:838:see_memory_usage] CPU Virtual Memory: used = 16.23 GB, percent = 12.9% Using /home/dell/.cache/torch_extensions/py39_cu117 as PyTorch extensions root... Emitting ninja build file /home/dell/.cache/torch_extensions/py39_cu117/utils/build.ninja... Building extension module utils... Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N) ninja: no work to do. Loading extension module utils... Time to load utils op: 0.28022050857543945 seconds [2023-04-11 11:13:39,504] [INFO] [utils.py:829:see_memory_usage] end bf16_optimizer [2023-04-11 11:13:39,505] [INFO] [utils.py:830:see_memory_usage] MA 12.58 GB Max_MA 12.58 GB CA 12.58 GB Max_CA 13 GB [2023-04-11 11:13:39,505] [INFO] [utils.py:838:see_memory_usage] CPU Virtual Memory: used = 16.27 GB, percent = 13.0% [2023-04-11 11:13:39,506] [INFO] [config.py:1018:print] DeepSpeedEngine configuration: [2023-04-11 11:13:39,506] [INFO] [config.py:1022:print] activation_checkpointing_config { "partition_activations": false, "contiguous_memory_optimization": false, "cpu_checkpointing": false, "number_checkpoints": null, "synchronize_checkpoint_boundary": false, "profile": false } [2023-04-11 11:13:39,506] [INFO] [config.py:1022:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True} [2023-04-11 11:13:39,506] [INFO] [config.py:1022:print] amp_enabled .................. 
False [2023-04-11 11:13:39,506] [INFO] [config.py:1022:print] amp_params ................... False [2023-04-11 11:13:39,506] [INFO] [config.py:1022:print] autotuning_config ............ { "enabled": false, "start_step": null, "end_step": null, "metric_path": null, "arg_mappings": null, "metric": "throughput", "model_info": null, "results_dir": "autotuning_results", "exps_dir": "autotuning_exps", "overwrite": true, "fast": true, "start_profile_step": 3, "end_profile_step": 5, "tuner_type": "gridsearch", "tuner_early_stopping": 5, "tuner_num_trials": 50, "model_info_path": null, "mp_size": 1, "max_train_batch_size": null, "min_train_batch_size": 1, "max_train_micro_batch_size_per_gpu": 1.024000e+03, "min_train_micro_batch_size_per_gpu": 1, "num_tuning_micro_batch_sizes": 3 } [2023-04-11 11:13:39,507] [INFO] [config.py:1022:print] bfloat16_enabled ............. True [2023-04-11 11:13:39,507] [INFO] [config.py:1022:print] checkpoint_parallel_write_pipeline False [2023-04-11 11:13:39,507] [INFO] [config.py:1022:print] checkpoint_tag_validation_enabled True [2023-04-11 11:13:39,507] [INFO] [config.py:1022:print] checkpoint_tag_validation_fail False [2023-04-11 11:13:39,507] [INFO] [config.py:1022:print] comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7f5807f2ae80> [2023-04-11 11:13:39,507] [INFO] [config.py:1022:print] communication_data_type ...... None [2023-04-11 11:13:39,507] [INFO] [config.py:1022:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}} [2023-04-11 11:13:39,507] [INFO] [config.py:1022:print] curriculum_enabled_legacy .... False [2023-04-11 11:13:39,507] [INFO] [config.py:1022:print] curriculum_params_legacy ..... False [2023-04-11 11:13:39,507] [INFO] [config.py:1022:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}} [2023-04-11 11:13:39,507] [INFO] [config.py:1022:print] data_efficiency_enabled ...... False [2023-04-11 11:13:39,507] [INFO] [config.py:1022:print] dataloader_drop_last ......... False [2023-04-11 11:13:39,507] [INFO] [config.py:1022:print] disable_allgather ............ False [2023-04-11 11:13:39,507] [INFO] [config.py:1022:print] dump_state ................... 
False [2023-04-11 11:13:39,507] [INFO] [config.py:1022:print] dynamic_loss_scale_args ...... None [2023-04-11 11:13:39,507] [INFO] [config.py:1022:print] eigenvalue_enabled ........... False [2023-04-11 11:13:39,507] [INFO] [config.py:1022:print] eigenvalue_gas_boundary_resolution 1 [2023-04-11 11:13:39,507] [INFO] [config.py:1022:print] eigenvalue_layer_name ........ bert.encoder.layer [2023-04-11 11:13:39,507] [INFO] [config.py:1022:print] eigenvalue_layer_num ......... 0 [2023-04-11 11:13:39,507] [INFO] [config.py:1022:print] eigenvalue_max_iter .......... 100 [2023-04-11 11:13:39,507] [INFO] [config.py:1022:print] eigenvalue_stability ......... 1e-06 [2023-04-11 11:13:39,507] [INFO] [config.py:1022:print] eigenvalue_tol ............... 0.01 [2023-04-11 11:13:39,507] [INFO] [config.py:1022:print] eigenvalue_verbose ........... False [2023-04-11 11:13:39,507] [INFO] [config.py:1022:print] elasticity_enabled ........... False [2023-04-11 11:13:39,507] [INFO] [config.py:1022:print] flops_profiler_config ........ { "enabled": false, "profile_step": 1, "module_depth": -1, "top_modules": 1, "detailed": true, "output_file": null } [2023-04-11 11:13:39,507] [INFO] [config.py:1022:print] fp16_auto_cast ............... None [2023-04-11 11:13:39,507] [INFO] [config.py:1022:print] fp16_enabled ................. False [2023-04-11 11:13:39,507] [INFO] [config.py:1022:print] fp16_master_weights_and_gradients False [2023-04-11 11:13:39,507] [INFO] [config.py:1022:print] global_rank .................. 0 [2023-04-11 11:13:39,507] [INFO] [config.py:1022:print] grad_accum_dtype ............. None [2023-04-11 11:13:39,507] [INFO] [config.py:1022:print] gradient_accumulation_steps .. 1 [2023-04-11 11:13:39,507] [INFO] [config.py:1022:print] gradient_clipping ............ 0.0 [2023-04-11 11:13:39,507] [INFO] [config.py:1022:print] gradient_predivide_factor .... 1.0 [2023-04-11 11:13:39,507] [INFO] [config.py:1022:print] initial_dynamic_scale ........ 1 [2023-04-11 11:13:39,507] [INFO] [config.py:1022:print] load_universal_checkpoint .... False [2023-04-11 11:13:39,507] [INFO] [config.py:1022:print] loss_scale ................... 1.0 [2023-04-11 11:13:39,507] [INFO] [config.py:1022:print] memory_breakdown ............. False [2023-04-11 11:13:39,507] [INFO] [config.py:1022:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False [2023-04-11 11:13:39,507] [INFO] [config.py:1022:print] nebula_config ................ { "enabled": false, "persistent_storage_path": null, "persistent_time_interval": 100, "num_of_version_in_retention": 2, "enable_nebula_load": true, "load_path": null } [2023-04-11 11:13:39,507] [INFO] [config.py:1022:print] optimizer_legacy_fusion ...... False [2023-04-11 11:13:39,507] [INFO] [config.py:1022:print] optimizer_name ............... None [2023-04-11 11:13:39,507] [INFO] [config.py:1022:print] optimizer_params ............. None [2023-04-11 11:13:39,507] [INFO] [config.py:1022:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0} [2023-04-11 11:13:39,507] [INFO] [config.py:1022:print] pld_enabled .................. False [2023-04-11 11:13:39,508] [INFO] [config.py:1022:print] pld_params ................... 
False [2023-04-11 11:13:39,508] [INFO] [config.py:1022:print] prescale_gradients ........... False [2023-04-11 11:13:39,508] [INFO] [config.py:1022:print] scheduler_name ............... None [2023-04-11 11:13:39,508] [INFO] [config.py:1022:print] scheduler_params ............. None [2023-04-11 11:13:39,508] [INFO] [config.py:1022:print] sparse_attention ............. None [2023-04-11 11:13:39,508] [INFO] [config.py:1022:print] sparse_gradients_enabled ..... False [2023-04-11 11:13:39,508] [INFO] [config.py:1022:print] steps_per_print .............. 2000 [2023-04-11 11:13:39,508] [INFO] [config.py:1022:print] train_batch_size ............. 1 [2023-04-11 11:13:39,508] [INFO] [config.py:1022:print] train_micro_batch_size_per_gpu 1 [2023-04-11 11:13:39,508] [INFO] [config.py:1022:print] use_node_local_storage ....... False [2023-04-11 11:13:39,508] [INFO] [config.py:1022:print] wall_clock_breakdown ......... False [2023-04-11 11:13:39,508] [INFO] [config.py:1022:print] world_size ................... 1 [2023-04-11 11:13:39,508] [INFO] [config.py:1022:print] zero_allow_untested_optimizer False [2023-04-11 11:13:39,508] [INFO] [config.py:1022:print] zero_config .................. stage=0 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500,000,000 allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=False load_from_fp32_weights=True elastic_checkpoint=False offload_param=None offload_optimizer=None sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50,000,000 param_persistence_threshold=100,000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False [2023-04-11 11:13:39,508] [INFO] [config.py:1022:print] zero_enabled ................. False [2023-04-11 11:13:39,508] [INFO] [config.py:1022:print] zero_force_ds_cpu_optimizer .. True [2023-04-11 11:13:39,508] [INFO] [config.py:1022:print] zero_optimization_stage ...... 0 [2023-04-11 11:13:39,508] [INFO] [config.py:1007:print_user_config] json = { "fp16": { "enabled": false }, "bf16": { "enabled": true }, "steps_per_print": 2.000000e+03, "train_micro_batch_size_per_gpu": 1, "wall_clock_breakdown": false } Using /home/dell/.cache/torch_extensions/py39_cu117 as PyTorch extensions root... No modifications detected for re-loaded extension module utils, skipping build step... Loading extension module utils... 
Time to load utils op: 0.00033092498779296875 seconds Found cached dataset json (/home/dell/.cache/huggingface/datasets/json/default-28fa6890c91956f5/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51) --------Begin Evaluator Arguments---------- model_args : ModelArguments(model_name_or_path='/home/dell/zxh/LMFlow-main/llama-7b-hf', lora_model_path=None, model_type=None, config_overrides=None, config_name=None, tokenizer_name=None, cache_dir=None, use_fast_tokenizer=True, model_revision='main', use_auth_token=False, torch_dtype=None, use_lora=False, lora_r=8, lora_alpha=32, lora_dropout=0.1, save_aggregated_lora=False, use_ram_optimized_load=False) data_args : DatasetArguments(dataset_path='data/MedQA-USMLE/validation', dataset_name='customized', is_custom_dataset=False, customized_cache_dir='.cache/llm-ft/datasets', dataset_config_name=None, train_file=None, validation_file=None, max_train_samples=None, max_eval_samples=10000000000.0, streaming=False, block_size=None, overwrite_cache=False, validation_split_percentage=5, preprocessing_num_workers=None, disable_group_texts=False, keep_linebreaks=True, test_file=None) evaluator_args : EvaluatorArguments(local_rank=0, random_shuffle=False, use_wandb=False, random_seed=1, output_dir='./output_dir', mixed_precision='bf16', deepspeed='examples/ds_config.json', answer_type='medmcqa', prompt_structure='{input}', evaluate_block_size=512) --------End Evaluator Arguments---------- model_hidden_size = 4096 Traceback (most recent call last): File "/home/dell/zxh/LMFlow-main/examples/evaluate.py", line 46, in evaluator.evaluate(model=model, dataset=dataset, metric='ppl') File "/home/dell/zxh/LMFlow-main/src/lmflow/pipeline/evaluator.py", line 216, in evaluate ppl = self._evaluate_ppl(model, dataset) File "/home/dell/zxh/LMFlow-main/src/lmflow/pipeline/evaluator.py", line 224, in _evaluate_ppl texts = [ instance["text"] for instance in data_dict["instances"] ] File "/home/dell/zxh/LMFlow-main/src/lmflow/pipeline/evaluator.py", line 224, in texts = [ instance["text"] for instance in data_dict["instances"] ] KeyError: 'text' [2023-04-11 11:13:41,754] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 15700 [2023-04-11 11:13:41,755] [ERROR] [launch.py:324:sigkill_handler] ['/home/dell/anaconda3/envs/lmflow/bin/python', '-u', 'examples/evaluate.py', '--local_rank=0', '--answer_type', 'medmcqa', '--model_name_or_path', '/home/dell/zxh/LMFlow-main/llama-7b-hf', '--dataset_path', 'data/MedQA-USMLE/validation', '--deepspeed', 'examples/ds_config.json'] exits with return code = 1

(lmflow) dell@dell-Precision-7920-Tower:~/zxh/LMFlow-main$ CUDA_VISIBLE_DEVICES=1 ./scripts/run_finetune_with_lora.sh [2023-04-11 11:15:27,103] [WARNING] [runner.py:186:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only. Detected CUDA_VISIBLE_DEVICES=1: setting --include=localhost:1 [2023-04-11 11:15:27,120] [INFO] [runner.py:550:main] cmd = /home/dell/anaconda3/envs/lmflow/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMV19 --master_addr=127.0.0.1 --master_port=11000 --enable_each_rank_log=None examples/finetune.py --model_name_or_path pinkmanlove/llama-7b-hf --dataset_path /home/dell/zxh/LMFlow-main/data/alpaca/train --output_dir /home/dell/zxh/LMFlow-main/output_models/finetune_with_lora --overwrite_output_dir --num_train_epochs 0.01 --learning_rate 1e-4 --block_size 512 --per_device_train_batch_size 1 --use_lora 1 --lora_r 8 --save_aggregated_lora 0 --deepspeed configs/ds_config_zero2.json --bf16 --run_name finetune_with_lora --validation_split_percentage 0 --logging_steps 20 --do_train --ddp_timeout 72000 --save_steps 5000 --dataloader_num_workers 1 [2023-04-11 11:15:28,540] [INFO] [launch.py:142:main] WORLD INFO DICT: {'localhost': [1]} [2023-04-11 11:15:28,540] [INFO] [launch.py:148:main] nnodes=1, num_local_procs=1, node_rank=0 [2023-04-11 11:15:28,540] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]}) [2023-04-11 11:15:28,540] [INFO] [launch.py:162:main] dist_world_size=1 [2023-04-11 11:15:28,540] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=1 [2023-04-11 11:15:31,089] [INFO] [comm.py:652:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl 04/11/2023 11:15:31 - WARNING - lmflow.pipeline.finetuner - Process rank: 0, device: cuda:0, n_gpu: 1distributed training: True, 16-bits training: False 04/11/2023 11:15:32 - WARNING - datasets.builder - Found cached dataset json (/home/dell/.cache/huggingface/datasets/json/default-62d3892fe5c8e38a/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51) Downloading shards: 0%| | 0/2 [00:00<?, ?it/s^C--- Logging error ----of-00002.bin: 1%|▏ | 62.9M/9.98G [00:19<39:33, 4.18MB/s] Downloading shards: 0%| | 0/2 [00:23<?, ?it/s] Although the model has been downloaded, it will still be downloaded when executing the command

research4pan commented 1 year ago

Thanks for providing more details! Maybe the download was incomplete last time; you may check ~/.cache/huggingface/hub to see whether the size of the downloaded model looks normal (a 7B model occupies approximately 10-20 GB).

Also, for finetuning/evaluating LoRA models, we normally need both the original (base) model and the LoRA model, so both need to be downloaded. Here it is downloading the base model, since the LoRA model has already been downloaded. I am wondering whether this matches your expectation? Thanks!
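
A quick way to sanity-check the cached download (the models--… directory name follows the huggingface_hub cache layout; adjust if your cache differs):

```bash
# A complete llama-7b checkpoint should occupy roughly 10-20 GB.
du -sh ~/.cache/huggingface/hub/models--pinkmanlove--llama-7b-hf 2>/dev/null
# Or list everything in the hub cache:
du -sh ~/.cache/huggingface/hub/*
```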

shizhediao commented 1 year ago

This issue has been marked as stale because it has not had recent activity. If you think this still needs to be addressed please feel free to reopen this issue. Thanks