NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

The error in loading Llama pretrain checkpoint for NeVa(LLAVA) #8898

Closed WeianMao closed 4 months ago

WeianMao commented 5 months ago

When I train the NeVa model, I get the following error:

[NeMo I 2024-04-12 03:38:58 neva_model:252] Loading LLM weights from checkpoint /home/nemo/llama_weights/vicuna-2-7b.nemo
Loading distributed checkpoint with TensorStoreLoadShardedStrategy
Error executing job with overrides: ['trainer.precision=bf16', 'trainer.num_nodes=1', 'trainer.devices=1', 'trainer.val_check_interval=1000', 'trainer.limit_val_batches=5', 'trainer.log_every_n_steps=1', 'trainer.max_steps=1000', 'model.megatron_amp_O2=True', 'model.micro_batch_size=1', 'model.global_batch_size=2', 'model.tensor_model_parallel_size=1', 'model.pipeline_model_parallel_size=1', 'model.mcore_gpt=True', 'model.transformer_engine=True', 'model.data.data_path=/data1/data/datasets--liuhaotian--LLaVA-Pretrain/blip_laion_cc_sbu_558k.json', 'model.data.image_folder=/data1/data/datasets--liuhaotian--LLaVA-Pretrain', 'model.tokenizer.library=sentencepiece', 'model.tokenizer.model=/home/nemo/llama_weights/tokenizer_neva.model', 'model.encoder_seq_length=4096', 'model.num_layers=32', 'model.hidden_size=4096', 'model.ffn_hidden_size=16384', 'model.num_attention_heads=32', 'model.normalization=layernorm1p', 'model.do_layer_norm_weight_decay=False', 'model.apply_query_key_layer_scaling=True', 'model.activation=squared-relu', 'model.headscale=False', 'model.position_embedding_type=rope', 'model.rotary_percentage=0.5', 'model.num_query_groups=null', 'model.data.num_workers=0', 'model.mm_cfg.llm.from_pretrained=/home/nemo/llama_weights/vicuna-2-7b.nemo', 'model.mm_cfg.llm.model_type=nvgpt', 'model.data.conv_template=nvgpt', 'model.mm_cfg.vision_encoder.from_pretrained=/home/nemo/openai_weights/clip-vit-large-patch14-336', 'model.mm_cfg.vision_encoder.from_hf=True', 'model.data.image_token_len=256', 'model.optim.name=fused_adam', 'exp_manager.create_checkpoint_callback=True', 'exp_manager.create_wandb_logger=False', 'exp_manager.wandb_logger_kwargs.project=neva_demo']
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/megatron/core/dist_checkpointing/strategies/tensorstore.py", line 123, in open_ts_array
    arr = ts.open(ts.Spec(spec), open=True).result()
ValueError: NOT_FOUND: Error opening "zarr" driver: Metadata at local file "/tmp/tmpe2_bw_kv/model_weights/model.decoder.layers.self_attention.linear_qkv.layer_norm_bias/.zarray" does not exist [source locations='tensorstore/driver/kvs_backed_chunk_driver.cc:1255\ntensorstore/driver/driver.cc:114'] [tensorstore_spec='{\"context\":{\"cache_pool\":{},\"data_copy_concurrency\":{},\"file_io_concurrency\":{},\"f
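
For debugging, here is a minimal diagnostic sketch (not part of NeMo; it assumes a .nemo file is a plain tar archive whose sharded weights live under model_weights/, as the /tmp extraction path in the error suggests). It lists the per-tensor directories so you can check whether model.decoder.layers.self_attention.linear_qkv.layer_norm_bias exists in the converted checkpoint at all:

# Diagnostic sketch: enumerate the sharded-weight keys stored inside a .nemo archive.
import tarfile

nemo_path = "/home/nemo/llama_weights/vicuna-2-7b.nemo"  # path from the log above

with tarfile.open(nemo_path) as tar:
    keys = set()
    for member in tar.getmembers():
        name = member.name.lstrip("./")
        if name.startswith("model_weights/"):
            parts = name.split("/")
            if len(parts) > 1 and parts[1]:
                keys.add(parts[1])  # one directory per sharded tensor

for key in sorted(keys):
    print(key)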

Steps/Code to reproduce bug

First, I used the following script to convert the Llama HF checkpoint to a NeMo checkpoint (I tried both Vicuna and Llama, and got the same error in each case):

python scripts/checkpoint_converters/convert_llama_hf_to_nemo.py --input_name_or_path /data1/weight/llama_weights/models--lmsys--vicuna-7b-v1.5 --output_path /home/nemo/llama_weights/vicuna-2-7b.nemo
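
A quick sanity check (a sketch, not part of the converter script): verify that --input_name_or_path points at an actual Hugging Face model directory with config.json, tokenizer files, and weight shards at the top level, rather than a hub cache folder such as models--lmsys--vicuna-7b-v1.5, which nests those files under snapshots/<hash>/. Both calls below fail loudly if the layout is wrong:

# Sanity check on the converter input directory.
from transformers import AutoConfig, AutoTokenizer

path = "/data1/weight/llama_weights/models--lmsys--vicuna-7b-v1.5"  # the path passed to the converter above
config = AutoConfig.from_pretrained(path)        # raises if config.json is not directly in this directory
tokenizer = AutoTokenizer.from_pretrained(path)  # raises if the tokenizer files are missing
print(config.model_type, config.num_hidden_layers, config.hidden_size, len(tokenizer))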

Then, I launched the training process (I tried 1 GPU and 8 GPUs, but got the same error):

CUDA_VISIBLE_DEVICES=2 NCCL_P2P_DISABLE=1 CUDA_LAUNCH_BLOCKING=1 torchrun --nproc_per_node=1 /opt/NeMo/examples/multimodal/multimodal_llm/neva/neva_pretrain.py \
    trainer.precision=bf16 \
    trainer.num_nodes=1 \
    trainer.devices=1 \
    trainer.val_check_interval=1000 \
    trainer.limit_val_batches=5 \
    trainer.log_every_n_steps=1 \
    trainer.max_steps=1000 \
    model.megatron_amp_O2=True \
    model.micro_batch_size=1 \
    model.global_batch_size=2 \
    model.tensor_model_parallel_size=1 \
    model.pipeline_model_parallel_size=1 \
    model.mcore_gpt=True \
    model.transformer_engine=True \
    model.data.data_path=/data1/data/datasets--liuhaotian--LLaVA-Pretrain/blip_laion_cc_sbu_558k.json \
    model.data.image_folder=/data1/data/datasets--liuhaotian--LLaVA-Pretrain \
    model.tokenizer.library=sentencepiece \
    model.tokenizer.model=/home/nemo/llama_weights/tokenizer_neva.model \
    model.encoder_seq_length=4096 \
    model.num_layers=32 \
    model.hidden_size=4096 \
    model.ffn_hidden_size=16384 \
    model.num_attention_heads=32 \
    model.normalization=layernorm1p \
    model.do_layer_norm_weight_decay=False \
    model.apply_query_key_layer_scaling=True \
    model.activation=squared-relu \
    model.headscale=False \
    model.position_embedding_type=rope \
    model.rotary_percentage=0.5 \
    model.num_query_groups=null \
    model.data.num_workers=0 \
    model.mm_cfg.llm.from_pretrained=/home/nemo/llama_weights/vicuna-2-7b.nemo \
    model.mm_cfg.llm.model_type=nvgpt \
    model.data.conv_template=nvgpt \
    model.mm_cfg.vision_encoder.from_pretrained='/home/nemo/openai_weights/clip-vit-large-patch14-336' \
    model.mm_cfg.vision_encoder.from_hf=True \
    model.data.image_token_len=256 \
    model.optim.name="fused_adam" \
    exp_manager.create_checkpoint_callback=True \
    exp_manager.create_wandb_logger=False \
    exp_manager.wandb_logger_kwargs.project=neva_demo
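
For reference, a small sketch (it assumes the converted .nemo archive stores its model config as model_config.yaml at the archive root) that prints the architecture fields baked into the checkpoint, so they can be compared with the overrides above (normalization, activation, position_embedding_type, and so on), since the checkpoint and the training config have to agree for the LLM weights to load into the NeVa model:

# Read the config stored inside the converted .nemo archive and print key architecture fields.
import tarfile
import yaml

cfg = None
with tarfile.open("/home/nemo/llama_weights/vicuna-2-7b.nemo") as tar:
    for member in tar.getmembers():
        if member.name.lstrip("./") == "model_config.yaml":  # assumed location of the stored config
            cfg = yaml.safe_load(tar.extractfile(member))
            break

if cfg is None:
    raise SystemExit("model_config.yaml not found in the archive")

for key in ("num_layers", "hidden_size", "ffn_hidden_size", "num_attention_heads",
            "num_query_groups", "normalization", "activation", "position_embedding_type"):
    print(f"{key} = {cfg.get(key)}")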

Expected behavior

The training should start.

Environment overview (please complete the following information)

I am on the main branch, and I use the following Docker container:

sudo docker run --runtime=nvidia --gpus all -it --rm \
    -v ~/project/NeMo:/opt/NeMo \
    -v /home/nemo:/home/nemo \
    -v /data1:/data1 \
    --shm-size=8g -p 8888:8888 \
    --ulimit memlock=-1 --ulimit stack=67108864 \
    nvcr.io/nvidia/nemo:24.01.speech

Environment details

I tried to build NeMo inside the Docker container; however, it does not work.

Additional context

8x H800 GPUs. I'm on commit 97d1abb2bca0b5daff6d434c4bb340d3bb702e86.

WeianMao commented 5 months ago

I'm on commit 97d1abb2bca0b5daff6d434c4bb340d3bb702e86.

github-actions[bot] commented 4 months ago

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] commented 4 months ago

This issue was closed because it has been inactive for 7 days since being marked as stale.

KookHoiKim commented 3 months ago

Has this error been solved? It seems that _load_state_dict_from_disk expects a model.ckpt file, but untarring model.nemo produces a model_weights folder instead.
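
A rough illustration of the layout check this comment describes (the exact file names are assumptions taken from the comment and the error log, not from NeMo source): after untarring the .nemo file, a legacy checkpoint is a single *.ckpt file at the top level, while a distributed (zarr) checkpoint is a model_weights/ directory of per-tensor subfolders:

# Distinguish the two checkpoint layouts in an already-extracted .nemo archive.
import glob
import os

extracted = "/tmp/vicuna-2-7b_extracted"  # hypothetical directory holding the untarred .nemo contents

single_ckpt = glob.glob(os.path.join(extracted, "*.ckpt"))              # legacy single-file layout
weights_dir = os.path.isdir(os.path.join(extracted, "model_weights"))   # distributed (zarr) layout
print("top-level .ckpt files:", single_ckpt or "none")
print("model_weights/ directory present:", weights_dir)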

C080 commented 2 months ago

Same issue. I managed to run the pretraining script by setting model.mm_cfg.llm.from_pretrained=null and it works, but then it seems to pretrain the LLM from scratch (?)