OPPOMKLab / u-LLaVA

u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model
Apache License 2.0
134 stars 6 forks

env problem about segmentation fault (core dumped) #3

Closed: Ace-blue closed this issue 1 month ago

Ace-blue commented 1 month ago

    ======  Model Attributes  ======
    {
        "arch": "ullava_core",
        "conv_type": "conv_simple",
        "llm_path": "/home/zhangzy/data3/VLM/zoo/vicuna-7b-v1.1",
        "projector_from_scratch": true,
        "vision_encoder": "/home/zhangzy/data3/VLM/zoo/clip-vit-large-patch14-336",
        "vision_hidden_layer": -2
    }

    ======  Task Attributes  ======
    {
        "collator_type": "image_collator",
        "type": "image_text_pretrain"
    }

    ======  Processor Attributes  ======
    {
        "clip_image": {
            "image_size": 224,
            "path": "/home/zhangzy/data3/VLM/zoo/clip-vit-large-patch14-336"
        }
    }

    ======  Train Dataset Attributes  ======

    ======== llava_cc3m =======
    {
        "build_info": {
            "anno_dir": "/home/zhangzy/data3/VLM/u-LLaVA/dataset_zoo/LLaVA-CC3M-Pretrain-595K/chat.json",
            "image_dir": "/home/zhangzy/data3/VLM/zoo/ullava_image",
            "portion": 1.0
        },
        "data_type": "image",
        "image_token_len": 256,
        "vis_processor": "clip_image"
    }

    ======  Evaluate Dataset Attributes  ======
    [2024-07-23 21:55:40,415] Loading Tokenizer
    [2024-07-23 21:55:40,434] Initializing uLLaVA Core
    Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:09<00:00,  4.65s/it]
    Some weights of UllavaCoreForCausalLM were not initialized from the model checkpoint at /home/zhangzy/data3/VLM/zoo/vicuna-7b-v1.1 and are newly initialized: ['vision_projector.weight', 'vision_projector.bias', 'vision_encoder.vision_model.embeddings.class_embedding', 'vision_encoder.vision_model.encoder.layers.0.self_attn.out_proj.bias', ... several hundred further vision_encoder.* parameter names omitted ...]
    You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
    Using pad_token, but it is not set yet.
    [2024-07-23 21:57:04,744] LLaMA model, Loading CLIP Vision Encoder
    Some weights of the model checkpoint at /home/zhangzy/data3/VLM/zoo/clip-vit-large-patch14-336 were not used when initializing CLIPVisionModel: ['text_model.encoder.layers.1.layer_norm2.bias', 'text_model.embeddings.token_embedding.weight', 'logit_scale', 'text_projection.weight', 'visual_projection.weight', ... remaining text_model.* parameter names omitted ...]
    - This IS expected if you are initializing CLIPVisionModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
    - This IS NOT expected if you are initializing CLIPVisionModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
    [2024-07-23 21:57:25,970] Number of newly added tokens: 7
    [2024-07-23 21:57:25,970] Pre-training stage
    [2024-07-23 21:57:25,975] BUILDING PROCESSOR 1: clip_image
    Using /home/zhangzy/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
    Detected CUDA files, patching ldflags
    Emitting ninja build file /home/zhangzy/.cache/torch_extensions/py39_cu117/cpu_adam/build.ninja...
    Building extension module cpu_adam...
    Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
    ninja: no work to do.
    Loading extension module cpu_adam...
    Time to load cpu_adam op: 3.9097278118133545 seconds
    Using /home/zhangzy/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
    Emitting ninja build file /home/zhangzy/.cache/torch_extensions/py39_cu117/utils/build.ninja...
    Building extension module utils...
    Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
    ninja: no work to do.
    Loading extension module utils...
    Time to load utils op: 1.1912968158721924 seconds
    [1]    324832 segmentation fault (core dumped)  python train_ullava_core.py --cfg_path 

First, thanks for the great work.

How can I solve this segmentation fault (core dumped) bug? The full shell output is above.

VeritasXu commented 1 month ago

Hello,

Thanks for your comment.

According to your log, the CLIP image encoder you are using is the 336-pixel variant, but "image_token_len" is set to 256, which corresponds to vit-l-14-224. Please change the value of the key "image_token_len" in the config to 576 (for vit-l-14-336): a ViT-L/14 encoder produces (336/14)^2 = 576 patch tokens at 336x336, versus (224/14)^2 = 256 at 224x224.

Alternatively, we will update the code to support LLaVA-1.5 (Vicuna-1.1) in a few weeks.
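
For reference, the relationship between the CLIP input resolution and the number of image tokens can be sketched as below (the helper function is illustrative, not part of the u-LLaVA code):

# Minimal sketch: how "image_token_len" follows from the CLIP ViT input size.
# ViT-L/14 splits the image into 14x14 patches, so the number of patch tokens
# is (image_size / patch_size) ** 2.
def image_token_len(image_size: int, patch_size: int = 14) -> int:
    """Number of patch tokens a ViT produces for a square input image."""
    assert image_size % patch_size == 0, "image size must be divisible by patch size"
    return (image_size // patch_size) ** 2

print(image_token_len(224))  # 256 -> clip-vit-large-patch14
print(image_token_len(336))  # 576 -> clip-vit-large-patch14-336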

VeritasXu commented 1 month ago

For example:

model:
  arch: 'ullava_core'
  llm_path: '/home/zhangzy/data3/VLM/zoo/vicuna-7b-v1.1'
  vision_encoder: '/home/zhangzy/data3/VLM/zoo/clip-vit-large-patch14-336'
  vision_hidden_layer: -2
  projector_from_scratch: true
  conv_type: 'conv_simple'

task:
  type: image_text_pretrain
  collator_type: 'image_collator'

processor:
  clip_image:
    path: '/home/zhangzy/data3/VLM/zoo/clip-vit-large-patch14-336'
    image_size: 336

dataset:
  llava_cc3m:
    data_type: 'image'
    image_token_len: 576
    build_info:
      anno_dir: '/home/zhangzy/data3/VLM/u-LLaVA/dataset_zoo/LLaVA-CC3M-Pretrain-595K/chat.json'
      image_dir: '/home/zhangzy/data3/VLM/zoo/ullava_image'
      portion: 1.0
    vis_processor: 'clip_image'
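
As a quick sanity check, the config could be validated before launching training. The snippet below is a minimal sketch only (it assumes PyYAML is installed and uses the config path from the log above; it is not part of the repository's tooling):

# Hedged sketch: verify that image_token_len is consistent with the CLIP
# processor's image_size before launching train_ullava_core.py.
import yaml

with open("./configs/train/ullava_core_stage1.yaml") as f:
    cfg = yaml.safe_load(f)

image_size = cfg["processor"]["clip_image"]["image_size"]
token_len = cfg["dataset"]["llava_cc3m"]["image_token_len"]

expected = (image_size // 14) ** 2  # ViT-L/14 patch grid
if token_len != expected:
    raise ValueError(
        f"image_token_len={token_len} but image_size={image_size} implies {expected}"
    )
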
Ace-blue commented 1 month ago

Thanks for your timely response.

I've tried setting "image_token_len" to 576 (for vit-l-14-336) and also replacing vit-l-14-336 with vit-l-14, but the segmentation fault (core dumped) still occurs.

Could you provide more information about the environment or configuration?

(Bro, this bug has been blocking me for several days. I've already tried switching machines and rebooting, and it's really strange. Could you please help me out? 😭😭)
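
For completeness, a minimal way to dump the environment details that matter here (purely illustrative; it only assumes torch, deepspeed, and transformers are importable):

# Illustrative only: print the library versions relevant to the
# cpu_adam extension build (where the segmentation fault occurs).
import torch
import deepspeed
import transformers

print("torch:", torch.__version__, "cuda:", torch.version.cuda)
print("cuda available:", torch.cuda.is_available())
print("deepspeed:", deepspeed.__version__)
print("transformers:", transformers.__version__)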

Ace-blue commented 1 month ago

Here is my shell output:

  python train_ullava_core.py --cfg_path ./configs/train/ullava_core_stage1.yaml                                                19:42:20 

  ======  Model Attributes  ======
  {
      "arch": "ullava_core",
      "conv_type": "conv_simple",
      "llm_path": "model_zoo/vicuna_7b_v1.1",
      "projector_from_scratch": true,
      "vision_encoder": "model_zoo/clip-vit-large-patch14",
      "vision_hidden_layer": -2
  }

  ======  Task Attributes  ======
  {
      "collator_type": "image_video_collator",
      "type": "image_text_pretrain"
  }

  ======  Processor Attributes  ======
  {
      "clip_image": {
          "image_size": 224,
          "path": "model_zoo/clip-vit-large-patch14"
      }
  }

  ======  Train Dataset Attributes  ======

  ======== llava_cc3m =======
  {
      "build_info": {
          "anno_dir": "./dataset_zoo/LLaVA-CC3M-Pretrain-595K/chat.json",
          "image_dir": "./dataset_zoo/LLaVA-CC3M-Pretrain-595K/ullava_image",
          "portion": 1.0
      },
      "data_type": "image",
      "image_token_len": 256,
      "vis_processor": "clip_image"
  }

  ======  Evaluate Dataset Attributes  ======
  [2024-07-24 19:42:35,532] Loading Tokenizer
  [2024-07-24 19:42:35,563] Initializing uLLaVA Core
  Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [01:58<00:00, 59.05s/it]
  Some weights of UllavaCoreForCausalLM were not initialized from the model checkpoint at model_zoo/vicuna_7b_v1.1 and are newly initialized: ['vision_projector.weight', 'vision_projector.bias', 'vision_encoder.vision_model.embeddings.position_ids', 'vision_encoder.vision_model.encoder.layers.15.mlp.fc2.bias', ... remaining vision_encoder.* parameter names omitted ...]
'vision_encoder.vision_model.encoder.layers.6.layer_norm2.bias', 'vision_encoder.vision_model.encoder.layers.23.mlp.fc1.weight', 'vision_encoder.vision_model.encoder.layers.20.layer_norm1.bias', 'vision_encoder.vision_model.encoder.layers.18.self_attn.k_proj.bias', 'vision_encoder.vision_model.encoder.layers.11.self_attn.out_proj.bias', 'vision_encoder.vision_model.encoder.layers.8.self_attn.k_proj.bias', 'vision_encoder.vision_model.encoder.layers.5.self_attn.q_proj.weight', 'vision_encoder.vision_model.encoder.layers.3.layer_norm1.bias', 'vision_encoder.vision_model.encoder.layers.4.layer_norm2.bias', 'vision_encoder.vision_model.encoder.layers.22.self_attn.out_proj.weight', 'vision_encoder.vision_model.encoder.layers.6.layer_norm1.bias', 'vision_encoder.vision_model.encoder.layers.1.layer_norm1.weight', 'vision_encoder.vision_model.encoder.layers.17.self_attn.v_proj.weight', 'vision_encoder.vision_model.encoder.layers.12.mlp.fc1.bias', 'vision_encoder.vision_model.encoder.layers.23.self_attn.k_proj.bias', 'vision_encoder.vision_model.encoder.layers.15.layer_norm2.bias', 'vision_encoder.vision_model.encoder.layers.0.self_attn.k_proj.bias', 'vision_encoder.vision_model.encoder.layers.2.self_attn.out_proj.bias', 'vision_encoder.vision_model.encoder.layers.0.self_attn.v_proj.bias', 'vision_encoder.vision_model.encoder.layers.2.layer_norm2.bias', 'vision_encoder.vision_model.encoder.layers.13.mlp.fc1.bias', 'vision_encoder.vision_model.encoder.layers.10.mlp.fc1.bias', 'vision_encoder.vision_model.encoder.layers.22.self_attn.v_proj.weight', 'vision_encoder.vision_model.encoder.layers.22.layer_norm1.bias', 'vision_encoder.vision_model.encoder.layers.23.layer_norm2.weight', 'vision_encoder.vision_model.encoder.layers.20.layer_norm2.weight', 'vision_encoder.vision_model.encoder.layers.21.self_attn.v_proj.bias', 'vision_encoder.vision_model.encoder.layers.7.layer_norm1.weight', 'vision_encoder.vision_model.encoder.layers.12.self_attn.q_proj.weight', 'vision_encoder.vision_model.encoder.layers.1.self_attn.k_proj.weight', 'vision_encoder.vision_model.encoder.layers.21.mlp.fc2.bias', 'vision_encoder.vision_model.encoder.layers.12.self_attn.q_proj.bias', 'vision_encoder.vision_model.encoder.layers.6.layer_norm1.weight', 'vision_encoder.vision_model.encoder.layers.21.mlp.fc2.weight', 'vision_encoder.vision_model.encoder.layers.5.layer_norm1.bias', 'vision_encoder.vision_model.embeddings.class_embedding', 'vision_encoder.vision_model.encoder.layers.0.self_attn.v_proj.weight', 'vision_encoder.vision_model.encoder.layers.22.self_attn.q_proj.bias', 'vision_encoder.vision_model.encoder.layers.8.self_attn.q_proj.bias', 'vision_encoder.vision_model.encoder.layers.16.layer_norm2.bias', 'vision_encoder.vision_model.encoder.layers.14.mlp.fc2.weight', 'vision_encoder.vision_model.post_layernorm.bias', 'vision_encoder.vision_model.encoder.layers.10.mlp.fc1.weight', 'vision_encoder.vision_model.encoder.layers.21.layer_norm1.weight', 'vision_encoder.vision_model.encoder.layers.19.mlp.fc1.weight', 'vision_encoder.vision_model.encoder.layers.16.self_attn.q_proj.weight', 'vision_encoder.vision_model.encoder.layers.4.mlp.fc2.weight', 'vision_encoder.vision_model.encoder.layers.9.layer_norm1.bias', 'vision_encoder.vision_model.encoder.layers.4.self_attn.out_proj.bias', 'vision_encoder.vision_model.encoder.layers.10.layer_norm2.weight', 'vision_encoder.vision_model.encoder.layers.19.mlp.fc2.weight', 'vision_encoder.vision_model.encoder.layers.13.mlp.fc1.weight', 
'vision_encoder.vision_model.encoder.layers.13.self_attn.out_proj.bias', 'vision_encoder.vision_model.encoder.layers.1.mlp.fc1.bias', 'vision_encoder.vision_model.encoder.layers.2.mlp.fc2.bias', 'vision_encoder.vision_model.encoder.layers.21.self_attn.k_proj.bias', 'vision_encoder.vision_model.encoder.layers.15.self_attn.v_proj.bias', 'vision_encoder.vision_model.encoder.layers.4.layer_norm1.weight', 'vision_encoder.vision_model.encoder.layers.7.self_attn.v_proj.bias', 'vision_encoder.vision_model.encoder.layers.20.mlp.fc1.bias', 'vision_encoder.vision_model.encoder.layers.5.self_attn.k_proj.bias', 'vision_encoder.vision_model.encoder.layers.3.self_attn.v_proj.weight', 'vision_encoder.vision_model.encoder.layers.14.self_attn.q_proj.bias', 'vision_encoder.vision_model.encoder.layers.20.self_attn.k_proj.weight', 'vision_encoder.vision_model.encoder.layers.1.layer_norm2.bias', 'vision_encoder.vision_model.encoder.layers.6.self_attn.out_proj.weight', 'vision_encoder.vision_model.encoder.layers.13.layer_norm2.bias', 'vision_encoder.vision_model.encoder.layers.0.mlp.fc1.weight', 'vision_encoder.vision_model.encoder.layers.6.self_attn.q_proj.bias', 'vision_encoder.vision_model.encoder.layers.20.mlp.fc1.weight', 'vision_encoder.vision_model.encoder.layers.9.self_attn.k_proj.bias', 'vision_encoder.vision_model.encoder.layers.12.mlp.fc1.weight', 'vision_encoder.vision_model.encoder.layers.14.mlp.fc1.bias', 'vision_encoder.vision_model.encoder.layers.3.self_attn.q_proj.bias', 'vision_encoder.vision_model.encoder.layers.9.self_attn.k_proj.weight', 'vision_encoder.vision_model.encoder.layers.13.layer_norm1.bias', 'vision_encoder.vision_model.encoder.layers.23.layer_norm1.weight', 'vision_encoder.vision_model.encoder.layers.18.mlp.fc1.weight', 'vision_encoder.vision_model.encoder.layers.19.self_attn.v_proj.weight', 'vision_encoder.vision_model.encoder.layers.23.self_attn.v_proj.bias', 'vision_encoder.vision_model.encoder.layers.19.self_attn.v_proj.bias', 'vision_encoder.vision_model.encoder.layers.14.self_attn.k_proj.weight', 'vision_encoder.vision_model.encoder.layers.11.layer_norm2.bias', 'vision_encoder.vision_model.encoder.layers.7.mlp.fc1.bias', 'vision_encoder.vision_model.encoder.layers.2.self_attn.q_proj.bias']
  You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  Using pad_token, but it is not set yet.
  [2024-07-24 19:45:48,941] LLaMA model, Loading CLIP Vision Encoder
    Some weights of the model checkpoint at model_zoo/clip-vit-large-patch14 were not used when initializing CLIPVisionModel: ['text_model.encoder.layers.4.self_attn.k_proj.weight', ... (remaining text_model.*, text_projection, visual_projection and logit_scale parameter names omitted for brevity) ...]
  - This IS expected if you are initializing CLIPVisionModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
  - This IS NOT expected if you are initializing CLIPVisionModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
  [2024-07-24 19:46:20,610] Number of newly added tokens: 7
  [2024-07-24 19:46:20,611] Pre-training stage
  [2024-07-24 19:46:20,615] BUILDING PROCESSOR 1: clip_image
  Using /home/zhangzy/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
  Detected CUDA files, patching ldflags
  Emitting ninja build file /home/zhangzy/.cache/torch_extensions/py38_cu117/cpu_adam/build.ninja...
  Building extension module cpu_adam...
  Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
  ninja: no work to do.
  Loading extension module cpu_adam...
  Time to load cpu_adam op: 3.719231367111206 seconds
  Using /home/zhangzy/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
  Emitting ninja build file /home/zhangzy/.cache/torch_extensions/py38_cu117/utils/build.ninja...
  Building extension module utils...
  Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
  ninja: no work to do.
  Loading extension module utils...
  Time to load utils op: 0.8467190265655518 seconds
  [1]    156387 segmentation fault (core dumped)  python train_ullava_core.py --cfg_path ./configs/train/ullava_core_stage1.yam
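A quick way to see which Python call is active when a native crash like this happens is the standard library's faulthandler module. This is only a minimal sketch: it reports the Python frame, while the actual fault here sits inside a compiled DeepSpeed extension.

    # Enable the crash handler before anything heavy is imported, e.g. at the top of
    # train_ullava_core.py, or run with `python -X faulthandler ...` /
    # `PYTHONFAULTHANDLER=1` instead of editing the script.
    import faulthandler

    faulthandler.enable()  # dumps the Python traceback on SIGSEGV and other fatal signals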
VeritasXu commented 1 month ago

Hello, please add me on WeChat: jinxu_95. P.S. I will post the solution here after our discussion.

VeritasXu commented 1 month ago

Solved: it was a DeepSpeed installation error. You can disable DeepSpeed while debugging.
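Since the log above crashes right after the cpu_adam extension is loaded, one way to confirm that the DeepSpeed installation is at fault is to build and step that optimizer in isolation, outside the training script. A minimal sketch, assuming a standard DeepSpeed install (DeepSpeedCPUAdam triggers the same cpu_adam JIT build/load):

    # Reproduce the cpu_adam extension load outside of train_ullava_core.py.
    import torch
    from deepspeed.ops.adam import DeepSpeedCPUAdam

    param = torch.nn.Parameter(torch.randn(16, 16))   # dummy CPU parameter
    optimizer = DeepSpeedCPUAdam([param], lr=1e-4)    # JIT-builds/loads cpu_adam

    loss = (param ** 2).sum()
    loss.backward()
    optimizer.step()                                  # segfaults here if the op is broken
    print("cpu_adam loaded and stepped without crashing")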

Ace-blue commented 1 month ago

Solved: it was a DeepSpeed installation error. You can disable DeepSpeed while debugging.

I reinstalled DeepSpeed following this repo and it works well now!
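For anyone hitting the same thing, a small sanity check after reinstalling DeepSpeed (a sketch; ds_report is DeepSpeed's own environment report and lists whether ops such as cpu_adam and utils are compatible with the local CUDA/compiler setup):

    # Verify the reinstalled DeepSpeed before rerunning the full training job.
    import subprocess
    import deepspeed

    print("DeepSpeed version:", deepspeed.__version__)
    subprocess.run(["ds_report"], check=True)  # prints the op/compatibility report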