Lightning-AI / lightning-thunder


KeyError: 'type' from torch.compile executor #1040

Closed. tfogal closed this issue 3 weeks ago.

tfogal commented 3 weeks ago

🐛 Bug

We appear to be missing a key in a map internal to the torch.compile executor:

HYDRA_FULL_ERROR=1 \
THUNDER_ANNOTATE_TRACES=1 \
NEMO_THUNDER_NEVA=dynamo \
python3 ./examples/multimodal/multimodal_llm/neva/neva_pretrain.py trainer.precision=bf16-mixed model.megatron_amp_O2=True model.mcore_gpt=False trainer.num_nodes=1 trainer.devices=1 trainer.val_check_interval=10 trainer.limit_val_batches=5 trainer.log_every_n_steps=1 ++exp_manager.max_time_per_run=00:00:03:00 trainer.max_steps=20 model.micro_batch_size=2 model.global_batch_size=4 model.tensor_model_parallel_size=1 model.pipeline_model_parallel_size=1 exp_manager.create_checkpoint_callback=False model.data.data_path=./data/multimodal/tiny-neva/dummy.json model.data.image_folder=./data/multimodal/tiny-neva/images model.tokenizer.library=sentencepiece model.tokenizer.model=./data/multimodal/tiny-neva/tokenizer_add_special.model model.num_layers=2 model.hidden_size=5120 model.ffn_hidden_size=13824 model.num_attention_heads=40 model.normalization=rmsnorm model.data.num_workers=0 model.data.conv_template=llama_2 model.mm_cfg.vision_encoder.from_pretrained=openai/clip-vit-large-patch14 model.mm_cfg.llm.from_pretrained=null model.use_flash_attention=false exp_manager.exp_dir=./foo-neva-train
[NeMo W 2024-08-23 17:49:38 nemo_logging:349] /home/tfogal/env/lib/python3.10/site-packages/megatron/core/tensor_parallel/layers.py:280: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
      def forward(ctx, input, weight, bias, allreduce_dgrad):

[NeMo W 2024-08-23 17:49:38 nemo_logging:349] /home/tfogal/env/lib/python3.10/site-packages/megatron/core/tensor_parallel/layers.py:290: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
      def backward(ctx, grad_output):

[NeMo W 2024-08-23 17:49:38 nemo_logging:349] /home/tfogal/env/lib/python3.10/site-packages/megatron/core/tensor_parallel/layers.py:380: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
      def forward(

[NeMo W 2024-08-23 17:49:38 nemo_logging:349] /home/tfogal/env/lib/python3.10/site-packages/megatron/core/tensor_parallel/layers.py:419: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
      def backward(ctx, grad_output):

`zarr` distributed checkpoint backend is deprecated. Please switch to PyTorch Distributed format (`torch_dist`).
[NeMo W 2024-08-23 17:50:23 nemo_logging:349] /home/tfogal/env/lib/python3.10/site-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
    See https://hydra.cc/docs/1.2/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
      ret = run_job(

[NeMo I 2024-08-23 17:50:23 neva_pretrain:89] 

    ************** Experiment configuration ***********
[NeMo I 2024-08-23 17:50:23 neva_pretrain:90] 
    name: nemo_neva
    restore_from_path: null
    trainer:
      devices: 1
      num_nodes: 1
      accelerator: gpu
      precision: bf16-mixed
      logger: false
      enable_checkpointing: false
      use_distributed_sampler: false
      max_epochs: -1
      max_steps: 20
      log_every_n_steps: 1
      val_check_interval: 10
      check_val_every_n_epoch: null
      limit_val_batches: 5
      limit_test_batches: 500
      accumulate_grad_batches: 1
      gradient_clip_val: 1.0
      benchmark: false
      enable_model_summary: false
    exp_manager:
      explicit_log_dir: null
      exp_dir: ./foo-neva-train
      name: nemo_neva
      create_wandb_logger: false
      wandb_logger_kwargs:
        project: null
        name: null
      resume_if_exists: true
      resume_ignore_no_checkpoint: true
      resume_from_checkpoint: ${model.resume_from_checkpoint}
      create_checkpoint_callback: false
      checkpoint_callback_params:
        monitor: val_loss
        save_top_k: 10
        mode: min
        always_save_nemo: false
        save_nemo_on_train_end: false
        filename: megatron_clip--{val_loss:.2f}-{step}-{consumed_samples}
        model_parallel_size: ${multiply:${model.tensor_model_parallel_size}, ${model.pipeline_model_parallel_size}}
      ema:
        enable: false
        decay: 0.9999
        validate_original_weights: false
        every_n_steps: 1
        cpu_offload: false
      max_time_per_run: 00:00:03:00
    model:
      precision: ${trainer.precision}
      micro_batch_size: 2
      global_batch_size: 4
      tensor_model_parallel_size: 1
      pipeline_model_parallel_size: 1
      virtual_pipeline_model_parallel_size: null
      restore_from_path: null
      mm_cfg:
        llm:
          from_pretrained: null
          freeze: true
          model_type: llama_2
        vision_encoder:
          from_pretrained: openai/clip-vit-large-patch14
          from_hf: true
          patch_dim: 14
          hidden_size: 1024
          vision_select_layer: -2
          class_token_length: 1
          freeze: true
        pretrain_mm_mlp_adapter: null
        mm_mlp_adapter_type: linear
        use_im_start_end: false
      mcore_gpt: false
      encoder_seq_length: 4096
      max_position_embeddings: ${.encoder_seq_length}
      position_embedding_type: rope
      num_layers: 2
      hidden_size: 5120
      ffn_hidden_size: 13824
      num_attention_heads: 40
      init_method_std: 0.014
      use_scaled_init_method: true
      hidden_dropout: 0.0
      attention_dropout: 0.0
      ffn_dropout: 0.0
      kv_channels: null
      apply_query_key_layer_scaling: true
      normalization: rmsnorm
      layernorm_epsilon: 1.0e-05
      do_layer_norm_weight_decay: false
      pre_process: true
      post_process: true
      persist_layer_norm: true
      bias: false
      activation: fast-swiglu
      headscale: false
      transformer_block_type: pre_ln
      normalize_attention_scores: true
      rotary_percentage: 1.0
      attention_type: multihead
      share_embeddings_and_output_weights: false
      overlap_p2p_comm: false
      batch_p2p_comm: true
      seq_len_interpolation_factor: null
      num_query_groups: null
      use_flash_attention: false
      activations_checkpoint_granularity: null
      activations_checkpoint_method: null
      activations_checkpoint_num_layers: null
      num_micro_batches_with_partial_activation_checkpoints: null
      activations_checkpoint_layers_per_pipeline: null
      sequence_parallel: false
      native_amp_init_scale: 4294967296
      native_amp_growth_interval: 1000
      hysteresis: 2
      fp32_residual_connection: false
      fp16_lm_cross_entropy: false
      masked_softmax_fusion: true
      bias_dropout_add_fusion: false
      use_cpu_initialization: false
      onnx_safe: false
      gradient_accumulation_fusion: false
      openai_gelu: false
      bias_activation_fusion: false
      megatron_legacy: false
      transformer_engine: false
      fp8: false
      fp8_e4m3: false
      fp8_hybrid: false
      fp8_margin: 0
      fp8_interval: 1
      fp8_amax_history_len: 1
      fp8_amax_compute_algo: most_recent
      use_emha: false
      megatron_amp_O2: true
      async_grad_allreduce: false
      grad_allreduce_chunk_size_mb: 125
      grad_div_ar_fusion: true
      seed: 1234
      resume_from_checkpoint: null
      apex_transformer_log_level: 30
      gradient_as_bucket_view: true
      tokenizer:
        library: sentencepiece
        type: null
        model: ./data/multimodal/tiny-neva/tokenizer_add_special.model
        vocab_file: null
        merge_file: null
        delimiter: null
        sentencepiece_legacy: false
        additional_special_tokens: null
      data:
        packed_sequence: false
        num_workers: 0
        dataloader_type: cyclic
        data_path: ./data/multimodal/tiny-neva/dummy.json
        lazy_preprocess: true
        is_multimodal: true
        media_type: image
        sep_image_conv_front: false
        image_token_len: 256
        conv_template: llama_2
        image_folder: ./data/multimodal/tiny-neva/images
        image_aspect_ratio: square
      nsys_profile:
        enabled: false
        start_step: 10
        end_step: 10
        ranks:
        - 0
        gen_shape: false
      optim:
        name: fused_adam
        lr: 0.002
        weight_decay: 0.0
        betas:
        - 0.9
        - 0.95
        sched:
          name: CosineAnnealing
          warmup_steps: 140
          constant_steps: 0
          min_lr: 2.0e-05

[NeMo W 2024-08-23 17:50:23 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/_graveyard/precision.py:49: The `MixedPrecisionPlugin` is deprecated. Use `pytorch_lightning.plugins.precision.MixedPrecision` instead.

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
[NeMo W 2024-08-23 17:50:23 exp_manager:773] No version folders would be created under the log folder as 'resume_if_exists' is enabled.
[NeMo W 2024-08-23 17:50:23 exp_manager:630] There were no checkpoints found in checkpoint_dir or no checkpoint folder at checkpoint_dir :foo-neva-train/nemo_neva/checkpoints. Training from scratch.
[NeMo I 2024-08-23 17:50:23 exp_manager:396] Experiments will be logged at foo-neva-train/nemo_neva
[NeMo I 2024-08-23 17:50:23 exp_manager:856] TensorboardLogger has been set up
[NeMo W 2024-08-23 17:50:25 megatron_base_model:1158] The model: MegatronNevaModel() does not have field.name: context_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-23 17:50:25 megatron_base_model:1158] The model: MegatronNevaModel() does not have field.name: expert_model_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-23 17:50:25 megatron_base_model:1158] The model: MegatronNevaModel() does not have field.name: moe_extended_tp in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-23 17:50:25 megatron_base_model:1158] The model: MegatronNevaModel() does not have field.name: finalize_model_grads_func in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-23 17:50:25 megatron_base_model:1158] The model: MegatronNevaModel() does not have field.name: deterministic_mode in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-23 17:50:25 megatron_base_model:1158] The model: MegatronNevaModel() does not have field.name: use_te_rng_tracker in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-23 17:50:25 megatron_base_model:1158] The model: MegatronNevaModel() does not have field.name: tp_comm_bulk_wgrad in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-23 17:50:25 megatron_base_model:1158] The model: MegatronNevaModel() does not have field.name: tp_comm_bulk_dgrad in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-23 17:50:25 megatron_base_model:1158] The model: MegatronNevaModel() does not have field.name: tp_comm_overlap_ag in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-23 17:50:25 megatron_base_model:1158] The model: MegatronNevaModel() does not have field.name: tp_comm_overlap_rs in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-23 17:50:25 megatron_base_model:1158] The model: MegatronNevaModel() does not have field.name: tp_comm_overlap_rs_dgrad in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-23 17:50:25 megatron_base_model:1158] The model: MegatronNevaModel() does not have field.name: tp_comm_split_ag in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-23 17:50:25 megatron_base_model:1158] The model: MegatronNevaModel() does not have field.name: tp_comm_atomic_ag in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-23 17:50:25 megatron_base_model:1158] The model: MegatronNevaModel() does not have field.name: tp_comm_split_rs in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-23 17:50:25 megatron_base_model:1158] The model: MegatronNevaModel() does not have field.name: tp_comm_atomic_rs in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-23 17:50:25 megatron_base_model:1158] The model: MegatronNevaModel() does not have field.name: cross_entropy_loss_fusion in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-23 17:50:25 megatron_base_model:1158] The model: MegatronNevaModel() does not have field.name: tp_comm_overlap_disable_qkv in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-23 17:50:25 megatron_base_model:1158] The model: MegatronNevaModel() does not have field.name: tp_comm_overlap_disable_fc1 in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-23 17:50:25 megatron_base_model:1158] The model: MegatronNevaModel() does not have field.name: defer_embedding_wgrad_compute in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-23 17:50:25 megatron_base_model:1158] The model: MegatronNevaModel() does not have field.name: wgrad_deferral_limit in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-23 17:50:25 megatron_base_model:1158] The model: MegatronNevaModel() does not have field.name: pipeline_model_parallel_split_rank in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-23 17:50:25 megatron_base_model:1158] The model: MegatronNevaModel() does not have field.name: cpu_offloading in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-23 17:50:25 megatron_base_model:1158] The model: MegatronNevaModel() does not have field.name: cpu_offloading_num_layers in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-23 17:50:25 megatron_base_model:1158] The model: MegatronNevaModel() does not have field.name: _cpu_offloading_context in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-23 17:50:25 megatron_base_model:1158] The model: MegatronNevaModel() does not have field.name: cpu_offloading_activations in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-23 17:50:25 megatron_base_model:1158] The model: MegatronNevaModel() does not have field.name: cpu_offloading_weights in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-23 17:50:25 megatron_base_model:1158] The model: MegatronNevaModel() does not have field.name: barrier_with_L1_time in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo I 2024-08-23 17:50:25 megatron_init:263] Rank 0 has data parallel group : [0]
[NeMo I 2024-08-23 17:50:25 megatron_init:269] Rank 0 has combined group of data parallel and context parallel : [0]
[NeMo I 2024-08-23 17:50:25 megatron_init:274] All data parallel group ranks with context parallel combined: [[0]]
[NeMo I 2024-08-23 17:50:25 megatron_init:277] Ranks 0 has data parallel rank: 0
[NeMo I 2024-08-23 17:50:25 megatron_init:285] Rank 0 has context parallel group: [0]
[NeMo I 2024-08-23 17:50:25 megatron_init:288] All context parallel group ranks: [[0]]
[NeMo I 2024-08-23 17:50:25 megatron_init:289] Ranks 0 has context parallel rank: 0
[NeMo I 2024-08-23 17:50:25 megatron_init:296] Rank 0 has model parallel group: [0]
[NeMo I 2024-08-23 17:50:25 megatron_init:297] All model parallel group ranks: [[0]]
[NeMo I 2024-08-23 17:50:25 megatron_init:306] Rank 0 has tensor model parallel group: [0]
[NeMo I 2024-08-23 17:50:25 megatron_init:310] All tensor model parallel group ranks: [[0]]
[NeMo I 2024-08-23 17:50:25 megatron_init:311] Rank 0 has tensor model parallel rank: 0
[NeMo I 2024-08-23 17:50:25 megatron_init:331] Rank 0 has pipeline model parallel group: [0]
[NeMo I 2024-08-23 17:50:25 megatron_init:343] Rank 0 has embedding group: [0]
[NeMo I 2024-08-23 17:50:25 megatron_init:349] All pipeline model parallel group ranks: [[0]]
[NeMo I 2024-08-23 17:50:25 megatron_init:350] Rank 0 has pipeline model parallel rank 0
[NeMo I 2024-08-23 17:50:25 megatron_init:351] All embedding group ranks: [[0]]
[NeMo I 2024-08-23 17:50:25 megatron_init:352] Rank 0 has embedding rank: 0
24-08-23 17:50:26 - PID:2193 - rank:(0, 0, 0, 0) - microbatches.py:39 - INFO - setting number of micro-batches to constant 2
[NeMo W 2024-08-23 17:50:26 megatron_base_model:1158] The model: MegatronNevaModel() does not have field.name: context_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-23 17:50:26 megatron_base_model:1158] The model: MegatronNevaModel() does not have field.name: expert_model_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-23 17:50:26 megatron_base_model:1158] The model: MegatronNevaModel() does not have field.name: moe_extended_tp in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-23 17:50:26 megatron_base_model:1158] The model: MegatronNevaModel() does not have field.name: finalize_model_grads_func in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-23 17:50:26 megatron_base_model:1158] The model: MegatronNevaModel() does not have field.name: deterministic_mode in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-23 17:50:26 megatron_base_model:1158] The model: MegatronNevaModel() does not have field.name: use_te_rng_tracker in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-23 17:50:26 megatron_base_model:1158] The model: MegatronNevaModel() does not have field.name: tp_comm_bulk_wgrad in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-23 17:50:26 megatron_base_model:1158] The model: MegatronNevaModel() does not have field.name: tp_comm_bulk_dgrad in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-23 17:50:26 megatron_base_model:1158] The model: MegatronNevaModel() does not have field.name: tp_comm_overlap_ag in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-23 17:50:26 megatron_base_model:1158] The model: MegatronNevaModel() does not have field.name: tp_comm_overlap_rs in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-23 17:50:26 megatron_base_model:1158] The model: MegatronNevaModel() does not have field.name: tp_comm_overlap_rs_dgrad in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-23 17:50:26 megatron_base_model:1158] The model: MegatronNevaModel() does not have field.name: tp_comm_split_ag in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-23 17:50:26 megatron_base_model:1158] The model: MegatronNevaModel() does not have field.name: tp_comm_atomic_ag in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-23 17:50:26 megatron_base_model:1158] The model: MegatronNevaModel() does not have field.name: tp_comm_split_rs in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-23 17:50:26 megatron_base_model:1158] The model: MegatronNevaModel() does not have field.name: tp_comm_atomic_rs in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-23 17:50:26 megatron_base_model:1158] The model: MegatronNevaModel() does not have field.name: cross_entropy_loss_fusion in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-23 17:50:26 megatron_base_model:1158] The model: MegatronNevaModel() does not have field.name: tp_comm_overlap_disable_qkv in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-23 17:50:26 megatron_base_model:1158] The model: MegatronNevaModel() does not have field.name: tp_comm_overlap_disable_fc1 in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-23 17:50:26 megatron_base_model:1158] The model: MegatronNevaModel() does not have field.name: defer_embedding_wgrad_compute in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-23 17:50:26 megatron_base_model:1158] The model: MegatronNevaModel() does not have field.name: wgrad_deferral_limit in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-23 17:50:26 megatron_base_model:1158] The model: MegatronNevaModel() does not have field.name: pipeline_model_parallel_split_rank in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-23 17:50:26 megatron_base_model:1158] The model: MegatronNevaModel() does not have field.name: cpu_offloading in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-23 17:50:26 megatron_base_model:1158] The model: MegatronNevaModel() does not have field.name: cpu_offloading_num_layers in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-23 17:50:26 megatron_base_model:1158] The model: MegatronNevaModel() does not have field.name: _cpu_offloading_context in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-23 17:50:26 megatron_base_model:1158] The model: MegatronNevaModel() does not have field.name: cpu_offloading_activations in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-23 17:50:26 megatron_base_model:1158] The model: MegatronNevaModel() does not have field.name: cpu_offloading_weights in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-23 17:50:26 megatron_base_model:1158] The model: MegatronNevaModel() does not have field.name: barrier_with_L1_time in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo I 2024-08-23 17:50:26 tokenizer_utils:188] Getting SentencePiece with model: /home/tfogal/dev/nemo/data/multimodal/tiny-neva/tokenizer_add_special.model
[NeMo I 2024-08-23 17:50:26 megatron_base_model:584] Padded vocab_size: 32128, original vocab_size: 32008, dummy tokens: 120.
[NeMo W 2024-08-23 17:50:26 megatron_base_model:1158] The model: MegatronNevaModel() does not have field.name: context_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-23 17:50:26 megatron_base_model:1158] The model: MegatronNevaModel() does not have field.name: expert_model_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-23 17:50:26 megatron_base_model:1158] The model: MegatronNevaModel() does not have field.name: moe_extended_tp in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-23 17:50:26 megatron_base_model:1158] The model: MegatronNevaModel() does not have field.name: finalize_model_grads_func in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-23 17:50:26 megatron_base_model:1158] The model: MegatronNevaModel() does not have field.name: deterministic_mode in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-23 17:50:26 megatron_base_model:1158] The model: MegatronNevaModel() does not have field.name: use_te_rng_tracker in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-23 17:50:26 megatron_base_model:1158] The model: MegatronNevaModel() does not have field.name: tp_comm_bulk_wgrad in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-23 17:50:26 megatron_base_model:1158] The model: MegatronNevaModel() does not have field.name: tp_comm_bulk_dgrad in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-23 17:50:26 megatron_base_model:1158] The model: MegatronNevaModel() does not have field.name: tp_comm_overlap_ag in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-23 17:50:26 megatron_base_model:1158] The model: MegatronNevaModel() does not have field.name: tp_comm_overlap_rs in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-23 17:50:26 megatron_base_model:1158] The model: MegatronNevaModel() does not have field.name: tp_comm_overlap_rs_dgrad in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-23 17:50:26 megatron_base_model:1158] The model: MegatronNevaModel() does not have field.name: tp_comm_split_ag in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-23 17:50:26 megatron_base_model:1158] The model: MegatronNevaModel() does not have field.name: tp_comm_atomic_ag in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-23 17:50:26 megatron_base_model:1158] The model: MegatronNevaModel() does not have field.name: tp_comm_split_rs in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-23 17:50:26 megatron_base_model:1158] The model: MegatronNevaModel() does not have field.name: tp_comm_atomic_rs in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-23 17:50:26 megatron_base_model:1158] The model: MegatronNevaModel() does not have field.name: cross_entropy_loss_fusion in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-23 17:50:26 megatron_base_model:1158] The model: MegatronNevaModel() does not have field.name: tp_comm_overlap_disable_qkv in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-23 17:50:26 megatron_base_model:1158] The model: MegatronNevaModel() does not have field.name: tp_comm_overlap_disable_fc1 in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-23 17:50:26 megatron_base_model:1158] The model: MegatronNevaModel() does not have field.name: defer_embedding_wgrad_compute in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-23 17:50:26 megatron_base_model:1158] The model: MegatronNevaModel() does not have field.name: wgrad_deferral_limit in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-23 17:50:26 megatron_base_model:1158] The model: MegatronNevaModel() does not have field.name: pipeline_model_parallel_split_rank in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-23 17:50:26 megatron_base_model:1158] The model: MegatronNevaModel() does not have field.name: cpu_offloading in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-23 17:50:26 megatron_base_model:1158] The model: MegatronNevaModel() does not have field.name: cpu_offloading_num_layers in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-23 17:50:26 megatron_base_model:1158] The model: MegatronNevaModel() does not have field.name: _cpu_offloading_context in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-23 17:50:26 megatron_base_model:1158] The model: MegatronNevaModel() does not have field.name: cpu_offloading_activations in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-23 17:50:26 megatron_base_model:1158] The model: MegatronNevaModel() does not have field.name: cpu_offloading_weights in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-23 17:50:26 megatron_base_model:1158] The model: MegatronNevaModel() does not have field.name: barrier_with_L1_time in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-23 17:50:26 megatron_base_model:498] apply_query_key_layer_scaling is only enabled when using FP16, setting it to False and setting NVTE_APPLY_QK_LAYER_SCALING=0
[NeMo W 2024-08-23 17:50:26 megatron_base_model:556] The model: MegatronNevaModel() does not have field.name: activation_func_fp8_input_store in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-23 17:50:26 megatron_base_model:556] The model: MegatronNevaModel() does not have field.name: num_moe_experts in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-23 17:50:26 megatron_base_model:556] The model: MegatronNevaModel() does not have field.name: window_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-23 17:50:26 megatron_base_model:556] The model: MegatronNevaModel() does not have field.name: qk_layernorm in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-23 17:50:26 megatron_base_model:556] The model: MegatronNevaModel() does not have field.name: test_mode in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-23 17:50:26 megatron_base_model:556] The model: MegatronNevaModel() does not have field.name: calculate_per_token_loss in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-23 17:50:26 megatron_base_model:556] The model: MegatronNevaModel() does not have field.name: memory_efficient_layer_norm in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-23 17:50:26 megatron_base_model:556] The model: MegatronNevaModel() does not have field.name: fp8_wgrad in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-23 17:50:26 megatron_base_model:556] The model: MegatronNevaModel() does not have field.name: fp8_dot_product_attention in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-23 17:50:26 megatron_base_model:556] The model: MegatronNevaModel() does not have field.name: fp8_multi_head_attention in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-23 17:50:26 megatron_base_model:556] The model: MegatronNevaModel() does not have field.name: moe_router_load_balancing_type in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-23 17:50:26 megatron_base_model:556] The model: MegatronNevaModel() does not have field.name: moe_router_topk in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-23 17:50:26 megatron_base_model:556] The model: MegatronNevaModel() does not have field.name: moe_router_pre_softmax in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-23 17:50:26 megatron_base_model:556] The model: MegatronNevaModel() does not have field.name: moe_grouped_gemm in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-23 17:50:26 megatron_base_model:556] The model: MegatronNevaModel() does not have field.name: moe_aux_loss_coeff in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-23 17:50:26 megatron_base_model:556] The model: MegatronNevaModel() does not have field.name: moe_z_loss_coeff in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-23 17:50:26 megatron_base_model:556] The model: MegatronNevaModel() does not have field.name: moe_input_jitter_eps in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-23 17:50:26 megatron_base_model:556] The model: MegatronNevaModel() does not have field.name: moe_token_dropping in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-23 17:50:26 megatron_base_model:556] The model: MegatronNevaModel() does not have field.name: moe_token_dispatcher_type in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-23 17:50:26 megatron_base_model:556] The model: MegatronNevaModel() does not have field.name: moe_per_layer_logging in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-23 17:50:26 megatron_base_model:556] The model: MegatronNevaModel() does not have field.name: moe_expert_capacity_factor in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-23 17:50:26 megatron_base_model:556] The model: MegatronNevaModel() does not have field.name: moe_pad_expert_input_to_capacity in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-23 17:50:26 megatron_base_model:556] The model: MegatronNevaModel() does not have field.name: moe_token_drop_policy in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-23 17:50:26 megatron_base_model:556] The model: MegatronNevaModel() does not have field.name: moe_layer_recompute in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-23 17:50:26 megatron_base_model:556] The model: MegatronNevaModel() does not have field.name: clone_scatter_output_in_embedding in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-23 17:50:26 megatron_base_model:556] The model: MegatronNevaModel() does not have field.name: disable_parameter_transpose_cache in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-23 17:50:26 megatron_base_model:556] The model: MegatronNevaModel() does not have field.name: enable_cuda_graph in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-23 17:50:26 megatron_base_model:556] The model: MegatronNevaModel() does not have field.name: config_logger_dir in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-23 17:50:26 megatron_gpt_model:327] megatron_amp_O2 is enabled but transformer-engine is not.
[NeMo W 2024-08-23 17:50:26 deprecated:94] 

    ************************************************************************
    ************************************************************************
    *****  GPTModel is deprecated. Please, use McoreGPTModel instead.  *****
    ************************************************************************
    ************************************************************************

[NeMo W 2024-08-23 17:50:26 deprecated:95] Waiting for 2 seconds before this message disappears.
[NeMo W 2024-08-23 17:50:28 nemo_logging:349] /home/tfogal/env/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
      warnings.warn(

[NeMo I 2024-08-23 17:50:29 neva_model:604] Neva model initialized with 0 trainable parameters
[NeMo W 2024-08-23 17:50:29 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/configuration_validator.py:161: You have overridden `MegatronNevaModel.configure_sharded_model` which is deprecated. Please override the `configure_model` hook instead. Instantiation with the newer hook will be created on the device right away and have the right data type depending on the precision setting in the Trainer.

[NeMo W 2024-08-23 17:50:29 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/configuration_validator.py:143: You are using the `dataloader_iter` step flavor. If you consume the iterator more than once per step, the `batch_idx` argument in any hook that takes it will not match with the batch index of the last batch consumed. This might have unforeseen effects on callbacks or code that expects to get the correct index. This will also not work well with gradient accumulation. This feature is very experimental and subject to change. Here be dragons.

Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 1 processes
----------------------------------------------------------------------------------------------------

[NeMo I 2024-08-23 17:50:29 neva_model:949] Pipeline model parallel rank: 0, Tensor model parallel rank: 0, Number of model parameters on device: 1.27e+09. Total number of model parameters: 1.27e+09.
[NeMo I 2024-08-23 17:50:29 neva_model:1013] Building Neva datasets.
Formatting inputs...Skip in lazy mode
[NeMo I 2024-08-23 17:50:30 megatron_gpt_model:1631] Setting up train dataloader with len(len(self._train_ds)): 60 and consumed samples: 0
[NeMo I 2024-08-23 17:50:30 neva_model:1030] Building dataloader with consumed samples: 0
[NeMo I 2024-08-23 17:50:30 data_samplers:76] Instantiating MegatronPretrainingSampler with total_samples: 60 and consumed_samples: 0
[NeMo I 2024-08-23 17:50:30 megatron_gpt_model:1639] Setting up validation dataloader with len(len(self._validation_ds)): 60 and consumed samples: 0
[NeMo I 2024-08-23 17:50:30 neva_model:1030] Building dataloader with consumed samples: 0
[NeMo I 2024-08-23 17:50:30 data_samplers:76] Instantiating MegatronPretrainingSampler with total_samples: 60 and consumed_samples: 0
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
[NeMo I 2024-08-23 17:50:30 modelPT:770] Optimizer config = FusedAdam (
    Parameter Group 0
        betas: [0.9, 0.95]
        bias_correction: True
        eps: 1e-08
        is_expert: False
        lr: 0.002
        weight_decay: 0.0

    Parameter Group 1
        betas: [0.9, 0.95]
        bias_correction: True
        eps: 1e-08
        is_expert: False
        lr: 0.002
        weight_decay: 0.0
    )
[NeMo I 2024-08-23 17:50:30 lr_scheduler:923] Scheduler "<nemo.core.optim.lr_scheduler.CosineAnnealing object at 0x7fb677baa560>" 
    will be used during training (effective maximum steps = 20) - 
    Parameters : 
    (warmup_steps: 140
    constant_steps: 0
    min_lr: 2.0e-05
    max_steps: 20
    )
[NeMo I 2024-08-23 17:50:30 lr_scheduler:923] Scheduler "<nemo.core.optim.lr_scheduler.CosineAnnealing object at 0x7fb677ba98d0>" 
    will be used during training (effective maximum steps = 20) - 
    Parameters : 
    (warmup_steps: 140
    constant_steps: 0
    min_lr: 2.0e-05
    max_steps: 20
    )

Sanity Checking: |          | 0/? [00:00<?, ?it/s][NeMo W 2024-08-23 17:50:30 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/data_connector.py:424: The 'val_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=31` in the `DataLoader` to improve performance.

[NeMo W 2024-08-23 17:50:30 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/utilities.py:148: Found `dataloader_iter` argument in the `validation_step`. Note that the support for this signature is experimental and the behavior is subject to change.

Sanity Checking:   0%|          | 0/2 [00:00<?, ?it/s]
Sanity Checking DataLoader 0:   0%|          | 0/2 [00:00<?, ?it/s]thunder_graphs=4 / num_graphs=4
Error executing job with overrides: ['trainer.precision=bf16-mixed', 'model.megatron_amp_O2=True', 'model.mcore_gpt=False', 'trainer.num_nodes=1', 'trainer.devices=1', 'trainer.val_check_interval=10', 'trainer.limit_val_batches=5', 'trainer.log_every_n_steps=1', '++exp_manager.max_time_per_run=00:00:03:00', 'trainer.max_steps=20', 'model.micro_batch_size=2', 'model.global_batch_size=4', 'model.tensor_model_parallel_size=1', 'model.pipeline_model_parallel_size=1', 'exp_manager.create_checkpoint_callback=False', 'model.data.data_path=./data/multimodal/tiny-neva/dummy.json', 'model.data.image_folder=./data/multimodal/tiny-neva/images', 'model.tokenizer.library=sentencepiece', 'model.tokenizer.model=./data/multimodal/tiny-neva/tokenizer_add_special.model', 'model.num_layers=2', 'model.hidden_size=5120', 'model.ffn_hidden_size=13824', 'model.num_attention_heads=40', 'model.normalization=rmsnorm', 'model.data.num_workers=0', 'model.data.conv_template=llama_2', 'model.mm_cfg.vision_encoder.from_pretrained=openai/clip-vit-large-patch14', 'model.mm_cfg.llm.from_pretrained=null', 'model.use_flash_attention=false', 'exp_manager.exp_dir=./foo-neva-train']
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/tfogal/dev/nemo/./examples/multimodal/multimodal_llm/neva/neva_pretrain.py", line 118, in <module>
[rank0]:     main()
[rank0]:   File "/home/tfogal/dev/nemo/nemo/core/config/hydra_runner.py", line 129, in wrapper
[rank0]:     _run_hydra(
[rank0]:   File "/home/tfogal/env/lib/python3.10/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra
[rank0]:     _run_app(
[rank0]:   File "/home/tfogal/env/lib/python3.10/site-packages/hydra/_internal/utils.py", line 457, in _run_app
[rank0]:     run_and_report(
[rank0]:   File "/home/tfogal/env/lib/python3.10/site-packages/hydra/_internal/utils.py", line 223, in run_and_report
[rank0]:     raise ex
[rank0]:   File "/home/tfogal/env/lib/python3.10/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
[rank0]:     return func()
[rank0]:   File "/home/tfogal/env/lib/python3.10/site-packages/hydra/_internal/utils.py", line 458, in <lambda>
[rank0]:     lambda: hydra.run(
[rank0]:   File "/home/tfogal/env/lib/python3.10/site-packages/hydra/_internal/hydra.py", line 132, in run
[rank0]:     _ = ret.return_value
[rank0]:   File "/home/tfogal/env/lib/python3.10/site-packages/hydra/core/utils.py", line 260, in return_value
[rank0]:     raise self._return_value
[rank0]:   File "/home/tfogal/env/lib/python3.10/site-packages/hydra/core/utils.py", line 186, in run_job
[rank0]:     ret.return_value = task_function(task_cfg)
[rank0]:   File "/home/tfogal/dev/nemo/./examples/multimodal/multimodal_llm/neva/neva_pretrain.py", line 109, in main
[rank0]:     trainer.fit(model)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 538, in fit
[rank0]:     call._call_and_handle_interrupt(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/call.py", line 46, in _call_and_handle_interrupt
[rank0]:     return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 105, in launch
[rank0]:     return function(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 574, in _fit_impl
[rank0]:     self._run(model, ckpt_path=ckpt_path)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 981, in _run
[rank0]:     results = self._run_stage()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 1023, in _run_stage
[rank0]:     self._run_sanity_check()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 1052, in _run_sanity_check
[rank0]:     val_loop.run()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/utilities.py", line 178, in _decorator
[rank0]:     return loop_run(self, *args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/evaluation_loop.py", line 135, in run
[rank0]:     self._evaluation_step(batch, batch_idx, dataloader_idx, dataloader_iter)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/evaluation_loop.py", line 396, in _evaluation_step
[rank0]:     output = call._call_strategy_hook(trainer, hook_name, *step_args)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/call.py", line 319, in _call_strategy_hook
[rank0]:     output = fn(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/strategies/strategy.py", line 411, in validation_step
[rank0]:     return self.lightning_module.validation_step(*args, **kwargs)
[rank0]:   File "/home/tfogal/dev/nemo/nemo/collections/multimodal/models/multimodal_llm/neva/neva_model.py", line 897, in validation_step
[rank0]:     return MegatronGPTModel.validation_step(self, dataloader_iter)
[rank0]:   File "/home/tfogal/dev/nemo/nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py", line 1370, in validation_step
[rank0]:     loss = self.fwd_bwd_step(dataloader_iter, True, first_val_step)
[rank0]:   File "/home/tfogal/dev/nemo/nemo/collections/multimodal/models/multimodal_llm/neva/neva_model.py", line 665, in fwd_bwd_step
[rank0]:     return MegatronGPTModel.fwd_bwd_step(self, dataloader_iter, forward_only, first_val_step)
[rank0]:   File "/home/tfogal/dev/nemo/nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py", line 684, in fwd_bwd_step
[rank0]:     losses_reduced_per_micro_batch = fwd_bwd_function(
[rank0]:   File "/home/tfogal/env/lib/python3.10/site-packages/megatron/core/pipeline_parallel/schedules.py", line 435, in forward_backward_no_pipelining
[rank0]:     output_tensor, num_tokens = forward_step(
[rank0]:   File "/home/tfogal/env/lib/python3.10/site-packages/megatron/core/pipeline_parallel/schedules.py", line 259, in forward_step
[rank0]:     output_tensor, loss_func = forward_step_func(data_iterator, model)
[rank0]:   File "/home/tfogal/dev/nemo/nemo/collections/multimodal/models/multimodal_llm/neva/neva_model.py", line 832, in fwd_output_and_loss_func
[rank0]:     output_tensor = model(**forward_args)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/eval_frame.py", line 469, in _fn
[rank0]:     return fn(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/home/tfogal/dev/nemo/nemo/collections/nlp/modules/common/megatron/module.py", line 292, in forward
[rank0]:     outputs = self.module(*inputs, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/home/tfogal/dev/nemo/nemo/collections/multimodal/models/multimodal_llm/neva/neva_model.py", line 466, in forward
[rank0]:     torch.cuda.nvtx.range_push("neva fwd")
[rank0]:   File "/home/tfogal/dev/nemo/nemo/collections/multimodal/models/multimodal_llm/neva/neva_model.py", line 470, in torch_dynamo_resume_in_forward_at_466
[rank0]:     result = GPTModel.forward(self, *args, **kwargs)
[rank0]:   File "/home/tfogal/dev/nemo/nemo/collections/nlp/models/language_modeling/megatron/gpt_model.py", line 286, in forward
[rank0]:     lm_output = self.language_model(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/home/tfogal/dev/nemo/nemo/collections/nlp/modules/common/megatron/language_model.py", line 764, in forward
[rank0]:     encoder_input = self.embedding(enc_input_ids, enc_position_ids, token_type_ids=token_type_ids)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/home/tfogal/dev/nemo/nemo/collections/nlp/modules/common/megatron/language_model.py", line 348, in forward
[rank0]:     words_embeddings = self.word_embeddings(input_ids)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/home/tfogal/dev/nemo/nemo/collections/multimodal/models/multimodal_llm/neva/neva_model.py", line 155, in forward
[rank0]:     return self.replace_media_embeddings(input_ids, words_embeddings, media)
[rank0]:   File "/home/tfogal/dev/nemo/nemo/collections/multimodal/models/multimodal_llm/neva/neva_model.py", line 195, in replace_media_embeddings
[rank0]:     media_features = self.encode_vision_x(media)  # b T F S(eq) H(idden)
[rank0]:   File "/home/tfogal/dev/nemo/nemo/collections/multimodal/models/multimodal_llm/neva/neva_model.py", line 195, in torch_dynamo_resume_in_replace_media_embeddings_at_195
[rank0]:     media_features = self.encode_vision_x(media)  # b T F S(eq) H(idden)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/eval_frame.py", line 636, in _fn
[rank0]:     return fn(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/home/tfogal/dev/thunder/thunder/core/module.py", line 80, in forward
[rank0]:     res = self._forward_fn(*args, **kwargs)
[rank0]:   File "/home/tfogal/dev/thunder/thunder/__init__.py", line 744, in fn_
[rank0]:     cache_entry, inps, pro_to_epi = get_computation_and_inputs(*args, **kwargs)
[rank0]:   File "/home/tfogal/dev/thunder/thunder/core/langctxs.py", line 136, in _fn
[rank0]:     result = fn(*args, **kwargs)
[rank0]:   File "/home/tfogal/dev/thunder/thunder/__init__.py", line 229, in cache_info_wrapper
[rank0]:     res = fn(*args, **kwargs)
[rank0]:   File "/home/tfogal/dev/thunder/thunder/__init__.py", line 675, in get_computation_and_inputs
[rank0]:     extraces = transform_for_execution(
[rank0]:   File "/home/tfogal/dev/thunder/thunder/common.py", line 623, in transform_for_execution
[rank0]:     extrace = executors.passes.transform_for_execution(dce_trace, executors_list)
[rank0]:   File "/home/tfogal/dev/thunder/thunder/executors/passes.py", line 160, in transform_for_execution
[rank0]:     extrace = ex.fusion_pass(extrace)
[rank0]:   File "/home/tfogal/dev/thunder/thunder/executors/torch_compile.py", line 177, in fusion_pass
[rank0]:     fusion_bsym: BoundSymbol = self.fuse(region, fusion_counter)
[rank0]:   File "/home/tfogal/dev/thunder/thunder/executors/torch_compile.py", line 132, in fuse
[rank0]:     compiled: Callable = make_compiled(region.bound_symbols, sorted_unique_inputs, sorted_unique_outputs)
[rank0]:   File "/home/tfogal/dev/thunder/thunder/executors/torch_compile.py", line 86, in make_compiled
[rank0]:     torch_trace = trace(inline_trace=False)(torch_interpreted_func, *sorted_unique_inputs)
[rank0]:   File "/home/tfogal/dev/thunder/thunder/core/interpreter.py", line 1317, in fn_
[rank0]:     return fn(*args, **kwargs)
[rank0]:   File "/home/tfogal/dev/thunder/thunder/common.py", line 564, in _trace
[rank0]:     result = fn(*proxyargs, **proxykwargs)
[rank0]:   File "/home/tfogal/dev/thunder/thunder/executors/torch_compile.py", line 67, in torch_interpreted_func
[rank0]:     return eval_trace(region_trace, *args, symbol_mapper=to_torch_translator)
[rank0]:   File "/home/tfogal/dev/thunder/thunder/core/transforms.py", line 1512, in eval_trace
[rank0]:     result = prim_func(*args, **kwargs)
[rank0]:   File "/home/tfogal/dev/thunder/thunder/executors/torch_compile.py", line 45, in _to_torch
[rank0]:     torch_op = torchex.opmap[bsym.sym.name]
[rank0]: KeyError: 'type'

To Reproduce

See bug #343 for NeVA setup information.

Code sample

Workaround

There's a workaround: simply drop the torch.compile executor from the executor list:

      # Note: the torchcompile executor is intentionally omitted from this list.
      execs: list[thunder.extend.Executor] = [
        thunder.extend.get_executor("cudnn"),
        thunder.extend.get_executor("sdpa"),
        thunder.extend.get_executor("nvfuser"),
        thunder.extend.get_executor("torch"),
      ]
      # gm: the FX GraphModule from the dynamo backend (defined elsewhere in the integration)
      fqn = thunder.jit(gm, executors=execs)
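
With the torchcompile executor out of the list, its fusion_pass (and therefore the direct torchex.opmap lookup in make_compiled seen in the traceback above) never runs, so the offending ltorch.type symbol is handled by the regular torch executor instead.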

Environment

Additional context

cc @apaz-cli @tfogal

t-vi commented 3 weeks ago

More minimal repro:

import torch
import thunder
import thunder.executors.torch_compile

def fn(a):
    return a.type('torch.DoubleTensor')

a = torch.randn(2, 2)
jfn = thunder.jit(fn, executors=(thunder.executors.torch_compile.torch_compile_ex,))
jfn(a)

The trace before transform for execution is

# Constructed by Dead Code Elimination (took 0 milliseconds)
import thunder
import thunder.torch as ltorch
import torch
from thunder.executors.torchex import no_autocast

@torch.no_grad()
@no_autocast
def computation(a):
  # a: "cpu f32[2, 2]"

  # <ipython-input-18-c7e8b0e10321>:2:      return a.type('torch.DoubleTensor')
  t0 = ltorch.type(a, 'torch.DoubleTensor', False)  # t0: "cpu f64[2, 2]"
    # t0 = ltorch.to(a, devices.Device("cpu"), torch.float64, device=None, dtype=None, copy=False, memory_format=None)  # t0: "cpu f64[2, 2]"
      # t0 = prims.convert_element_type(a, dtypes.float64)  # t0: "cpu f64[2, 2]"
  return t0

but the torchex executor knows to resolve type via its ltorch.to subsymbol, whereas torch_compile_ex tries to use the type symbol directly. After transform for execution with torchex:

# Constructed by Transform for execution (took 0 milliseconds)
from torch import Tensor
import torch
from thunder.executors.torchex import no_autocast

@torch.no_grad()
@no_autocast
def computation(a):
  # a: "cpu f32[2, 2]"
  t0 = Tensor.to(a, copy=False, device=torch.device("cpu"), dtype=torch.float64)  # t0: "cpu f64[2, 2]"
    # t0 = ltorch.to(a, None, None, device=torch.device("cpu"), dtype=torch.float64, copy=False, memory_format=None)  # t0: "cpu f64[2, 2]"
      # t0 = prims.convert_element_type(a, dtypes.float64)  # t0: "cpu f64[2, 2]"

  # <ipython-input-18-c7e8b0e10321>:2:      return a.type('torch.DoubleTensor')
  return t0
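
In other words (a pseudocode-level sketch of the difference, not the actual thunder code; the helper name resolve_torch_op is made up for illustration): torch_compile.py's to_torch_translator maps every bound symbol straight through torchex.opmap, while the regular torch executor can fall back to a symbol's decomposition when there is no direct mapping.

def resolve_torch_op(bsym, opmap):
    # A direct mapping exists for e.g. 'to', but not for 'type'.
    if bsym.sym.name in opmap:
        return opmap[bsym.sym.name]
    # Fall back to the decomposition: ltorch.type -> ltorch.to -> prims.convert_element_type
    for sub in bsym.subsymbols:
        op = resolve_torch_op(sub, opmap)
        if op is not None:
            return op
    return None  # torch_compile.py instead does opmap[bsym.sym.name] and hits KeyError: 'type'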

So back in the day, @IvanYashchuk (?) added a comment:

https://github.com/Lightning-AI/lightning-thunder/blob/7c1f94ab64e0103ced14696da8fb1652e830ab6d/thunder/executors/torch_compile.py#L69-L85

Now I seem to get good results by just calling transform_for_execution with only torchex as the executor: PR #1041 does seem to work.
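
For reference, a rough sketch of that idea (not the actual PR #1041 diff; the wrapper name lower_region_with_torchex is hypothetical): lower the fusion region with the plain torch executor first, so ltorch.type is resolved through its subsymbols before torch.compile ever sees it.

import thunder
from thunder.executors.passes import transform_for_execution

torch_ex = thunder.extend.get_executor("torch")

def lower_region_with_torchex(region_trace):
    # Resolves e.g. ltorch.type via its ltorch.to / prims.convert_element_type
    # subsymbols instead of a direct torchex.opmap[...] lookup.
    return transform_for_execution(region_trace, [torch_ex])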