TorchMoE / MoE-Infinity

PyTorch library for cost-effective, fast and easy serving of MoE models.
Apache License 2.0
88 stars 5 forks

"CUDA extension not installed" error while running readme_example.py #24

Closed Msiavashi closed 2 months ago

Msiavashi commented 2 months ago

Hi. While trying to run readme_example.py on an A100 80GB, I get the following error after waiting for around 10 minutes:


CUDA_VISIBLE_DEVICES=1 python readme_example.py 
Do not detect pre-installed ops, use JIT mode
/home/siavashi/mohamamd/tests/MoE-Infinity/venv/lib/python3.9/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
Fetching 10 files: 100%|███████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 5257.34it/s]
Using /home/siavashi/.cache/torch_extensions/py39_cu121 as PyTorch extensions root...
Emitting ninja build file /home/siavashi/.cache/torch_extensions/py39_cu121/prefetch/build.ninja...
Building extension module prefetch...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module prefetch...
Time to load prefetch op: 2.6730473041534424 seconds
SPDLOG_LEVEL : (null)
2024-06-11 14:24:16.019 INFO Create ArcherAioThread for thread: , 0
2024-06-11 14:24:16.019 INFO Index file, ./moe-infinity/archer_index,  does not exist, creating
2024-06-11 14:24:16.019 INFO Index file size , 0
2024-06-11 14:24:16.020 INFO Device count , 1
2024-06-11 14:24:16.020 INFO Enabled peer access for all devices
Creating model from scratch ...
Loading checkpoint files: 100%|███████████████████████████████████████████████████████████████████████████████| 1/1 [06:20<00:00, 380.08s/it]
Model create:  19%|███████████████▋                                                                     | 847/4578 [00:00<00:01, 2952.43it/s]CUDA extension not installed.
CUDA extension not installed.
/home/siavashi/mohamamd/tests/MoE-Infinity/venv/lib/python3.9/site-packages/transformers/modeling_utils.py:4481: FutureWarning: `_is_quantized_training_enabled` is going to be deprecated in transformers 4.39.0. Please use `model.hf_quantizer.is_trainable` instead
  warnings.warn(
MixtralConfig {
  "_name_or_path": "TheBloke/Mixtral-8x7B-v0.1-GPTQ",
  "architectures": [
    "MixtralForCausalLM"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "max_position_embeddings": 32768,
  "model_type": "mixtral",
  "num_attention_heads": 32,
  "num_experts_per_tok": 2,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "num_local_experts": 8,
  "output_router_logits": false,
  "pad_token_id": 0,
  "pretraining_tp": 1,
  "quantization_config": {
    "bits": 4,
    "damp_percent": 0.1,
    "desc_act": true,
    "disable_exllama": true,
    "group_size": -1,
    "model_file_base_name": "model",
    "model_name_or_path": null,
    "modules_in_block_to_quantize": [
      [
        "self_attn.k_proj",
        "self_attn.v_proj",
        "self_attn.q_proj"
      ],
      [
        "self_attn.o_proj"
      ],
      [
        "block_sparse_moe.experts.0.w1",
        "block_sparse_moe.experts.0.w2",
        "block_sparse_moe.experts.0.w3"
      ],
      [
        "block_sparse_moe.experts.1.w1",
        "block_sparse_moe.experts.1.w2",
        "block_sparse_moe.experts.1.w3"
      ],
      [
        "block_sparse_moe.experts.2.w1",
        "block_sparse_moe.experts.2.w2",
        "block_sparse_moe.experts.2.w3"
      ],
      [
        "block_sparse_moe.experts.3.w1",
        "block_sparse_moe.experts.3.w2",
        "block_sparse_moe.experts.3.w3"
      ],
      [
        "block_sparse_moe.experts.4.w1",
        "block_sparse_moe.experts.4.w2",
        "block_sparse_moe.experts.4.w3"
      ],
      [
        "block_sparse_moe.experts.5.w1",
        "block_sparse_moe.experts.5.w2",
        "block_sparse_moe.experts.5.w3"
      ],
      [
        "block_sparse_moe.experts.6.w1",
        "block_sparse_moe.experts.6.w2",
        "block_sparse_moe.experts.6.w3"
      ],
      [
        "block_sparse_moe.experts.7.w1",
        "block_sparse_moe.experts.7.w2",
        "block_sparse_moe.experts.7.w3"
      ]
    ],
    "quant_method": "gptq",
    "sym": true,
    "true_sequential": true,
    "use_exllama": false
  },
  "rms_norm_eps": 1e-05,
  "rope_theta": 1000000.0,
  "router_aux_loss_coef": 0.02,
  "router_jitter_noise": 0.0,
  "sliding_window": 4096,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.41.2",
  "use_cache": true,
  "vocab_size": 32000
}

ArcherTaskPool destructor

CUDA 12.4 is installed and added to the path correctly, and CUDA_HOME is set. I do not see this issue when running with Hugging Face Transformers.
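
For reference, a minimal way to sanity-check the CUDA toolchain that both the JIT extension build and the GPTQ kernels depend on (standard PyTorch utilities only; this is a sketch, not part of readme_example.py):

import os
import torch
from torch.utils.cpp_extension import CUDA_HOME

# PyTorch build, the CUDA version it was compiled against, and the visible GPU
print(torch.__version__, torch.version.cuda)
print(torch.cuda.is_available(), torch.cuda.get_device_name(0))

# CUDA_HOME as resolved by the extension builder vs. the environment variable
print(CUDA_HOME, os.environ.get("CUDA_HOME"))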

Any idea what the problem might be?

Msiavashi commented 2 months ago

It works fine with mistralai/Mixtral-8x7B-Instruct-v0.1; however, it is incredibly slow in generating even 10 tokens, taking over 40 minutes. Both DRAM and GPU memory usage increase at a very slow rate.

The model creation gets stuck at 94% and remains there for over 40 minutes until it finishes.

Model create:  94%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▎            | 930/994 [00:15<00:00, 2517.89it/s]
drunkcoding commented 2 months ago

It works fine with mistralai/Mixtral-8x7B-Instruct-v0.1; however, it is incredibly slow in generating even 10 tokens, taking over 40 minutes. Both DRAM and GPU memory usage increase at a very slow rate.

The model creation gets stuck at 94% and remains there for over 40 minutes until it finishes.

Model create:  94%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▎            | 930/994 [00:15<00:00, 2517.89it/s]

We have a bug with the progress bar and are working on fixing the confusion. The first generation is slow since the parameters need to be read from disk initially; all caches need to warm up first.

Msiavashi commented 2 months ago

@drunkcoding I just wanted to follow up and let you know that I've observed a consistent delay in multiple runs. The initial run does take longer, but subsequent runs also experience significant execution time. As an example, here are the timing results for the third run of 'TheBloke/Mixtral-8x7B-v0.1-GPTQ':

real    8m13.200s
user    42m47.586s
sys     9m18.702s

Considering the setup of A100 80GB + 256GB of DRAM, is it normal to observe these timings based on your own experiments?

drunkcoding commented 2 months ago

@drunkcoding I just wanted to follow up and let you know that I've observed a consistent delay in multiple runs. The initial run does take longer, but subsequent runs also experience significant execution time. As an example, here are the timing results for the third run of 'TheBloke/Mixtral-8x7B-v0.1-GPTQ':

real    8m13.200s
user    42m47.586s
sys     9m18.702s

Considering the setup of A100 80GB + 256GB of DRAM, is it normal to observe these timings based on your own experiments?

To keep us on the same page, "multiple runs" does not mean running the script multiple times; it means feeding more inputs to the model while the framework is running. After the second input, it is very likely that you will observe the latency drop:

for input in inputs:
    model.generate(input)
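
As a rough sketch of what that loop looks like with timing added (assuming `model` and `tokenizer` are set up as in readme_example.py, `prompts` is a list of input strings, and the exact generate arguments may differ):

import time

# Keep the framework running and feed prompts one after another. The first
# call pays the cost of reading expert parameters from disk and warming the
# caches; later calls should show the latency drop.
for i, text in enumerate(prompts):
    input_ids = tokenizer(text, return_tensors="pt").input_ids.to("cuda")
    start = time.time()
    model.generate(input_ids, max_new_tokens=10)
    print(f"prompt {i}: {time.time() - start:.1f} s")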