OpenNMT / OpenNMT-py

Open Source Neural Machine Translation and (Large) Language Models in PyTorch
https://opennmt.net/
MIT License

Device side assert triggered on AWQ Mistral converted model #2562

Closed — kdcyberdude closed this issue 4 months ago

kdcyberdude commented 4 months ago

I converted the TheBloke/Starling-LM-7B-alpha-AWQ model with the following command:

python tools/convert_HF.py --model_dir TheBloke/Starling-LM-7B-alpha-AWQ --output ./Starling-LM-7B-alpha-AWQ-onmt/ --format pytorch --nshards 1

I am not able to run inference on the converted model; it fails with the error below. The command I am using is:

python translate.py --config ./Starling-LM-7B-alpha-AWQ-onmt/inference.yaml --src ./input_prompt.txt --output ./output.txt

input_prompt.txt content: GPT-4 User: How do you manage stress?<|end_of_turn|>GPT4 Assistant:

Traceback (most recent call last):
  File "/mnt/sea/c2/OpenNMT-py/translate.py", line 6, in <module>
    main()
  File "/mnt/sea/c2/OpenNMT-py/onmt/bin/translate.py", line 47, in main
    translate(opt)
  File "/mnt/sea/c2/OpenNMT-py/onmt/bin/translate.py", line 22, in translate
    _, _ = engine.infer_file()
  File "/mnt/sea/c2/OpenNMT-py/onmt/inference_engine.py", line 35, in infer_file
    scores, preds = self._translate(infer_iter)
  File "/mnt/sea/c2/OpenNMT-py/onmt/inference_engine.py", line 159, in _translate
    scores, preds = self.translator._translate(
  File "/mnt/sea/c2/OpenNMT-py/onmt/translate/translator.py", line 496, in _translate
    batch_data = self.translate_batch(batch, attn_debug)
  File "/mnt/sea/c2/OpenNMT-py/onmt/translate/translator.py", line 1067, in translate_batch
    return self._translate_batch_with_strategy(batch, decode_strategy)
  File "/mnt/sea/c2/OpenNMT-py/onmt/translate/translator.py", line 1149, in _translate_batch_with_strategy
    decode_strategy.advance(log_probs, attn)
  File "/mnt/sea/c2/OpenNMT-py/onmt/translate/beam_search.py", line 432, in advance
    super(BeamSearchLM, self).advance(log_probs, attn)
  File "/mnt/sea/c2/OpenNMT-py/onmt/translate/beam_search.py", line 379, in advance
    self.is_finished_list = self.topk_ids.eq(self.eos).tolist()
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

../aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [228,0,0], thread: [96,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [228,0,0], thread: [97,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [228,0,0], thread: [98,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [228,0,0], thread: [99,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
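For context, the `srcIndex < srcSelectDimSize` assertion comes from an index-select/embedding lookup kernel: some index fed to the GPU is outside the table's first dimension, typically a token id greater than or equal to the checkpoint's vocab size (for example, a special token the conversion did not carry over). A minimal CPU-side sanity check along those lines, with hypothetical sizes and ids (not taken from this model):

```python
# Sketch of the failure mode behind `srcIndex < srcSelectDimSize`:
# an embedding lookup asserts when a token id is outside [0, vocab_size).
# The numbers below are illustrative, not read from the converted checkpoint.

vocab_size = 32000            # rows in the embedding matrix (assumed)
token_ids = [1, 523, 32001]   # ids produced by the tokenizer (example)

# Any id outside the valid range would trip the CUDA-side assertion.
bad = [t for t in token_ids if t < 0 or t >= vocab_size]
if bad:
    print(f"out-of-range token ids: {bad} (vocab_size={vocab_size})")
```

Comparing the tokenizer's actual ids for the prompt against the converted model's embedding size on CPU would confirm or rule out this cause.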

One more question: I am not able to understand the example prompts provided for the Mistral model, in particular tokens like ⦅newline⦆. I'd appreciate an explanation or a documentation link.

vince62s commented 4 months ago

Maybe use the forum instead, and give more details, like the yaml content: https://forum.opennmt.net/latest

kdcyberdude commented 4 months ago

My inference.yaml config file content:

transforms: [sentencepiece]

src_subword_model: "Starling-LM-7B-alpha-AWQ-onmt/tokenizer.model"
tgt_subword_model: "Starling-LM-7B-alpha-AWQ-onmt/tokenizer.model"

model: "Starling-LM-7B-alpha-AWQ-onmt/Starling-LM-7B-alpha-AWQ-onmt.pt"

seed: 13
max_length: 256
gpu: 0
batch_type: sents
batch_size: 60
world_size: 1
gpu_ranks: [0]

precision: fp16
beam_size: 1
n_best: 1
profile: false
report_time: true
src: None

Added the topic to the forum as well - https://forum.opennmt.net/t/device-side-assert-triggered-on-awq-mistral-converted-model/5656