HandH1998 / QQQ

QQQ is an innovative and hardware-optimized W4A8 quantization solution for LLMs.
https://arxiv.org/pdf/2406.09904

[QST] Scale factors and benchmarks #2

Closed: jeromeku closed this issue 4 months ago

jeromeku commented 5 months ago

Great paper, and thanks for open-sourcing the code.

A couple of questions:

  1. Is the benchmarking code from Section 4 of the paper available (GEMM, FastFP16toInt8)?
  2. In the per-group W4A8 kernel, why is an additional channel-wise scale factor needed in FusedDequantQuant? That is, the INT4 weights are dequantized to FP16 using group-wise scale factors, then quantized to INT8 using an additional channel-wise scale before being fed to the INT8 GEMM. In contrast, in the channel-wise W4A8 kernel, the INT4 weights are converted directly to INT8 and fed to the INT8 GEMM.

HandH1998 commented 5 months ago

@jeromeku Reply to your questions:

  1. For the w4a8 GEMM benchmark, you can try bench_w4a8.py in my repo https://github.com/HandH1998/marlin/tree/w4a8. For the FastFP16toInt8 benchmark, I provide an old version of the GEMM code in this gist: https://gist.github.com/HandH1998/b96922e0a0ab7da769fd93e34ffb068a, which is the baseline using the traditional instructions for converting FP16 to INT8. You can put it in https://github.com/HandH1998/marlin/tree/w4a8 and run the benchmark.
  2. As there are multiple per-group scales within one channel of the weight, which are not directly compatible with a standard GEMM, we have to convert INT4 to FP16 and then to INT8. For a per-channel scale, we can factor the scale out of the GEMM as s_a * A * W * s_w, so there is no need for the complicated conversion required in the per-group case (see the sketch below).
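
To make the two paths concrete, here is a minimal PyTorch sketch of the logic described above. The function names, tensor shapes, and group handling are illustrative assumptions, not the actual kernel implementation:

    import torch

    def w4_to_int8_per_channel(w_int4):
        # Per-channel path: INT4 -> INT8 directly via a left shift by 4.
        # The implicit factor of 16 is folded into the weight scale, so the
        # scales stay outside the GEMM: s_a * (A @ W) * s_w.
        return w_int4 << 4  # w_int4: (K, N) INT4 values in INT8 storage

    def w4_to_int8_per_group(w_int4, s_group, s_channel, group_size):
        # Per-group path (FusedDequantQuant): group scales vary along K, so
        # they cannot be factored out of the GEMM. Dequantize INT4 -> FP16
        # with the group-wise scales, then requantize FP16 -> INT8 with a
        # single channel-wise scale that the GEMM epilogue can factor out.
        w_fp16 = w_int4.to(torch.float16) * s_group.repeat_interleave(group_size, dim=0)
        return torch.clamp((w_fp16 / s_channel).round(), -128, 127).to(torch.int8)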
jeromeku commented 5 months ago

@HandH1998

Many thanks for the response!

Do you have the script used to test against other methods? Especially interested in reproducing the results against QoQ.

Also, I can't seem to find the FastINT4toINT8 conversion function used for the int4 -> int8 conversion.

HandH1998 commented 5 months ago

> @HandH1998
>
> Many thanks for the response!
>
> Do you have the script used to test against other methods? Especially interested in reproducing the results against QoQ.
>
> Also, I can't seem to find the FastINT4toINT8 conversion function used for the int4 -> int8 conversion.

You can reproduce the QQQ results by following the Usage section of the README.md. As for the FastINT4toINT8 conversion, you can refer to Section 3.3.1 of our paper. It simply performs a left shift by 4 bits to convert INT4 to INT8, in this line: https://github.com/HandH1998/QQQ/blob/49f06e0b47c606ca2c5558ade0805b0609d57a8f/csrc/qqq_gemm.cu#L540.
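
As a toy illustration of that shift (the tensor values here are made up; only the left shift itself comes from the linked kernel line), a signed INT4 value shifted left by 4 bits becomes an INT8 value 16x larger, and that factor of 16 is absorbed into the weight scale:

    import torch

    # Signed INT4 values live in [-8, 7] (stored here in INT8 containers).
    w_int4 = torch.tensor([-8, -1, 0, 3, 7], dtype=torch.int8)

    # Left shift by 4 maps [-8, 7] onto [-128, 112] without any FP math.
    w_int8 = w_int4 << 4
    print(w_int8)  # tensor([-128,  -16,    0,   48,  112], dtype=torch.int8)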

brisker commented 5 months ago

@HandH1998 Is activation quantization in QQQ dynamic or static? (You only mentioned that it is per-token quantization.)

HandH1998 commented 5 months ago

@brisker dynamic quantization
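
For context, a hedged sketch of the difference (not QQQ's actual code): dynamic quantization derives each token's scale from the live activations at runtime, while a static scheme would reuse scales fixed during calibration:

    import torch

    def quantize_per_token_dynamic(x):
        # x: (tokens, hidden) FP16 activations. One symmetric INT8 scale
        # per token, computed from the current batch at runtime.
        scale = x.abs().amax(dim=-1, keepdim=True).float() / 127.0
        x_int8 = torch.clamp((x.float() / scale).round(), -128, 127).to(torch.int8)
        return x_int8, scale

    # A static scheme would instead load `scale` from calibration
    # statistics, skipping the runtime amax/divide entirely.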

brisker commented 5 months ago

@HandH1998 I noticed you compared your accuracy with QServe, but QServe is w4a8 with kv4, while QQQ seems to use an fp16 KV cache. Is this comparison fair?

HandH1998 commented 5 months ago

@brisker As QServe doesn't offer a w4a8f16 precision, we directly compare QQQ with QServe's w4a8kv4. On the other hand, QServe employs various techniques to mitigate the impact of kv4. According to their paper, SmoothAttention reduces perplexity by 0.05 without adding system overhead, progressive group quantization improves perplexity by a further 0.02 with only a negligible increase in dequantization overhead, and activation-aware channel reordering improves perplexity by another 0.03. As illustrated in the figure below, their ablation study shows that with these techniques kv4 increases perplexity by only 0.04 compared to kv8. Since kv8 delivers accuracy almost identical to an fp16 KV cache, the impact of kv4 is negligible.

[figure: QServe ablation study]

brisker commented 5 months ago

@HandH1998 The speedup of QQQ w4a8g128 compared to Marlin w4a16g128 seems very limited; I think this may be due to QQQ's fp16 KV cache. Any plan to try QQQ w4a8g128-kv8?

HandH1998 commented 5 months ago

> @HandH1998 The speedup of QQQ w4a8g128 compared to Marlin w4a16g128 seems very limited; I think this may be due to QQQ's fp16 KV cache. Any plan to try QQQ w4a8g128-kv8?

We think the speedup of QQQ w4a8g128 is limited by the high dtype-conversion overhead between FP16 and INT8, as shown in the picture below. QQQ focuses only on weight quantization, and we don't plan to develop a w4a8g128-kv8 variant. Replacing an fp16 KV cache with kv8 can increase throughput at large batch sizes, but it is not effective at small batch sizes. If you want to try QQQ with a low-bit KV cache, we recommend our vLLM PR, which provides an fp8 KV cache.

[figure: FP16/INT8 dtype conversion overhead]

AniZpZ commented 5 months ago

> @HandH1998 The speedup of QQQ w4a8g128 compared to Marlin w4a16g128 seems very limited; I think this may be due to QQQ's fp16 KV cache. Any plan to try QQQ w4a8g128-kv8?

Thank you for your advice! Currently, prefill speed matters more for most inference cases, while KV-cache quantization mainly improves decode speed. KV8 is already well supported, and you are welcome to combine QQQ with KV-cache quantization methods!

brisker commented 5 months ago

@AniZpZ @HandH1998 In the figure in your paper, there is a w8a8 inference speed. Was this w8a8 speed tested on vLLM? Which version of vLLM? Besides, why is w8a8 even slower than fp16 in your figure?

[figure: inference speed comparison from the paper]

HandH1998 commented 5 months ago

@brisker We developed a new version based on this PR to support dynamic per-token activation quantization. We think the online activation quantization introduces additional overhead, resulting in slower inference than FP16 at smaller batch sizes. However, as the batch size increases, the workload becomes compute-bound, and w8a8 is likely to outperform other quantization methods.
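
A rough way to see that crossover (illustrative numbers, not measurements from the paper): a GEMM's arithmetic intensity grows with batch size, so the fixed cost of the extra quantization kernel is amortized once the matmul becomes compute-bound:

    def arithmetic_intensity(batch, k, n, bytes_per_elem=1):
        # FLOPs per byte moved for a (batch, k) x (k, n) GEMM. At small
        # batch the k*n weight traffic dominates: memory-bound.
        flops = 2 * batch * k * n
        bytes_moved = bytes_per_elem * (batch * k + k * n + batch * n)
        return flops / bytes_moved

    # e.g. k = n = 4096: ~2 FLOPs/byte at batch 1 vs ~455 at batch 256,
    # so per-token quantization overhead matters far less at large batch.
    print(arithmetic_intensity(1, 4096, 4096), arithmetic_intensity(256, 4096, 4096))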

brisker commented 5 months ago

@HandH1998 And the fp16 speed in the figure is the vLLM fp16 speed (already using PagedAttention and other accelerations), not the Hugging Face PyTorch inference speed, right?

HandH1998 commented 5 months ago

> @HandH1998 And the fp16 speed in the figure is the vLLM fp16 speed (already using PagedAttention and other accelerations), not the Hugging Face PyTorch inference speed, right?

Yes.

brisker commented 5 months ago

@HandH1998 @AniZpZ

  1. In the PR you mentioned, how do I save the model in the corresponding w4a8 format to test the w4a8 GEMM? Is it identical to the gptq-marlin w4 storage format?
  2. I used the default code and configs in this repo, except that I commented out these two lines (https://github.com/HandH1998/QQQ/blob/main/examples/quant_model.py#L70 and https://github.com/HandH1998/QQQ/blob/main/examples/quant_model.py#L61, otherwise the loss is NaN), quantized Llama-2-7B, and got the quantized models. I then used something like this to evaluate w4a8 and fp16 inference speed:
    import time
    import torch
    from transformers import AutoModelForCausalLM

    kwargs = {"torch_dtype": torch.float16, "device_map": "auto", "attn_implementation": "eager"}
    fp16_model = AutoModelForCausalLM.from_pretrained(
        args.model_path, trust_remote_code=True, **kwargs
    )
    time1 = time.time()
    # `model` is either the fp16 model or the w4a8 model quantized by QQQ
    output_ids = model.generate(**inputs, max_new_tokens=args.max_new_tokens)
    time2 = time.time()
    print(f"decoding time: {time2 - time1}")

But the w4a8 inference time is nearly double that of fp16. Is there a bug in this repo? (The NaN loss during w4a8 quantization is also odd.)

    fp16 decoding time: 3.2025535106658936
    w4a8 decoding time: 5.649582147598267
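
As an aside, GPU timing around generate() is more reliable with a warm-up pass and explicit synchronization; a minimal sketch (the helper and warm-up length are my own, not from the repo):

    import time
    import torch

    def timed_generate(model, inputs, max_new_tokens):
        model.generate(**inputs, max_new_tokens=8)  # warm-up: compile/cache kernels
        torch.cuda.synchronize()  # drain queued GPU work before starting the clock
        start = time.time()
        out = model.generate(**inputs, max_new_tokens=max_new_tokens)
        torch.cuda.synchronize()  # wait for all generation kernels to finish
        return out, time.time() - start
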
HandH1998 commented 5 months ago

@brisker Responses to your questions:

  1. We use examples/quant_model.py to export the model in the w4a8 format. The corresponding code for this format can be found in QQQ/gptq/qlinear/qlinear_marlin.py. Please note that this format is not identical to the gptq-marlin format.
  2. For the NaN issue, you can try changing the calibrate_path in https://github.com/HandH1998/QQQ/blob/main/quant_config/llama/w4a8.yaml to your Pile dataset directory. The evaluation script you used is similar to our examples/test_model.py, which only employs the w4a8 GEMM without other optimizations like kernel fusion. To achieve the speedup you want, you should use our vLLM PR; this repository primarily focuses on exporting quantized models and evaluating accuracy, rather than directly speeding up inference.

brisker commented 5 months ago

@HandH1998

  1. I already use my own Pile dataset directory.
  2. Even without kernel fusion, PagedAttention, etc., why is the w4a8 GEMM slower than fp16?

HandH1998 commented 5 months ago

@brisker

  1. Could you provide a detailed log of the issue?
  2. Actually, the w4a8 GEMM is always faster than fp16 in our evaluation. We employ online activation quantization, but implement it with simple torch operations in our repo: https://github.com/HandH1998/QQQ/blob/49f06e0b47c606ca2c5558ade0805b0609d57a8f/QQQ/gptq/qlinear/qlinear_marlin.py#L245-L249. When the batch size is small, this significantly slows down inference.

brisker commented 5 months ago
Here is the log:

    (QQQ) root@train-nndf-vllm-2-0:/data1/QQQ-main# python examples/quant_model.py --model_path /dataset/LM-public/LLM/Llama-2-7b --tokenizer_path /dataset/LM-public/LLM/Llama-2-7b --batch_size 8 --dtype float16 --quant_config quant_config/llama/w4a8.yaml --save_path ./debug
    /usr/local/miniconda3/envs/QQQ/lib/python3.10/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
      _torch_pytree._register_pytree_node(
    [the same UserWarning is emitted twice more from generic.py:309]
    We will use 90% of the memory on device 0 for storing the model, and 10% for the buffer to avoid OOM. You can set max_memory in to a higher value to use more memory (at your own risk).
    Loading checkpoint shards: 100%|████| 2/2 [00:19<00:00, 9.54s/it]
    the quantization config is {'a_qconfig': {'quantizer': 'TokenFixedFakeQuantize', 'observer': 'MinMaxObserver', 'bit': 8, 'symmetric': True, 'ch_axis': 0, 'disable_down_proj': False}, 'w_qconfig': {'quantizer': 'FixedQuantize', 'observer': 'MinMaxObserver', 'bit': 4, 'symmetric': True, 'ch_axis': 0}, 'calibrate': 128, 'calibrate_path': '/share/LLM/data/pile/val.jsonl.zst', 'is_remove_padding': True, 'gptq': {'dataset': 'wikitext2', 'sym': True, 'groupsize': -1, 'mse': False, 'act_order': True, 'percdamp': 0.01, 'nsamples': 128, 'wbits': 4, 'static_groups': True}, 'max_length': 2048, 'migrate': False}
    begin building calibration data!
    Saving the dataset (1/1 shards): 100%|████| 128/128 [00:00<00:00, 14583.73 examples/s]
    prepare fp input and output
    begin smooth!
    Enable observer and Enable quantize for fake_quant Calibrate
    the original min range is -4.671875, the original max range is 4.97265625
    the module type is qkv
    the data type is torch.float16, the device is cuda:0
    the activation range is -4.67, 4.97
    the weight range is -0.82, 0.72
    /data1/QQQ-main/QQQ/smooth/quantization/observer.py:147: UserWarning: _aminmax is deprecated as of PyTorch 1.11 and will be removed in a future release. Use aminmax instead. This warning will only appear once per process. (Triggered internally at ../aten/src/ATen/native/TensorCompare.cpp:677.)
      min_val_cur, max_val_cur = torch._aminmax(y, 1)
    0.04 loss at iter 10 [...] 0.05 loss at iter 100
    the best scale is 6.78, best min range is -0.73, best max range is 0.73
    the range of weight becomes -4.65, 4.55
    [per-module calibration records in the same format continue for the o_proj, up_and_gate, and down_proj modules of every layer; one early up_and_gate diverges to 371.34 loss at iter 100, immediately followed by a down_proj with an extreme activation range:]
    the original min range is -109.625, the original max range is 1452.0
    the module type is down_proj
    the data type is torch.float16, the device is cuda:0
    the activation range is -109.62, 1452.00
    the weight range is -1.42, 1.56
    19.31 loss at iter 10 [...] 85.82 loss at iter 2900
    the best scale is 6.94, best min range is -109.62, best max range is 209.12
    the range of weight becomes -6.40, 10.84
    [the log continues through the remaining layers, with qkv losses growing from ~0.2 in early layers to over 100 in late layers, and is cut off mid-record in the thread]
at iter 20 1.79 loss at iter 30 1.79 loss at iter 40 1.79 loss at iter 50 1.92 loss at iter 60 2.53 loss at iter 70 3.40 loss at iter 80 3.73 loss at iter 90 3.73 loss at iter 100 the best scale is 1.50, best min range is -2.89, best max range is 2.89 the range of weight becomes -0.55, 0.54 the original min range is -8.5078125, the original max range is 5.08203125 the module type is up_and_gate the data type is torch.float16, the device is cuda:0 the activation range is -8.51, 5.08 the weight range is -0.30, 0.29 6.11 loss at iter 10 6.12 loss at iter 20 6.13 loss at iter 30 6.19 loss at iter 40 6.31 loss at iter 50 6.67 loss at iter 60 7.96 loss at iter 70 12.75 loss at iter 80 23.24 loss at iter 90 23.26 loss at iter 100 the best scale is 1.00, best min range is -8.51, best max range is 5.08 the range of weight becomes -0.30, 0.29 the original min range is -20.75, the original max range is 22.578125 the module type is down_proj the data type is torch.float16, the device is cuda:0 the activation range is -20.75, 22.58 the weight range is -0.71, 0.73 9.72 loss at iter 10 9.69 loss at iter 20 9.68 loss at iter 30 9.68 loss at iter 40 9.90 loss at iter 50 11.12 loss at iter 60 15.84 loss at iter 70 30.08 loss at iter 80 79.97 loss at iter 90 143.31 loss at iter 100 the best scale is 1.49, best min range is -15.17, best max range is 15.16 the range of weight becomes -0.71, 0.73 the original min range is -14.453125, the original max range is 17.5 the module type is qkv the data type is torch.float16, the device is cuda:0 the activation range is -14.45, 17.50 the weight range is -0.59, 0.88 22.00 loss at iter 10 21.40 loss at iter 20 22.88 loss at iter 30 20.25 loss at iter 40 21.11 loss at iter 50 22.12 loss at iter 60 47.24 loss at iter 70 159.28 loss at iter 80 213.67 loss at iter 90 205.85 loss at iter 100 the best scale is 1.66, best min range is -10.54, best max range is 10.54 the range of weight becomes -0.59, 0.88 the original min range is -5.5, the original max range is 6.7734375 the module type is o_proj the data type is torch.float16, the device is cuda:0 the activation range is -5.50, 6.77 the weight range is -0.74, 0.77 2.90 loss at iter 10 2.91 loss at iter 20 2.90 loss at iter 30 2.91 loss at iter 40 2.91 loss at iter 50 2.97 loss at iter 60 3.35 loss at iter 70 5.13 loss at iter 80 7.45 loss at iter 90 7.63 loss at iter 100 the best scale is 1.07, best min range is -5.50, best max range is 6.30 the range of weight becomes -0.74, 0.77 the original min range is -8.1875, the original max range is 5.609375 the module type is up_and_gate the data type is torch.float16, the device is cuda:0 the activation range is -8.19, 5.61 the weight range is -0.34, 0.37 6.53 loss at iter 10 6.54 loss at iter 20 6.56 loss at iter 30 6.61 loss at iter 40 6.74 loss at iter 50 7.20 loss at iter 60 8.74 loss at iter 70 14.69 loss at iter 80 24.33 loss at iter 90 24.36 loss at iter 100 the best scale is 1.00, best min range is -8.19, best max range is 5.61 the range of weight becomes -0.34, 0.37 the original min range is -25.0625, the original max range is 23.484375 the module type is down_proj the data type is torch.float16, the device is cuda:0 the activation range is -25.06, 23.48 the weight range is -0.39, 0.33 10.21 loss at iter 10 10.21 loss at iter 20 10.21 loss at iter 30 10.25 loss at iter 40 10.56 loss at iter 50 11.90 loss at iter 60 16.80 loss at iter 70 31.94 loss at iter 80 87.03 loss at iter 90 161.81 loss at iter 100 the best scale is 1.39, best min range is -18.08, best max range is 
18.08 the range of weight becomes -0.39, 0.33 the original min range is -14.5, the original max range is 16.03125 the module type is qkv the data type is torch.float16, the device is cuda:0 the activation range is -14.50, 16.03 the weight range is -0.82, 0.69 22.18 loss at iter 10 22.97 loss at iter 20 24.90 loss at iter 30 27.05 loss at iter 40 24.52 loss at iter 50 29.62 loss at iter 60 71.34 loss at iter 70 179.91 loss at iter 80 208.45 loss at iter 90 205.40 loss at iter 100 the best scale is 1.04, best min range is -14.50, best max range is 15.38 the range of weight becomes -0.82, 0.69 the original min range is -6.5, the original max range is 3.759765625 the module type is o_proj the data type is torch.float16, the device is cuda:0 the activation range is -6.50, 3.76 the weight range is -0.48, 0.48 2.61 loss at iter 10 2.61 loss at iter 20 2.62 loss at iter 30 2.63 loss at iter 40 2.67 loss at iter 50 2.81 loss at iter 60 3.44 loss at iter 70 5.65 loss at iter 80 7.39 loss at iter 90 7.50 loss at iter 100 the best scale is 1.15, best min range is -5.67, best max range is 3.76 the range of weight becomes -0.48, 0.48 the original min range is -7.75, the original max range is 6.08203125 the module type is up_and_gate the data type is torch.float16, the device is cuda:0 the activation range is -7.75, 6.08 the weight range is -0.36, 0.42 8.01 loss at iter 10 8.01 loss at iter 20 8.01 loss at iter 30 8.07 loss at iter 40 8.27 loss at iter 50 9.27 loss at iter 60 11.89 loss at iter 70 20.00 loss at iter 80 28.12 loss at iter 90 28.08 loss at iter 100 the best scale is 1.33, best min range is -5.84, best max range is 5.84 the range of weight becomes -0.36, 0.42 the original min range is -31.5625, the original max range is 36.0625 the module type is down_proj the data type is torch.float16, the device is cuda:0 the activation range is -31.56, 36.06 the weight range is -0.61, 1.31 12.43 loss at iter 10 12.42 loss at iter 20 12.40 loss at iter 30 12.40 loss at iter 40 12.54 loss at iter 50 13.16 loss at iter 60 16.19 loss at iter 70 28.25 loss at iter 80 79.99 loss at iter 90 195.18 loss at iter 100 the best scale is 1.54, best min range is -23.47, best max range is 23.48 the range of weight becomes -0.61, 1.31 the original min range is -14.734375, the original max range is 15.3828125 the module type is qkv the data type is torch.float16, the device is cuda:0 the activation range is -14.73, 15.38 the weight range is -0.60, 1.03 12.14 loss at iter 10 12.49 loss at iter 20 12.91 loss at iter 30 12.45 loss at iter 40 14.85 loss at iter 50 22.68 loss at iter 60 70.38 loss at iter 70 157.74 loss at iter 80 168.66 loss at iter 90 166.85 loss at iter 100 the best scale is 1.16, best min range is -13.25, best max range is 13.24 the range of weight becomes -0.60, 1.03 the original min range is -5.03125, the original max range is 5.70703125 the module type is o_proj the data type is torch.float16, the device is cuda:0 the activation range is -5.03, 5.71 the weight range is -0.73, 0.65 2.67 loss at iter 10 2.66 loss at iter 20 2.66 loss at iter 30 2.64 loss at iter 40 2.68 loss at iter 50 2.81 loss at iter 60 3.57 loss at iter 70 5.61 loss at iter 80 7.20 loss at iter 90 7.28 loss at iter 100 the best scale is 1.73, best min range is -3.30, best max range is 3.30 the range of weight becomes -0.73, 0.65 the original min range is -8.5078125, the original max range is 6.39453125 the module type is up_and_gate the data type is torch.float16, the device is cuda:0 the activation range is -8.51, 6.39 the weight 
range is -0.35, 0.31 8.14 loss at iter 10 8.13 loss at iter 20 8.13 loss at iter 30 8.16 loss at iter 40 8.27 loss at iter 50 8.69 loss at iter 60 10.34 loss at iter 70 17.09 loss at iter 80 28.30 loss at iter 90 28.33 loss at iter 100 the best scale is 1.22, best min range is -7.00, best max range is 6.39 the range of weight becomes -0.35, 0.31 the original min range is -23.1875, the original max range is 26.84375 the module type is down_proj the data type is torch.float16, the device is cuda:0 the activation range is -23.19, 26.84 the weight range is -0.66, 0.73 12.33 loss at iter 10 12.25 loss at iter 20 12.18 loss at iter 30 12.21 loss at iter 40 12.67 loss at iter 50 14.73 loss at iter 60 21.76 loss at iter 70 41.25 loss at iter 80 111.07 loss at iter 90 197.38 loss at iter 100 the best scale is 1.49, best min range is -18.02, best max range is 18.02 the range of weight becomes -0.66, 0.73 the original min range is -13.59375, the original max range is 13.953125 the module type is qkv the data type is torch.float16, the device is cuda:0 the activation range is -13.59, 13.95 the weight range is -0.64, 0.58 20.61 loss at iter 10 20.83 loss at iter 20 19.56 loss at iter 30 17.19 loss at iter 40 21.61 loss at iter 50 50.43 loss at iter 60 144.57 loss at iter 70 249.55 loss at iter 80 268.50 loss at iter 90 268.79 loss at iter 100 the best scale is 1.58, best min range is -8.83, best max range is 8.83 the range of weight becomes -0.64, 0.58 the original min range is -5.00390625, the original max range is 4.2109375 the module type is o_proj the data type is torch.float16, the device is cuda:0 the activation range is -5.00, 4.21 the weight range is -0.44, 0.65 2.32 loss at iter 10 2.32 loss at iter 20 2.32 loss at iter 30 2.33 loss at iter 40 2.39 loss at iter 50 2.71 loss at iter 60 3.77 loss at iter 70 5.33 loss at iter 80 6.07 loss at iter 90 6.09 loss at iter 100 the best scale is 1.15, best min range is -4.37, best max range is 4.21 the range of weight becomes -0.44, 0.65 the original min range is -8.6328125, the original max range is 7.19140625 the module type is up_and_gate the data type is torch.float16, the device is cuda:0 the activation range is -8.63, 7.19 the weight range is -0.55, 0.46 10.11 loss at iter 10 10.11 loss at iter 20 10.14 loss at iter 30 10.24 loss at iter 40 10.60 loss at iter 50 11.41 loss at iter 60 14.12 loss at iter 70 24.32 loss at iter 80 37.30 loss at iter 90 37.30 loss at iter 100 the best scale is 1.01, best min range is -8.55, best max range is 7.19 the range of weight becomes -0.55, 0.46 the original min range is -28.03125, the original max range is 42.28125 the module type is down_proj the data type is torch.float16, the device is cuda:0 the activation range is -28.03, 42.28 the weight range is -1.03, 0.32 15.27 loss at iter 10 15.27 loss at iter 20 15.27 loss at iter 30 15.29 loss at iter 40 15.41 loss at iter 50 15.99 loss at iter 60 19.01 loss at iter 70 32.20 loss at iter 80 89.95 loss at iter 90 239.02 loss at iter 100 the best scale is 1.03, best min range is -28.03, best max range is 41.03 the range of weight becomes -1.03, 0.32 the original min range is -16.328125, the original max range is 14.9765625 the module type is qkv the data type is torch.float16, the device is cuda:0 the activation range is -16.33, 14.98 the weight range is -0.86, 0.69 15.88 loss at iter 10 14.18 loss at iter 20 14.75 loss at iter 30 13.85 loss at iter 40 14.46 loss at iter 50 20.92 loss at iter 60 63.70 loss at iter 70 182.89 loss at iter 80 224.62 loss at iter 90 
223.63 loss at iter 100 the best scale is 1.61, best min range is -10.16, best max range is 10.16 the range of weight becomes -0.86, 0.69 the original min range is -5.93359375, the original max range is 6.46875 the module type is o_proj the data type is torch.float16, the device is cuda:0 the activation range is -5.93, 6.47 the weight range is -0.69, 0.63 4.97 loss at iter 10 4.94 loss at iter 20 4.93 loss at iter 30 4.92 loss at iter 40 4.94 loss at iter 50 5.15 loss at iter 60 6.13 loss at iter 70 9.20 loss at iter 80 11.92 loss at iter 90 11.99 loss at iter 100 the best scale is 1.62, best min range is -3.99, best max range is 3.99 the range of weight becomes -0.69, 0.63 the original min range is -9.53125, the original max range is 7.609375 the module type is up_and_gate the data type is torch.float16, the device is cuda:0 the activation range is -9.53, 7.61 the weight range is -0.41, 0.30 10.76 loss at iter 10 10.76 loss at iter 20 10.75 loss at iter 30 10.82 loss at iter 40 11.18 loss at iter 50 12.00 loss at iter 60 14.68 loss at iter 70 24.42 loss at iter 80 43.12 loss at iter 90 43.20 loss at iter 100 the best scale is 1.31, best min range is -7.27, best max range is 7.27 the range of weight becomes -0.41, 0.30 the original min range is -44.5625, the original max range is 46.65625 the module type is down_proj the data type is torch.float16, the device is cuda:0 the activation range is -44.56, 46.66 the weight range is -0.40, 0.58 16.30 loss at iter 10 16.26 loss at iter 20 16.24 loss at iter 30 16.27 loss at iter 40 16.52 loss at iter 50 17.55 loss at iter 60 21.72 loss at iter 70 38.46 loss at iter 80 109.69 loss at iter 90 283.09 loss at iter 100 the best scale is 1.45, best min range is -32.22, best max range is 32.22 the range of weight becomes -0.40, 0.58 the original min range is -11.921875, the original max range is 13.2890625 the module type is qkv the data type is torch.float16, the device is cuda:0 the activation range is -11.92, 13.29 the weight range is -0.79, 0.71 26.32 loss at iter 10 25.29 loss at iter 20 26.24 loss at iter 30 26.86 loss at iter 40 25.85 loss at iter 50 39.34 loss at iter 60 132.46 loss at iter 70 287.45 loss at iter 80 318.76 loss at iter 90 318.90 loss at iter 100 the best scale is 1.23, best min range is -10.78, best max range is 10.78 the range of weight becomes -0.79, 0.71 the original min range is -6.17578125, the original max range is 6.16796875 the module type is o_proj the data type is torch.float16, the device is cuda:0 the activation range is -6.18, 6.17 the weight range is -0.56, 0.63 4.35 loss at iter 10 4.32 loss at iter 20 4.30 loss at iter 30 4.26 loss at iter 40 4.26 loss at iter 50 4.39 loss at iter 60 5.51 loss at iter 70 8.02 loss at iter 80 9.23 loss at iter 90 9.24 loss at iter 100 the best scale is 1.79, best min range is -3.44, best max range is 3.44 the range of weight becomes -0.56, 0.63 the original min range is -10.5078125, the original max range is 8.0390625 the module type is up_and_gate the data type is torch.float16, the device is cuda:0 the activation range is -10.51, 8.04 the weight range is -0.63, 0.30 13.07 loss at iter 10 13.07 loss at iter 20 13.06 loss at iter 30 13.17 loss at iter 40 13.46 loss at iter 50 14.54 loss at iter 60 17.81 loss at iter 70 30.69 loss at iter 80 59.42 loss at iter 90 59.48 loss at iter 100 the best scale is 1.37, best min range is -7.70, best max range is 7.70 the range of weight becomes -0.63, 0.30 the original min range is -40.5, the original max range is 38.59375 the module type is 
down_proj the data type is torch.float16, the device is cuda:0 the activation range is -40.50, 38.59 the weight range is -0.76, 0.50 20.59 loss at iter 10 20.55 loss at iter 20 20.53 loss at iter 30 20.55 loss at iter 40 20.89 loss at iter 50 23.04 loss at iter 60 31.54 loss at iter 70 59.09 loss at iter 80 161.67 loss at iter 90 327.46 loss at iter 100 the best scale is 1.49, best min range is -27.19, best max range is 27.19 the range of weight becomes -0.76, 0.50 the original min range is -12.546875, the original max range is 12.65625 the module type is qkv the data type is torch.float16, the device is cuda:0 the activation range is -12.55, 12.66 the weight range is -0.77, 0.63 23.53 loss at iter 10 23.66 loss at iter 20 24.06 loss at iter 30 24.19 loss at iter 40 23.25 loss at iter 50 52.22 loss at iter 60 164.42 loss at iter 70 322.55 loss at iter 80 355.16 loss at iter 90 354.91 loss at iter 100 the best scale is 1.98, best min range is -6.38, best max range is 6.38 the range of weight becomes -0.77, 0.64 the original min range is -4.50390625, the original max range is 5.98828125 the module type is o_proj the data type is torch.float16, the device is cuda:0 the activation range is -4.50, 5.99 the weight range is -0.78, 0.39 4.51 loss at iter 10 4.50 loss at iter 20 4.49 loss at iter 30 4.48 loss at iter 40 4.50 loss at iter 50 4.94 loss at iter 60 6.83 loss at iter 70 9.87 loss at iter 80 11.26 loss at iter 90 11.33 loss at iter 100 the best scale is 1.73, best min range is -3.46, best max range is 3.46 the range of weight becomes -0.78, 0.40 the original min range is -11.421875, the original max range is 8.328125 the module type is up_and_gate the data type is torch.float16, the device is cuda:0 the activation range is -11.42, 8.33 the weight range is -0.40, 0.29 16.65 loss at iter 10 16.65 loss at iter 20 16.68 loss at iter 30 16.91 loss at iter 40 17.63 loss at iter 50 19.43 loss at iter 60 25.54 loss at iter 70 45.86 loss at iter 80 91.45 loss at iter 90 91.57 loss at iter 100 the best scale is 1.23, best min range is -9.27, best max range is 8.33 the range of weight becomes -0.40, 0.29 the original min range is -29.5, the original max range is 36.9375 the module type is down_proj the data type is torch.float16, the device is cuda:0 the activation range is -29.50, 36.94 the weight range is -1.00, 0.57 26.45 loss at iter 10 26.46 loss at iter 20 26.45 loss at iter 30 26.48 loss at iter 40 26.81 loss at iter 50 28.54 loss at iter 60 37.01 loss at iter 70 68.02 loss at iter 80 183.29 loss at iter 90 361.07 loss at iter 100 the best scale is 1.14, best min range is -29.50, best max range is 32.53 the range of weight becomes -1.00, 0.57 the original min range is -12.6953125, the original max range is 12.1640625 the module type is qkv the data type is torch.float16, the device is cuda:0 the activation range is -12.70, 12.16 the weight range is -0.74, 0.82 24.41 loss at iter 10 25.13 loss at iter 20 22.57 loss at iter 30 20.66 loss at iter 40 22.86 loss at iter 50 34.05 loss at iter 60 76.66 loss at iter 70 195.94 loss at iter 80 258.97 loss at iter 90 260.01 loss at iter 100 the best scale is 1.71, best min range is -7.41, best max range is 7.41 the range of weight becomes -0.74, 0.82 the original min range is -3.896484375, the original max range is 4.59375 the module type is o_proj the data type is torch.float16, the device is cuda:0 the activation range is -3.90, 4.59 the weight range is -0.59, 0.63 4.05 loss at iter 10 4.05 loss at iter 20 4.04 loss at iter 30 4.09 loss at iter 40 
4.17 loss at iter 50 4.90 loss at iter 60 6.82 loss at iter 70 8.64 loss at iter 80 9.26 loss at iter 90 9.27 loss at iter 100 the best scale is 1.38, best min range is -3.34, best max range is 3.34 the range of weight becomes -0.59, 0.66 the original min range is -11.8515625, the original max range is 9.1953125 the module type is up_and_gate the data type is torch.float16, the device is cuda:0 the activation range is -11.85, 9.20 the weight range is -0.51, 0.55 25.19 loss at iter 10 25.20 loss at iter 20 24.99 loss at iter 30 25.56 loss at iter 40 27.68 loss at iter 50 31.75 loss at iter 60 42.52 loss at iter 70 71.88 loss at iter 80 139.81 loss at iter 90 140.06 loss at iter 100 the best scale is 1.40, best min range is -8.45, best max range is 8.45 the range of weight becomes -0.51, 0.55 the original min range is -28.953125, the original max range is 200.0 the module type is down_proj the data type is torch.float16, the device is cuda:0 the activation range is -28.95, 200.00 the weight range is -0.98, 0.50 38.25 loss at iter 10 38.38 loss at iter 20 38.50 loss at iter 30 38.60 loss at iter 40 38.87 loss at iter 50 39.03 loss at iter 60 39.21 loss at iter 70 39.34 loss at iter 80 39.58 loss at iter 90 40.03 loss at iter 100 40.36 loss at iter 110 40.86 loss at iter 120 41.52 loss at iter 130 42.07 loss at iter 140 42.76 loss at iter 150 43.75 loss at iter 160 44.79 loss at iter 170 46.23 loss at iter 180 47.71 loss at iter 190 49.49 loss at iter 200 51.72 loss at iter 210 54.38 loss at iter 220 57.63 loss at iter 230 61.69 loss at iter 240 66.53 loss at iter 250 72.56 loss at iter 260 80.37 loss at iter 270 90.41 loss at iter 280 102.01 loss at iter 290 118.01 loss at iter 300 137.61 loss at iter 310 163.77 loss at iter 320 200.48 loss at iter 330 246.62 loss at iter 340 305.16 loss at iter 350 380.65 loss at iter 360 475.29 loss at iter 370 585.69 loss at iter 380 705.55 loss at iter 390 775.11 loss at iter 400 the best scale is 1.00, best min range is -28.95, best max range is 200.00 the range of weight becomes -0.98, 0.50 the original min range is -12.2421875, the original max range is 9.4765625 the module type is qkv the data type is torch.float16, the device is cuda:0 the activation range is -12.24, 9.48 the weight range is -0.75, 0.82 49.53 loss at iter 10 49.31 loss at iter 20 47.80 loss at iter 30 50.71 loss at iter 40 47.56 loss at iter 50 63.99 loss at iter 60 171.92 loss at iter 70 339.59 loss at iter 80 409.09 loss at iter 90 409.08 loss at iter 100 the best scale is 1.95, best min range is -6.30, best max range is 6.29 the range of weight becomes -0.84, 0.97 the original min range is -6.50390625, the original max range is 9.0546875 the module type is o_proj the data type is torch.float16, the device is cuda:0 the activation range is -6.50, 9.05 the weight range is -0.52, 0.62 6.41 loss at iter 10 6.40 loss at iter 20 6.37 loss at iter 30 6.47 loss at iter 40 6.56 loss at iter 50 7.04 loss at iter 60 8.63 loss at iter 70 14.16 loss at iter 80 22.00 loss at iter 90 23.01 loss at iter 100 the best scale is 1.40, best min range is -6.46, best max range is 6.46 the range of weight becomes -0.67, 0.62 the original min range is -12.46875, the original max range is 16.859375 the module type is up_and_gate the data type is torch.float16, the device is cuda:0 the activation range is -12.47, 16.86 the weight range is -0.80, 0.71 1022.35 loss at iter 10 885.84 loss at iter 20 730.96 loss at iter 30 397.03 loss at iter 40 215.40 loss at iter 50 126.06 loss at iter 60 140.95 loss at iter 
70 207.01 loss at iter 80 476.61 loss at iter 90 553.08 loss at iter 100 the best scale is 2.68, best min range is -6.30, best max range is 6.30 the range of weight becomes -0.80, 1.00 the original min range is -153.125, the original max range is inf the module type is down_proj the data type is torch.float16, the device is cuda:0 the activation range is -153.12, inf the weight range is -1.30, 1.05 Traceback (most recent call last): File "/data1/QQQ-main/examples/quant_model.py", line 88, in main() File "/usr/local/miniconda3/envs/QQQ/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context return func(*args, kwargs) File "/data1/QQQ-main/examples/quant_model.py", line 61, in main scale_list = smooth(model, tokenizer, q_config, args) File "/usr/local/miniconda3/envs/QQQ/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context return func(*args, *kwargs) File "/data1/QQQ-main/QQQ/smooth/smooth.py", line 138, in smooth calibrate_batch(model, [fp_input[0]]) File "/usr/local/miniconda3/envs/QQQ/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context return func(args, kwargs) File "/data1/QQQ-main/QQQ/smooth/smooth.py", line 94, in calibrate_batch model(batch) File "/usr/local/miniconda3/envs/QQQ/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl return self._call_impl(*args, *kwargs) File "/usr/local/miniconda3/envs/QQQ/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl return forward_call(args, kwargs) File "/data1/QQQ-main/QQQ/smooth/models/quant_llama.py", line 795, in forward outputs = self.model( File "/usr/local/miniconda3/envs/QQQ/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl return self._call_impl(*args, kwargs) File "/usr/local/miniconda3/envs/QQQ/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl return forward_call(*args, *kwargs) File "/data1/QQQ-main/QQQ/smooth/models/quant_llama.py", line 651, in forward layer_outputs = decoder_layer( File "/usr/local/miniconda3/envs/QQQ/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl return self._call_impl(args, kwargs) File "/usr/local/miniconda3/envs/QQQ/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl return forward_call(*args, **kwargs) File "/data1/QQQ-main/QQQ/smooth/models/quant_llama.py", line 451, in forward best_scale = migration( File "/data1/QQQ-main/QQQ/smooth/quantization/migration_llama.py", line 28, in migration migrator = search_class(act, weight, a_qconfig, w_qconfig, module_type, extra_dict) File "/data1/QQQ-main/QQQ/smooth/quantization/migration_llama.py", line 246, in init self.num = max(100, int(self.amx / 0.5)) OverflowError: cannot convert float infinity to integer
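
The crash point itself is mechanical: once the tracked fp16 activation max overflows to inf, int(self.amx / 0.5) in migration_llama.py can no longer be evaluated. A minimal defensive check (a hypothetical patch sketch, not code from this repo; sanitize_range and its placement are invented for illustration) would turn the OverflowError into an actionable message:

    import math

    import torch

    FP16_MAX = float(torch.finfo(torch.float16).max)  # 65504.0

    def sanitize_range(amin: float, amax: float) -> tuple[float, float]:
        # Hypothetical guard: the migration search sizes its grid from the
        # activation max (int(amax / 0.5)), which raises OverflowError once
        # the fp16 forward pass has overflowed to inf.
        if not (math.isfinite(amin) and math.isfinite(amax)):
            raise ValueError(
                f"non-finite activation range ({amin}, {amax}); "
                "the fp16 forward pass likely overflowed -- try calibrating in float32"
            )
        # Clamp huge-but-finite ranges to what fp16 can represent.
        return max(amin, -FP16_MAX), min(amax, FP16_MAX)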

  1. So in the real vLLM PR, is the activation quantization also a simple PyTorch operation?

image

HandH1998 commented 5 months ago

@brisker

  1. What version of the Transformers library are you using? I ran the same script with transformers==4.36.2 and everything worked as expected. If you are already on the right Transformers version, you can try another Llama model such as Llama2-13b, or another calibration dataset such as wikitext2.
  2. The code in your picture is just a unit test. In actual inference, the activation quantization uses https://github.com/vllm-project/vllm/blob/614aa5120303ab09be78fb1db669da198cc43b02/csrc/quantization/compressed_tensors/int8_quant_kernels.cu#L43-L71.
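
For readers without the screenshot: dynamic per-token symmetric int8 quantization is indeed a short PyTorch operation. The sketch below illustrates the idea only (the linked CUDA kernel fuses this into one pass rather than materializing intermediates):

    import torch

    def per_token_int8_quant(x: torch.Tensor):
        # One scale per token (row): scale_i = max(|x_i|) / 127.
        scales = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 127.0
        # Round and saturate to the int8 range.
        q = torch.round(x / scales).clamp(-128, 127).to(torch.int8)
        return q, scales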
brisker commented 5 months ago

@HandH1998

  1. Here is my pip list; the transformers version is identical to yours. Any other differences? (Llama2-13b also fails, with the loss rising steadily.) I still suspect there is some difference between the code in this repo and your local code.

    Package                  Version     Editable project location
    ------------------------ ----------- ----------------------------------
    absl-py                  2.1.0
    accelerate               0.27.2
    aiohttp                  3.9.5
    aiosignal                1.3.1
    async-timeout            4.0.3
    attrs                    23.2.0
    certifi                  2024.6.2
    chardet                  5.2.0
    charset-normalizer       3.3.2
    click                    8.1.7
    colorama                 0.4.6
    contourpy                1.2.1
    cycler                   0.12.1
    DataProperty             1.0.1
    datasets                 2.17.1
    dill                     0.3.8
    easydict                 1.13
    evaluate                 0.4.2       /data1/QQQ-main/evaluate-0.4.2/src
    filelock                 3.15.4
    fonttools                4.53.0
    frozenlist               1.4.1
    fsspec                   2023.10.0
    huggingface-hub          0.20.3
    idna                     3.7
    Jinja2                   3.1.4
    joblib                   1.4.2
    jsonlines                4.0.0
    kiwisolver               1.4.5
    lm_eval                  0.4.2       /data1/QQQ-main/lm_eval
    lxml                     5.2.2
    MarkupSafe               2.1.5
    matplotlib               3.9.0
    mbstrdecoder             1.1.3
    more-itertools           10.3.0
    mpmath                   1.3.0
    multidict                6.0.5
    multiprocess             0.70.16
    networkx                 3.3
    nltk                     3.8.1
    numexpr                  2.10.1
    numpy                    1.26.4
    nvidia-cublas-cu12       12.1.3.1
    nvidia-cuda-cupti-cu12   12.1.105
    nvidia-cuda-nvrtc-cu12   12.1.105
    nvidia-cuda-runtime-cu12 12.1.105
    nvidia-cudnn-cu12        8.9.2.26
    nvidia-cufft-cu12        11.0.2.54
    nvidia-curand-cu12       10.3.2.106
    nvidia-cusolver-cu12     11.4.5.107
    nvidia-cusparse-cu12     12.1.0.106
    nvidia-nccl-cu12         2.19.3
    nvidia-nvjitlink-cu12    12.5.40
    nvidia-nvtx-cu12         12.1.105
    packaging                24.1
    pandas                   2.2.2
    pathvalidate             3.2.0
    peft                     0.11.1
    pillow                   10.3.0
    pip                      24.1.1
    portalocker              2.10.0
    psutil                   6.0.0
    pyarrow                  16.1.0
    pyarrow-hotfix           0.6
    pybind11                 2.13.1
    pyparsing                3.1.2
    pytablewriter            1.2.0
    python-dateutil          2.9.0.post0
    pytz                     2024.1
    PyYAML                   6.0.1
    QQQ                      0.0.0       /data1/QQQ-main
    regex                    2024.5.15
    requests                 2.32.3
    rouge_score              0.1.2
    sacrebleu                2.4.2
    safetensors              0.4.3
    scikit-learn             1.5.0
    scipy                    1.14.0
    sentencepiece            0.2.0
    setuptools               70.0.0
    six                      1.16.0
    sqlitedict               2.1.0
    sympy                    1.12.1
    tabledata                1.3.3
    tabulate                 0.9.0
    tcolorpy                 0.1.6
    threadpoolctl            3.5.0
    tokenizers               0.15.2
    torch                    2.2.1
    tqdm                     4.66.4
    tqdm-multiprocess        0.0.11
    transformers             4.36.2
    triton                   2.2.0
    typepy                   1.3.2
    typing_extensions        4.12.2
    tzdata                   2024.1
    urllib3                  2.2.2
    wheel                    0.43.0
    word2number              1.1
    xxhash                   3.4.1
    yarl                     1.9.4
    zstandard                0.22.0
  2. w4a8 is well supported in this branch: https://github.com/HandH1998/vllm/tree/w4a8 , which has not been merged yet, right?

HandH1998 commented 5 months ago

@brisker

  1. The other packages in your pip list don't matter. I ran the GitHub code and got the same result; since we set the random seed in the code, you should get the same result too. Maybe the dataset differs? Could you give me your email? I will send you the calibration dataset we are using. image

  2. Yes. But we modified our code according to the vLLM team's advice. If you want to reproduce the speedup in our paper, you can try the original vLLM w4a8 branch: https://github.com/HandH1998/vllm/tree/w4a8-fusion.

brisker commented 5 months ago

@HandH1998 Thanks for the reply! My email is hellowd@sjtu.edu.cn. The dataset I am using comes from the mit-han-lab (Song Han's group) Hugging Face Hub homepage, so I think we may be using the same one.

brisker commented 5 months ago

@HandH1998 Using your Pile data, the loss still rises and I eventually hit the same error as before. I am running on an A800 GPU, which I don't think can cause the nan loss.
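
One generic way to localize where the fp16 activations first go non-finite during calibration (a debugging sketch, not code from the QQQ repo; install_overflow_hooks is an invented helper) is to hook every leaf module:

    import torch

    def install_overflow_hooks(model: torch.nn.Module):
        # Register a forward hook on every leaf module; the first name printed
        # during a calibration forward pass is the earliest overflow site.
        handles = []

        def make_hook(name):
            def hook(module, inputs, output):
                if isinstance(output, torch.Tensor) and not torch.isfinite(output).all():
                    print(f"non-finite output in: {name}")
            return hook

        for name, module in model.named_modules():
            if next(module.children(), None) is None:  # leaf modules only
                handles.append(module.register_forward_hook(make_hook(name)))
        return handles  # call h.remove() on each handle when done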

brisker commented 5 months ago

@HandH1998 I tried the w4a8-fusion branch of vllm, but when installing it I got this error:

-- CUDA target arches: 80-real
[0/8] Performing download step (git clone) for 'cutlass-populate'
Cloning into 'cutlass-src'...
fatal: unable to access 'https://github.com/nvidia/cutlass.git/': Failed to connect to github.com port 443: Connection refused
Cloning into 'cutlass-src'...
fatal: unable to access 'https://github.com/nvidia/cutlass.git/': Failed to connect to github.com port 443: Connection refused
Cloning into 'cutlass-src'...
fatal: unable to access 'https://github.com/nvidia/cutlass.git/': Failed to connect to github.com port 443: Connection refused
-- Had to git clone more than once: 3 times.
CMake Error at cutlass-subbuild/cutlass-populate-prefix/tmp/cutlass-populate-gitclone.cmake:39 (message):
  Failed to clone repository: 'https://github.com/nvidia/cutlass.git'

FAILED: cutlass-populate-prefix/src/cutlass-populate-stamp/cutlass-populate-download /data1/vllm-w4a8-fusion/build/temp.linux-x86_64-cpython-310/_deps/cutlass-subbuild/cutlass-populate-prefix/src/cutlass-populate-stamp/cutlass-populate-download
cd /data1/vllm-w4a8-fusion/build/temp.linux-x86_64-cpython-310/_deps && /opt/python-3.10.12/lib/python3.10/site-packages/cmake/data/bin/cmake -P /data1/vllm-w4a8-fusion/build/temp.linux-x86_64-cpython-310/_deps/cutlass-subbuild/cutlass-populate-prefix/tmp/cutlass-populate-gitclone.cmake && /opt/python-3.10.12/lib/python3.10/site-packages/cmake/data/bin/cmake -E touch /data1/vllm-w4a8-fusion/build/temp.linux-x86_64-cpython-310/_deps/cutlass-subbuild/cutlass-populate-prefix/src/cutlass-populate-stamp/cutlass-populate-download
ninja: build stopped: subcommand failed.

Currently I cannot access github.com to download files during the build, so I tried building cutlass myself. But even after building cutlass from source successfully, rerunning the vllm install fails with the same error. Any advice on this? Thanks in advance.

HandH1998 commented 5 months ago

@brisker I have never encountered this problem...

brisker commented 5 months ago

@HandH1998 So which version of cutlass are you using in the w4a8-fusion branch here?

HandH1998 commented 5 months ago

@HandH1998 So which version of cutlass are you using in the w4a8-fusion branch here?

The cutlass version can be found in CMakeLists.txt.

brisker commented 5 months ago

@HandH1998 I have successfully quantized and run inference with w4a8 (per-channel w4, no grouping) in vllm, using the QQQ-quantized models and the demo you provided here

(the following speed results are reported directly by vllm on the command line)

w4a8  Processed prompts: 100%|█████████████████████████████████████| 4/4 [00:00<00:00, 28.97it/s, Generation Speed: 463.72 toks/s]
fp16   Processed prompts: 100%|█████████████████████████████████████| 4/4 [00:00<00:00, 19.37it/s, Generation Speed: 309.96 toks/s]
  1. Is that speed normal, as expected?
  2. Using your Pile data I still get the nan loss; any further advice? (I am also wondering: even if the quantized model's accuracy is wrong, is it still a valid w4a8 model, which would explain why it produces the expected inference speed?)
HandH1998 commented 5 months ago

@brisker

  1. The speed looks normal.
  2. I have no idea about the nan loss. Maybe you can try setting the dtype to torch.float32 when smoothing the model, and then setting the dtype back to 'half' before GPTQ. But I am not sure whether this is the right way to solve the nan issue, or how it will affect accuracy. The model's accuracy doesn't affect the inference speed, i.e., a wrongly quantized model and a correct one will run at the same speed.
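
A sketch of that suggestion (hypothetical driver code; the real quant_model.py arguments and entry points may differ): run the smoothing stage in float32, then cast back to half before the GPTQ stage:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "meta-llama/Llama-2-7b-hf"  # assumed model; substitute your own path
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.float32,  # smooth in fp32 so activations cannot overflow to inf
    )
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    # scale_list = smooth(model, tokenizer, q_config, args)  # fp32 smoothing pass

    model = model.half()  # back to fp16 ('half') before the GPTQ stage
    # then run GPTQ / weight packing as in the repo's quant_model.py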