apple / coremltools

Core ML tools contain supporting tools for Core ML model conversion, editing, and validation.
https://coremltools.readme.io
BSD 3-Clause "New" or "Revised" License

8.0b2 linear_quantize_activations crash #2321

Open FrancisCaig opened 3 weeks ago

FrancisCaig commented 3 weeks ago

🐞Describing the bug

I'm using 8.0b2 to quantize my mlpackage (a ResNet plus transformer model). It crashed after printing 'calibration success'.

Debug: in `_model_debugger.py`, in the `step` function, `outputs = model.predict(inputs)` is missing 5 layers that I can find when previewing the mlpackage in Xcode. (screenshot attached)
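For anyone trying to narrow down a similar mismatch, one way to see which layers got dropped (a minimal sketch, not the coremltools debugger itself; the tensor names below are made-up placeholders) is to diff the tensor names you expect against the keys `predict` actually returned:

```python
# Sketch: find which expected intermediate tensors are absent from the
# dictionary returned by model.predict(). Tensor names are hypothetical.
def missing_outputs(expected_names, predict_outputs):
    """Return the expected tensor names that predict() did not emit."""
    return sorted(set(expected_names) - set(predict_outputs))

expected = ["conv_1", "concat_2", "attn_out", "ln_final", "logits"]
# Pretend predict() only returned a subset of the requested outputs.
outputs = {"conv_1": None, "logits": None}
print(missing_outputs(expected, outputs))  # → ['attn_out', 'concat_2', 'ln_final']
```

Comparing this diff against the ops visible in Xcode's preview would confirm exactly which intermediate tensors were not captured.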

When the `_adjust_concat_surrounding_activation_stats` function in `_post_training_quantization.py` reached the line `group_rmin_list.append(activation_stats_dict[tensor_name]["rmin"])`, it crashed, because the five lost layers were in `concat_list_adjusted`/`concat_op_info_list` but not in `activation_stats_dict`. (screenshot attached)
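The failure mode can be reproduced in miniature: indexing a dict with a tensor name that was never calibrated raises `KeyError`. A self-contained sketch (made-up tensor names, not the real coremltools internals):

```python
# Minimal reproduction: a tensor listed for a concat group has no entry
# in the calibration stats, so the plain dict lookup raises KeyError.
activation_stats_dict = {
    "tensor_a": {"rmin": -1.0, "rmax": 1.0},
    # "tensor_b" was skipped during calibration, so it has no stats entry.
}
concat_group = ["tensor_a", "tensor_b"]

group_rmin_list = []
try:
    for tensor_name in concat_group:
        group_rmin_list.append(activation_stats_dict[tensor_name]["rmin"])
except KeyError as exc:
    # This is the reported crash: the loop dies on the missing tensor.
    print(f"crash on missing key: {exc}")
```

Only the calibrated tensor's `rmin` makes it into the list before the lookup for the uncalibrated tensor blows up.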

I'm currently not sure whether this is a bug. What can I do about it?

DawerG commented 3 weeks ago

Thanks @FrancisCaig for reporting this.

Can you please share a script that we can use to reproduce the issue on our side to look further into it?

FrancisCaig commented 2 weeks ago

> Thanks @FrancisCaig for reporting this.
>
> Can you please share a script that we can use to reproduce the issue on our side to look further into it?

It's hard to provide a simple demo for it, because I only hit this when doing activation quantization on my own model. Any idea why the result of `predict` is missing some layers? BTW, I updated the first screenshot for the predict code part.

FrancisCaig commented 2 weeks ago

Something new here. See the comment part in the screenshot. Some intermediate tensors (the 5 layers I mentioned before) "cannot be appended to outputs since the type is not valid as an output data type", so the newly created model loses those five layers. The outputs are saved in `activation_stats_dict`. The tricky part is that in `_post_training_quantization.py`, the `_adjust_concat_surrounding_activation_stats` function builds `concat_op_info_list` from all intermediate tensors. When the code then evaluates `activation_stats_dict[tensor_name]` for one of the lost tensor names, it crashes. I tried removing the lost layers, but that also caused other problems in transformer compilation. Any further ideas? (screenshot attached)

junpeiz commented 1 week ago

Hey @FrancisCaig , thank you for reporting this bug.

You are right, some layers will be skipped, which is expected because not all layers will be calibrated. To work around the crash caused by `activation_stats_dict[tensor_name]`, you could do something like

            # Some tensor_name may not have rmin/rmax if the calibration failed before.
            if "rmin" in activation_stats_dict[tensor_name]:
                group_rmin_list.append(activation_stats_dict[tensor_name]["rmin"])
                group_rmax_list.append(activation_stats_dict[tensor_name]["rmax"])
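Applied over a whole concat group, that guard amounts to skipping uncalibrated tensors when collecting the group's range. A self-contained sketch of the same idea (hypothetical stats dict, not the real coremltools code path):

```python
# Sketch of the guarded collection: tensors without calibrated
# rmin/rmax are skipped instead of raising KeyError.
def collect_group_ranges(tensor_names, activation_stats_dict):
    group_rmin_list, group_rmax_list = [], []
    for tensor_name in tensor_names:
        stats = activation_stats_dict.get(tensor_name, {})
        # Some tensor_name may not have rmin/rmax if calibration skipped it.
        if "rmin" in stats:
            group_rmin_list.append(stats["rmin"])
            group_rmax_list.append(stats["rmax"])
    return group_rmin_list, group_rmax_list

stats = {"t0": {"rmin": -2.0, "rmax": 2.0}, "t1": {}}  # t1 uncalibrated
print(collect_group_ranges(["t0", "t1", "t2"], stats))  # → ([-2.0], [2.0])
```

Note this silently narrows the group statistics to the calibrated tensors, which, as the follow-up below shows, may not be enough on its own if downstream passes still expect those tensors.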
FrancisCaig commented 6 days ago

> Hey @FrancisCaig , thank you for reporting this bug.
>
> You are right, some layers will be skipped, which is expected because not all layers will be calibrated. To work around the crash caused by `activation_stats_dict[tensor_name]`, you could do something like
>
>     # Some tensor_name may not have rmin/rmax if the calibration failed before.
>     if "rmin" in activation_stats_dict[tensor_name]:
>         group_rmin_list.append(activation_stats_dict[tensor_name]["rmin"])
>         group_rmax_list.append(activation_stats_dict[tensor_name]["rmax"])

You mean skipping the process for those specific tensors? I tried it, and everything runs smoothly, but then it throws an error and I can't run `predict` with the converted model. I've attached a screenshot. It turns out something in the model is corrupted this way, and model compilation fails. (screenshot attached)