apple / coremltools

Core ML tools contain supporting tools for Core ML model conversion, editing, and validation.
https://coremltools.readme.io
BSD 3-Clause "New" or "Revised" License

Race condition writing and loading MLModel in Coremltools 8.1 #2404

Open kasper0406 opened 1 day ago

kasper0406 commented 1 day ago

With a Python debugger attached, I have started to see a race condition when loading the converted MLModel. I have not seen this happen without a debugger attached, and I did not hit it before the 8.1 release.

Specifically, it happens in the load_spec function in coremltools/models/utils.py (note that I modified the code to retry in a loop):

[Screenshot 2024-11-20 at 21 58 27]

Looking in my file system, I indeed see that the file isn't there:

[Screenshot 2024-11-20 at 21 59 26]

However, if I run one iteration of the loop (i.e. execute the loading code and let the thread release the Python GIL), the Manifest.json file gets created:

[Screenshot 2024-11-20 at 22 01 01]

Notice the timestamp for Manifest.json is 5 minutes later. The next time the loop executes, the model is usually loaded.
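
For anyone who needs a stopgap while this is investigated, the retry idea above can be sketched roughly as follows. This is only a minimal illustration, not the actual change I made to load_spec: the helper names wait_for_manifest and load_with_retry are hypothetical, and the code assumes the model is saved as an .mlpackage directory with Manifest.json at its top level. It hides the symptom and does not fix the underlying race.

```python
import os
import time


def wait_for_manifest(package_path, timeout=5.0, interval=0.05):
    """Poll until <package_path>/Manifest.json exists.

    Hypothetical helper: returns True if the file showed up before the
    timeout expired, False otherwise.
    """
    manifest = os.path.join(package_path, "Manifest.json")
    deadline = time.monotonic() + timeout
    while True:
        if os.path.isfile(manifest):
            return True
        if time.monotonic() >= deadline:
            return False
        # Sleeping also releases the GIL, giving other threads a chance to run.
        time.sleep(interval)


def load_with_retry(loader, package_path, timeout=5.0):
    """Call loader(package_path) once the manifest is visible on disk.

    `loader` is any callable taking a path, for example
    coremltools.models.utils.load_spec (assumed, not imported here).
    """
    if not wait_for_manifest(package_path, timeout=timeout):
        raise FileNotFoundError(
            f"Manifest.json did not appear in {package_path} within {timeout}s"
        )
    return loader(package_path)
```

Usage would look something like load_with_retry(load_spec, "/path/to/model.mlpackage"), with load_spec imported from coremltools.models.utils. Passing the loader in as a callable keeps the sketch independent of coremltools, so it can be tried with a stub.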

I also noticed that if I step through the loop I added using the debugger, I get the following error:

an integer is required
  File "/Users/knielsen/Library/Application Support/hatch/env/virtual/stablehlo-coreml-experimental/Ux-esJH6/test.py3.12/lib/python3.12/site-packages/_pydevd_sys_monitoring\\_pydevd_sys_monitoring_cython.pyx", line 1367, in _pydevd_sys_monitoring_cython._jump_event
  File "<stringsource>", line 69, in cfunc.to_py.__Pyx_CFunc_7f6725__29_pydevd_sys_monitoring_cython_object__lParen__etc_to_py_4code_11from_offset_9to_offset.wrap
  File "/Users/knielsen/Library/Application Support/hatch/env/virtual/stablehlo-coreml-experimental/Ux-esJH6/test.py3.12/lib/python3.12/site-packages/coremltools/models/utils.py", line 256, in load_spec
    try:
  File "/Users/knielsen/Library/Application Support/hatch/env/virtual/stablehlo-coreml-experimental/Ux-esJH6/test.py3.12/lib/python3.12/site-packages/coremltools/models/model.py", line 531, in _get_proxy_and_spec
    specification = _load_spec(filename)
                    ^^^^^^^^^^^^^^^^^^^^
  File "/Users/knielsen/Library/Application Support/hatch/env/virtual/stablehlo-coreml-experimental/Ux-esJH6/test.py3.12/lib/python3.12/site-packages/coremltools/models/model.py", line 469, in __init__
    self.__proxy__, self._spec, self._framework_error = self._get_proxy_and_spec(

Program to reproduce:

Run the following code with a Python debugger attached using coremltools 8.1:

import coremltools as ct
import numpy as np
from coremltools.converters.mil import Builder as mb

@mb.program(input_specs=[
    mb.TensorSpec(shape=(2, 3, 4, 5)),
    mb.TensorSpec(shape=(2, 4, 3, 5)),
])
def mil_program(arg0, arg1):
    arg0_reshaped = mb.reshape(x=arg0, shape=(1, 120))
    arg1_reshaped = mb.reshape(x=arg1, shape=(1, 120))
    result = mb.matmul(x=arg0_reshaped, y=arg1_reshaped, transpose_x=False, transpose_y=True)
    result = mb.reshape(x=result, shape=(1,))
    return result

cml_model = ct.convert(
    mil_program,
    source="milinternal",
    minimum_deployment_target=ct.target.iOS18,
)

inputs = {
    "arg0": np.random.normal(0.0, 1.0, (2, 3, 4, 5)),
    "arg1": np.random.normal(0.0, 1.0, (2, 4, 3, 5)),
}
predictions = cml_model.predict(inputs)
print(predictions)

I briefly tried to find the bug myself, but the diff of the 8.1 release (https://github.com/apple/coremltools/pull/2394) is huge, so tracking it down would take more effort than I want to spend.

jakesabathia2 commented 1 day ago

@cymbalrush you might have more context on this issue ^

cymbalrush commented 1 day ago

Investigating