apple / coremltools

Core ML tools contain supporting tools for Core ML model conversion, editing, and validation.
https://coremltools.readme.io
BSD 3-Clause "New" or "Revised" License

Malloc error converting sklearn decision tree #214

Open mci-s opened 6 years ago

mci-s commented 6 years ago

I've trained a model using scikit-learn's DecisionTreeClassifier on a dataset with 1,600,000 rows and 15 features, with max_depth=7. When I try to convert a model larger than about 2 MB with coremltools, I get the following error:

malloc: *** error for object 0x7fb096a0f738: incorrect checksum for freed object - object was probably modified after being freed. *** set a breakpoint in malloc_error_break to debug

I might be doing this part wrong, but when I try to log and debug, I get:

Segmentation fault: 11
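
One way to get more context than a bare "Segmentation fault: 11" is Python's standard faulthandler module, which writes the Python-level traceback when the process receives a fatal signal. This is a minimal sketch, assuming Python 3.3 or later (the module is not in the Python 2.7 standard library). The log file name and the run_conversion function are hypothetical placeholders, not part of coremltools. The traceback only shows which Python call was running at the time of the crash, not the native frame that corrupted memory.

```python
import faulthandler

# Keep this file object open for the life of the process: faulthandler
# writes to its file descriptor when a fatal signal arrives.
# The file name is an arbitrary choice for this sketch.
crash_log = open("crash_traceback.log", "w")

# Dump the Python traceback of all threads on SIGSEGV, SIGABRT, etc.
faulthandler.enable(file=crash_log, all_threads=True)


def run_conversion():
    # Hypothetical placeholder: put the coremltools conversion call here.
    return "converted"


result = run_conversion()
```

The same handler can be switched on without editing the script by running it as python -X faulthandler script.py, which writes the traceback to stderr.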

To recreate:

import random
import string

import numpy as np
import pandas as pd
import coremltools
from sklearn.tree import DecisionTreeClassifier

# 1,000 samples of 15 random binary features each
X = np.random.choice([0, 1], size=(15*1000,), p=[1./3, 2./3])
X = np.split(X, 1000)

# One random 5-letter label per sample, so nearly every sample is its own class
y = []
for i in range(0, 1000):
    # string.lowercase exists only in Python 2; use string.ascii_lowercase on Python 3
    label = ''.join(random.choice(string.lowercase) for _ in range(5))
    y.append(label)

clf = DecisionTreeClassifier()
clf.fit(X, y)

d = {'arr': X, 'str': y}
df = pd.DataFrame(data=d)

# "arr" is the input feature name, "str" is the output feature name
coreml_model = coremltools.converters.sklearn.convert(clf, "arr",
    "str") ## malloc error
coreml_model.save('mymodel.mlmodel')

Nurka11 commented 5 years ago

@mci-s Hi, did you find a solution for this? I have the same problem, and I think it's related to RAM. If you have a solution, please share it.

mci-s commented 5 years ago

@Nurka11 Yes, it was a memory issue. I ultimately went with a solution that used significantly less memory. You can set max_leaf_nodes to limit the tree size to some degree, but the .mlmodel file will still be very large. I don't believe there is a fix that doesn't involve changing your approach.
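
To illustrate the max_leaf_nodes suggestion, here is a rough sketch that compares an unbounded tree with a capped one on synthetic data. The data shape, the 50 labels, and the cap of 64 are arbitrary assumptions chosen for the example. It reports only the size of the scikit-learn tree (tree_.node_count and the shape of tree_.value); it does not measure the resulting .mlmodel file or run the coremltools conversion.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.randint(0, 2, size=(1000, 15))  # 1,000 rows, 15 binary features
y = rng.randint(0, 50, size=1000)       # 50 synthetic class labels (assumption)

unbounded = DecisionTreeClassifier(random_state=0).fit(X, y)
bounded = DecisionTreeClassifier(max_leaf_nodes=64, random_state=0).fit(X, y)

for name, clf in [("unbounded", unbounded), ("max_leaf_nodes=64", bounded)]:
    tree = clf.tree_
    # tree_.value stores per-class values for every node, so its size
    # grows with both the node count and the number of classes.
    print(name, "nodes:", tree.node_count, "value shape:", tree.value.shape)
```

A binary tree with at most 64 leaves has at most 127 nodes, so the cap bounds the node count directly, but every node still carries one value per class, which is why many classes keep the model large.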

TobyRoseman commented 3 years ago

This is still an issue with coremltools 4.1, macOS 11.3 beta, and scikit-learn 0.19.2. If you're not using Python 2.7, you need to change string.lowercase to string.ascii_lowercase.

I'm not sure this is an issue worth looking into. The synthetic data generated by the example has 1,000 different labels. I don't think a decision tree that predicts 1,000 different classes is a practical case.
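
The label count can be checked without training anything. The sketch below regenerates labels the same way the reproduction script does, using string.ascii_lowercase so it runs on both Python 2.7 and Python 3. The fixed seed is an assumption added here for repeatability; the original script does not set one.

```python
import random
import string

random.seed(0)  # assumption: fixed seed so repeated runs give the same labels

# Same label scheme as the reproduction script: 1,000 random 5-letter strings.
labels = [
    "".join(random.choice(string.ascii_lowercase) for _ in range(5))
    for _ in range(1000)
]

n_classes = len(set(labels))
print("rows:", len(labels), "distinct labels:", n_classes)
```

With 26**5 (about 11.9 million) possible strings, collisions among 1,000 draws are rare, so nearly every row ends up as its own class.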