dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License

LightGBM.Fit() exits with 139 on osx-arm64 #6966

Closed: gijs-koehorst closed this issue 5 months ago

gijs-koehorst commented 5 months ago

System Information:

Describe the bug

ML.LightGBM does not include a binary for osx-arm64 by default. To work around this, we manually built a lib_lightgbm.dylib binary for arm64 and copied it to /bin/Debug/net7.0/runtimes/osx-arm64/native. This worked for ML.NET 2.0.0 and ML.LightGBM 2.0.0. However, after upgrading to the above versions and manually adding the lib_lightgbm.dylib of LightGBM 4.2.0, calling BinaryClassificationModel.Trainers.LightGbm.Fit() exits with exit code 139 and no error message.

To Reproduce

Steps to reproduce the behavior:

  1. Add lib_lightgbm.dylib of LightGBM 4.2.0 to /bin/Debug/net7.0/runtimes/osx-arm64/native
  2. Create a BinaryClassificationModel.Trainers.LightGbm
  3. Call Fit() with some data

Expected behavior

I would expect the LightGBM model to simply train on the data.

ericstj commented 5 months ago

Perhaps this has to do with the version of the LightGBM codebase you are building? @michaelgsharp updated our version and had to react to a breaking change in https://github.com/dotnet/machinelearning/pull/6880.

If you were using an earlier version (that worked with 2.0.0 as you suggest) then that would be broken with 3.0.0. Could you update your codebase and see if that fixes things?

ghost commented 5 months ago

This issue has been marked needs-author-action and may be missing some important information.

gijs-koehorst commented 5 months ago

@ericstj as mentioned in my initial comment, it breaks when using the latest versions:

So I am not sure what more I can update in my codebase?

ericstj commented 5 months ago

It's possible that the latest version of the LightGBM codebase will not work. It definitely wouldn't have worked with 2.0.0, because of the breaking changes we already had to adjust to. We haven't yet tested the latest version of the LightGBM codebase, and it may have more breaking changes. We're currently on 3.3.5: https://github.com/dotnet/machinelearning/blob/54fa44fbf803037a2c1f678052f71491b01c0761/eng/Versions.props#L38. If this method worked for you before, you might try building from their 3.3.5 tag and see if that works.

gijs-koehorst commented 5 months ago

@ericstj I set the versions as follows:

<PackageReference Include="Microsoft.ML" Version="3.0.1"/>
<PackageReference Include="LightGBM" Version="3.3.5"/>

I built the 3.3.5 lib_lightgbm.dylib from source and included it in bin.

Now LightGBM().Fit() gives "[LightGBM] [Fatal] Unknown importance type: only support split=0 and gain=1" and exits with 138.
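For readers wondering what the two exit codes in this thread (139 and 138) mean: shells conventionally report a process killed by signal N as exit status 128 + N. The sketch below decodes a code under that assumption. The helper name `describe_exit_code` is made up for illustration, and signal numbering is platform-specific, so it should be run on the machine where the crash happened.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Interpret a shell-style exit status.

    Assumes the common convention that a status above 128 means
    "terminated by signal (status - 128)".
    """
    if code <= 128:
        return f"exit status {code} (not a signal)"
    signum = code - 128
    try:
        # Signal names are looked up on the current platform, so the
        # same number can map to different names on macOS and Linux.
        name = signal.Signals(signum).name
    except ValueError:
        name = "unknown signal"
    return f"signal {signum} ({name})"


print(describe_exit_code(139))
```

Under this convention, 139 corresponds to signal 11, which is SIGSEGV (a segmentation fault) on both macOS and Linux. That would be consistent with a native library being called through an interface it does not match, although the thread does not confirm the exact cause. Signal 10, which 138 maps to, is SIGBUS on macOS but SIGUSR1 on Linux.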

ericstj commented 5 months ago

Can you double-check your reference to Microsoft.ML.LightGBM to ensure it's also 3.0.1? I see from previous issues that this might be related to using a 3.x LightGBM package with an older Microsoft.ML.LightGBM: https://github.com/dotnet/machinelearning/issues/5447

If it's possible to share your simplified repro, or even just the project.assets.json, we might be able to spot any other version discrepancies.
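A minimal sketch of how such a check could be done locally, assuming the usual NuGet layout in which project.assets.json (normally under obj/) has a top-level "libraries" object whose keys look like "PackageName/Version". The function name `resolved_versions` and the sample data are invented for illustration; the 2.0.0 entry is a made-up mismatch, not a value taken from this thread.

```python
import json
from pathlib import Path

# Packages discussed in this thread; NuGet package IDs are case-insensitive.
PACKAGES = ("Microsoft.ML", "Microsoft.ML.LightGbm", "LightGBM")


def resolved_versions(assets: dict, names=PACKAGES) -> dict:
    """Map each requested package name to the versions listed under 'libraries'."""
    wanted = {n.lower(): n for n in names}
    found = {n: [] for n in names}
    for key in assets.get("libraries", {}):
        # Keys are assumed to have the form "PackageName/Version".
        name, _, version = key.partition("/")
        if name.lower() in wanted:
            found[wanted[name.lower()]].append(version)
    return found


def load_assets(path: str) -> dict:
    """Read a project.assets.json file from disk."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


# Hypothetical in-memory sample standing in for a real assets file.
sample = {
    "libraries": {
        "Microsoft.ML/3.0.1": {},
        "Microsoft.ML.LightGbm/2.0.0": {},  # illustrative mismatch only
        "LightGBM/3.3.5": {},
    }
}
print(resolved_versions(sample))
```

For a real project, `resolved_versions(load_assets("obj/project.assets.json"))` would show whether Microsoft.ML and Microsoft.ML.LightGbm resolved to the same version.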

gijs-koehorst commented 5 months ago

Setting Microsoft.ML.LightGBM explicitly to 3.0.1 fixed the issue:

<PackageReference Include="Microsoft.ML" Version="3.0.1"/>
<PackageReference Include="LightGBM" Version="3.3.5"/>
<PackageReference Include="Microsoft.ML.LightGbm" Version="3.0.1" />

Thank you a lot for your help @ericstj.

ericstj commented 5 months ago

Thank you for sticking with me to find the problem. I think the fact that 4.2.0 created problems might mean that we have some breaking changes to react to in LightGBM (assuming you were using the latest ML.LightGBM package in that case). We'll get to those when updating LightGBM next.