dotnet / machinelearning-modelbuilder

Simple UI tool to build custom machine learning models.
Creative Commons Attribution 4.0 International

Spam/ham fails via CLI #1672

Closed beccamc closed 2 years ago

beccamc commented 3 years ago

I'm guessing the problem is that it's "boolean" classification, but using spam/ham instead of true/false or 0/1.

From the log file:

2021-08-03 17:37:55.2373 DEBUG System.FormatException: String ' available trainer: LGBM, RF, FASTTREE, LBFGS, SDCA ' was not recognized as a valid Boolean.
   at System.Boolean.Parse(ReadOnlySpan`1 value)
   at System.Boolean.Parse(String value)

Spam dataset

beccamc commented 3 years ago

@LittleLittleCloud FYI on this issue.

LittleLittleCloud commented 3 years ago

It's related to a feature flag; I'll fix this.

LittleLittleCloud commented 3 years ago

The error "2021-08-03 17:37:55.2373 DEBUG System.FormatException: String ' available trainer: LGBM, RF, FASTTREE, LBFGS, SDCA ' was not recognized as a valid Boolean." is not fatal. It just causes the FF manager to return false and disable the affected functionality.
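As a rough illustration of the non-fatal behavior described above, the sketch below parses a flag value strictly and falls back to false when the text is not a boolean. It is written in Python for brevity. `parse_flag` and the spellings it accepts are hypothetical and are not the actual feature flag manager code.

```python
def parse_flag(raw: str) -> bool:
    """Return the boolean value of a feature flag, or False if unparseable.

    Hypothetical helper: a strict true/false parse whose failure is
    swallowed, so the feature is simply treated as disabled.
    """
    text = raw.strip().lower()
    if text == "true":
        return True
    if text == "false":
        return False
    # Anything else (for example a trainer list) is not a valid boolean:
    # treat the flag as off instead of raising.
    return False


print(parse_flag("true"))
print(parse_flag(" available trainer: LGBM, RF, FASTTREE, LBFGS, SDCA "))
```

With this kind of fallback, a malformed flag value shows up only as a logged exception and a disabled feature, which matches the observation that the logged FormatException is not what breaks training.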

The fatal error is in data processing, where PROSE determines that the dataset has no header ("false") when it actually has one. However, even setting --has-header=true doesn't resolve this. I believe it's because mlnet (or the data processing engine) doesn't respect that flag for some reason.

Meanwhile, Model Builder also can't train on that dataset. First, it shows that the dataset has 5 columns when it only has two. After setting the header flag to true, it still shows 5 headers and throws "An item with same key has already been added". Training also fails.

image

LittleLittleCloud commented 3 years ago

@beccamc Can you fix this on the Model Builder side? I'll take over the mlnet CLI side.

vzhuqin commented 2 years ago

@beccamc Model Builder still fails with the error "An item with the same key has already been added." on ML.NET Model Builder 16.9.1.2155901 (Main).
Column Headers: No image.png
Column Headers: Yes image.png

The issue does not reproduce on mlnet 16.9.2. image.png

beccamc commented 2 years ago

The problem here isn't the spam/ham thing, but that there are three empty columns.

image.png
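To see why empty columns can end in a duplicate-key error, here is a minimal sketch in Python that inspects a CSV header for blank and repeated names. The helper `find_header_problems` and the sample rows are made up for illustration; the real dataset's column names and the loader's internals are not shown in this thread.

```python
import csv
import io
from collections import Counter


def find_header_problems(csv_text: str) -> dict:
    """Report blank and repeated column names in the first row of a CSV."""
    header = next(csv.reader(io.StringIO(csv_text)))
    names = [name.strip() for name in header]
    counts = Counter(names)
    return {
        "columns": len(names),
        "blank": [i for i, name in enumerate(names) if name == ""],
        "duplicates": sorted(name for name, n in counts.items() if n > 1),
    }


# Made-up sample: two named columns followed by three empty ones.
sample = "label,text,,,\nham,See you at five,,,\nspam,You won a prize,,,\n"
print(find_header_problems(sample))
```

In the sample, the three trailing columns all have the empty string as their name, so any code that keys columns by name would see the same key more than once.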

beccamc commented 2 years ago

Two things to verify...

  1. As stated above, the issue isn't actually with the spam/ham data. The problem is that the linked spam dataset has empty columns. PR 1347 adds an error message that makes a bit more sense. Note that for this dataset, the column name will be empty in the error message (it will say "Please remove or rename column ."). I think this is enough info for the customer to open their file, or enough info for us to figure out the problem.

image

  2. A spam/ham dataset without the extra columns works fine. See attached file spam2.csv for testing.

vzhuqin commented 2 years ago

Verified this issue on the latest main (16.9.1.2160801). For "Column Headers: Yes", the following message is shown: "Cannot have multiple columns with same name. Please rename or remove column." image.png

For "Column Headers: No", and for the dataset without extra columns, training completes. image.png

image.png
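For anyone hitting the same message, one way to get a file like the one without extra columns is to strip all-blank columns before loading it. The following is a minimal sketch in Python; `drop_empty_columns` is a hypothetical helper and is not part of Model Builder or the mlnet CLI.

```python
import csv
import io


def drop_empty_columns(csv_text: str) -> str:
    """Return the CSV with every column removed whose cells are all blank."""
    # Skip completely empty lines, which csv.reader yields as [].
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    if not rows:
        return ""
    width = max(len(row) for row in rows)
    # Pad short rows so every row can be indexed by column position.
    rows = [row + [""] * (width - len(row)) for row in rows]
    # Keep a column only if at least one cell (header included) has text.
    keep = [i for i in range(width) if any(row[i].strip() for row in rows)]
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in rows:
        writer.writerow([row[i] for i in keep])
    return out.getvalue()


# Made-up sample: two named columns followed by three empty ones.
sample = "label,text,,,\nham,See you at five,,,\nspam,You won a prize,,,\n"
print(drop_empty_columns(sample))
```

A column with a blank header but real data in its rows is kept by this sketch, so it would still need to be renamed by hand.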