dotnet / machinelearning-modelbuilder

Simple UI tool to build custom machine learning models.
Creative Commons Attribution 4.0 International

Model Builder might produce erroneous models because `allowQuoting` cannot be specified. #2648

Open lgong-rms opened 1 year ago

lgong-rms commented 1 year ago

Describe the bug

        public static IDataView LoadIDataViewFromFile(MLContext mlContext, string inputDataFilePath, char separatorChar, bool hasHeader)
        {
            return mlContext.Data.LoadFromTextFile<ModelInput>(inputDataFilePath, separatorChar, hasHeader);
        }

The allowQuoting parameter of the LoadFromTextFile() method is false by default, so the text is read incorrectly if the input text file includes separators within double-quoted values. However, Model Builder issues no warning in such a case and completes the training with an erroneous model.
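As a possible workaround in the generated code, the loader above could pass allowQuoting explicitly. This is a minimal sketch, not the code Model Builder emits: the ModelInput class here is a hypothetical two-column stand-in for the generated one, and the DataLoading class name is made up for illustration. Only the LoadFromTextFile parameters (separatorChar, hasHeader, allowQuoting) come from ML.NET itself.

```csharp
using Microsoft.ML;
using Microsoft.ML.Data;

// Hypothetical stand-in for the ModelInput class that Model Builder generates.
public class ModelInput
{
    [LoadColumn(0)]
    public string Text { get; set; } = string.Empty;

    [LoadColumn(1)]
    public string Label { get; set; } = string.Empty;
}

public static class DataLoading
{
    public static IDataView LoadIDataViewFromFile(
        MLContext mlContext, string inputDataFilePath, char separatorChar, bool hasHeader)
    {
        // allowQuoting: true asks the loader to treat a double-quoted value as a
        // single field even when it contains the separator character.
        return mlContext.Data.LoadFromTextFile<ModelInput>(
            inputDataFilePath,
            separatorChar: separatorChar,
            hasHeader: hasHeader,
            allowQuoting: true);
    }
}
```

Hard-coding true is only reasonable when the dataset is known to use double-quoted fields; exposing it as a parameter would be the more general choice.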

To Reproduce

Steps to reproduce the behavior: modify one of the text classification Model Builder tutorials by including separators within double-quoted text values and putting the label column after the feature columns. The label column will then be corrupted, because each separator inside a quoted value is treated as a column boundary and shifts the columns that follow it.
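The column shift described above can be illustrated without ML.NET. The sketch below is not the ML.NET parser; it only contrasts a split that ignores quotes with one that honors them, using a made-up row where the label follows the text feature. All names (QuotingDemo, SplitNaive, SplitQuoted) are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class QuotingDemo
{
    // Splits on every separator and ignores quotes entirely.
    public static string[] SplitNaive(string line, char separator) => line.Split(separator);

    // Splits only on separators outside double quotes.
    // A doubled quote ("") inside a quoted value is read as one literal quote.
    public static string[] SplitQuoted(string line, char separator)
    {
        var fields = new List<string>();
        var current = new StringBuilder();
        bool inQuotes = false;

        for (int i = 0; i < line.Length; i++)
        {
            char c = line[i];
            if (c == '"')
            {
                if (inQuotes && i + 1 < line.Length && line[i + 1] == '"')
                {
                    current.Append('"');
                    i++; // skip the second quote of the escaped pair
                }
                else
                {
                    inQuotes = !inQuotes;
                }
            }
            else if (c == separator && !inQuotes)
            {
                fields.Add(current.ToString());
                current.Clear();
            }
            else
            {
                current.Append(c);
            }
        }

        fields.Add(current.ToString());
        return fields.ToArray();
    }

    public static void Main()
    {
        // Made-up row: text feature first, label last.
        string row = "\"Great product, fast shipping\",positive";

        string[] naive = SplitNaive(row, ',');
        string[] quoted = SplitQuoted(row, ',');

        Console.WriteLine($"naive:  {naive.Length} fields, column 1 = [{naive[1]}]");
        Console.WriteLine($"quoted: {quoted.Length} fields, column 1 = [{quoted[1]}]");
    }
}
```

With the quote-unaware split, the row yields three fields instead of two, so the position where the label is expected holds a fragment of the text instead.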

LittleLittleCloud commented 1 year ago

Thanks for the suggestion. We'll add an allowQuoting flag in the next release.

lgong-rms commented 1 year ago

It seems that the Model Builder CLI does not provide an option for allowQuoting either, so it needs to be added there as well:

>mlnet classification
Option '--dataset' is required.
Option '--label-col' is required.

classification
  Train a custom ML.NET model for classification. Learn more about classification at aka.ms/cli-classification.

Usage:
  mlnet [options] classification

Options:
  --dataset <dataset> (REQUIRED)             File path to single dataset or training dataset for train/test approaches.
  --label-col <label-col> (REQUIRED)         Name or zero-based index of label (target) column to predict.
  --cache <Auto|Off|On>                      Specify [On|Off|Auto] for cache to be turned on, off, or auto-determined (default). [default: Auto]
  --cv-fold <cv-fold>                        Number of folds used for cross-validation. Don't specify if --split-ratio or --validation-dataset are set.
  --has-header                               Specify [true|false] depending if dataset file(s) have header row. Use auto-detect if this flag is not set.
  --ignore-cols <ignore-cols>                Specify columns to be ignored in given dataset. Use space-seperated column names or zero-based indexes.
  --log-file-path <log-file-path>            Path to log file.
  --name <name>                              Name for output project or solution to create. Default is SampleClassification. [default: SampleClassification]
  -o, --output <output>                      Location folder for generated output. Default is current directory.
  --split-ratio <split-ratio>                Percent of dataset to use for validation. Range must be between 0 and 1. Don't specify if --cv-fold or --validation-dataset are set.
  --train-time <train-time>                  Maximum time in seconds for exploring models with best configuration. Default time is 100 sec. [default: 100]
  --validation-dataset <validation-dataset>  File path for validation dataset in train/validation approaches.
  -v, --verbosity <verbosity>                Output verbosity choices: q[uiet], m[inimal] (default) and diag[nostic]. [default: m]

Required options: --dataset, --label-col

LittleLittleCloud commented 1 year ago

Action items

LittleLittleCloud commented 1 year ago

@lgong-rms I failed to reproduce the error on a dataset with quotes. The dataset preview page looks correct when loading a dataset that contains quotes.

Image

Do you just mean that the generated code might fail to load the dataset because allowQuoting is missing?

luisquintanilla commented 1 year ago

Next steps:

LittleLittleCloud commented 1 year ago

Also

v-Hailishi commented 1 year ago

@LittleLittleCloud Verified on the latest main build 17.17.0.2337501: "allowQuoting" has been added successfully to the advanced data options.

Image

But I didn't see any difference between choosing "Yes" and "No". Could you provide detailed datasets and checkpoints for verifying this function?

LittleLittleCloud commented 7 months ago

@v-Hailishi That probably means the test dataset you used doesn't contain any quoted fields, so the parsing result is the same either way.
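One way to check the flag outside the UI is sketched below. It writes a small made-up dataset whose text column contains the separator inside double quotes, loads it with allowQuoting off and on, and prints the Label column each time. The ReviewRow class, the AllowQuotingCheck class, the file name, and the rows are all hypothetical; only LoadFromTextFile and CreateEnumerable are ML.NET APIs.

```csharp
using System;
using System.IO;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Data;

// Hypothetical row type: text feature first, label last.
public class ReviewRow
{
    [LoadColumn(0)]
    public string Text { get; set; } = string.Empty;

    [LoadColumn(1)]
    public string Label { get; set; } = string.Empty;
}

public static class AllowQuotingCheck
{
    // Loads the file with the given allowQuoting setting and returns the Label values.
    public static string[] LoadLabels(MLContext mlContext, string path, bool allowQuoting)
    {
        IDataView data = mlContext.Data.LoadFromTextFile<ReviewRow>(
            path, separatorChar: ',', hasHeader: true, allowQuoting: allowQuoting);

        return mlContext.Data
            .CreateEnumerable<ReviewRow>(data, reuseRowObject: false)
            .Select(row => row.Label)
            .ToArray();
    }

    public static void Main()
    {
        // Made-up dataset: two rows have a comma inside a double-quoted text value.
        string path = Path.Combine(Path.GetTempPath(), "allow-quoting-check.csv");
        File.WriteAllLines(path, new[]
        {
            "Text,Label",
            "\"Great product, fast shipping\",positive",
            "\"Broke after a week, would not buy again\",negative",
            "Plain text without separators,positive",
        });

        var mlContext = new MLContext(seed: 0);
        foreach (bool allowQuoting in new[] { false, true })
        {
            string labels = string.Join(" | ", LoadLabels(mlContext, path, allowQuoting));
            Console.WriteLine($"allowQuoting={allowQuoting}: {labels}");
        }
    }
}
```

If the flag takes effect, the labels printed for the two quoted rows should differ between the two runs, while the unquoted row should read the same in both. A dataset with no quoted fields gives no way to tell the two settings apart.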