Model Builder might produce erroneous models because `allowQuoting` cannot be specified.

lgong-rms commented 1 year ago

System Information (please complete the following information):

Model Builder Version (available in Manage Extensions dialog): 17.14.4.2312404
Visual Studio Version: 17.6.0

Describe the bug

On which step of the process did you run into an issue: Data
Clear description of the problem: There is nowhere to specify allowQuoting when loading from a text file, so the generated code is:

        public static IDataView LoadIDataViewFromFile(MLContext mlContext, string inputDataFilePath, char separatorChar, bool hasHeader)
        {
            return mlContext.Data.LoadFromTextFile<ModelInput>(inputDataFilePath, separatorChar, hasHeader);
        }

The allowQuoting parameter of the LoadFromTextFile() method is false by default, so the text will be read incorrectly if the input file contains separators within double-quoted values. However, Model Builder issues no warning in such cases and completes the training with an erroneous model.
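For reference, here is a minimal sketch of what the generated loader could look like if the flag were passed through. `LoadFromTextFile` already accepts an `allowQuoting` argument; the extra `allowQuoting` parameter on the generated method is an assumption, not something Model Builder emits in the version above, and `ModelInput` is a hypothetical two-column stand-in for the generated class.

```csharp
using Microsoft.ML;
using Microsoft.ML.Data;

// Hypothetical stand-in for the ModelInput class that Model Builder generates.
public class ModelInput
{
    [LoadColumn(0)]
    public string Text { get; set; }

    [LoadColumn(1)]
    public string Label { get; set; }
}

public static class DataLoader
{
    // Assumed signature: the generated method plus an allowQuoting parameter.
    public static IDataView LoadIDataViewFromFile(
        MLContext mlContext, string inputDataFilePath,
        char separatorChar, bool hasHeader, bool allowQuoting)
    {
        // allowQuoting: true keeps separators inside double-quoted values
        // as part of the value instead of splitting on them.
        return mlContext.Data.LoadFromTextFile<ModelInput>(
            inputDataFilePath,
            separatorChar: separatorChar,
            hasHeader: hasHeader,
            allowQuoting: allowQuoting);
    }
}
```

Until the generated code exposes the flag, editing the generated call by hand to pass `allowQuoting: true` should have the same effect.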

To Reproduce: Modify one of the text classification Model Builder tutorials so that the text values contain separators within double quotes and the label column comes after the feature columns. The label column will then be read incorrectly because of the separators within the double-quoted text values.
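To illustrate why the label column shifts, here is a small self-contained sketch. It is not ML.NET's actual parser, only a simplified comparison of splitting a row with and without honoring double quotes; the class name, method names, and the sample row are all made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class QuotingDemo
{
    // Naive split: every separator is a field boundary, quotes are ignored.
    public static string[] SplitIgnoringQuotes(string line, char separator)
        => line.Split(separator);

    // Quote-aware split: separators inside double quotes stay in the field,
    // and a doubled quote ("") inside a quoted field is an escaped quote.
    public static string[] SplitHonoringQuotes(string line, char separator)
    {
        var fields = new List<string>();
        var current = new StringBuilder();
        bool inQuotes = false;
        for (int i = 0; i < line.Length; i++)
        {
            char c = line[i];
            if (c == '"')
            {
                if (inQuotes && i + 1 < line.Length && line[i + 1] == '"')
                {
                    current.Append('"'); // escaped quote inside a quoted field
                    i++;
                }
                else
                {
                    inQuotes = !inQuotes; // opening or closing quote
                }
            }
            else if (c == separator && !inQuotes)
            {
                fields.Add(current.ToString());
                current.Clear();
            }
            else
            {
                current.Append(c);
            }
        }
        fields.Add(current.ToString());
        return fields.ToArray();
    }

    public static void Main()
    {
        // Hypothetical row: feature text first, label last.
        string row = "\"great food, slow service\",positive";
        string[] naive = SplitIgnoringQuotes(row, ',');
        string[] aware = SplitHonoringQuotes(row, ',');
        Console.WriteLine($"naive: {naive.Length} fields, column 1 = [{naive[1]}]");
        Console.WriteLine($"aware: {aware.Length} fields, column 1 = [{aware[1]}]");
    }
}
```

With the naive split, the row yields three fields and the second column holds a fragment of the text rather than the label, which is the kind of shift described above.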

Expected behavior

Add an option to specify allowQuoting at the Data step, and
issue a warning/error if it is set to false but the input does include separators within double-quoted values.
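A rough sketch of how such a warning check might work is below. This is only an assumption about one possible heuristic, not how Model Builder does or will do it; `QuotingCheck` and its methods are hypothetical names. It scans each line and reports the first one where a separator appears between an opening and a closing double quote.

```csharp
using System;
using System.Collections.Generic;

public static class QuotingCheck
{
    // True if a separator occurs while inside a double-quoted value.
    // A doubled quote ("") toggles the state twice, so it leaves it unchanged.
    public static bool HasSeparatorInsideQuotes(string line, char separator)
    {
        bool inQuotes = false;
        foreach (char c in line)
        {
            if (c == '"')
            {
                inQuotes = !inQuotes;
            }
            else if (c == separator && inQuotes)
            {
                return true;
            }
        }
        return false;
    }

    // Returns the 1-based number of the first suspicious line,
    // or 0 if allowQuoting is true or no such line exists.
    public static int FirstSuspectLine(IEnumerable<string> lines, char separator, bool allowQuoting)
    {
        if (allowQuoting)
        {
            return 0; // quoted separators are handled by the loader
        }
        int lineNumber = 0;
        foreach (string line in lines)
        {
            lineNumber++;
            if (HasSeparatorInsideQuotes(line, separator))
            {
                return lineNumber;
            }
        }
        return 0;
    }
}
```

A caller could run this over the first few hundred lines of the dataset at the Data step and show a warning when the result is non-zero. Quoted values that span multiple lines are not handled by this line-by-line check.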

LittleLittleCloud commented 1 year ago

Thanks for the suggestion. We'll add an allowQuoting flag in the next release.

lgong-rms commented 1 year ago

It seems that the Model Builder CLI does not provide an option for allowQuoting either, so it needs to be added there as well:

>mlnet classification
Option '--dataset' is required.
Option '--label-col' is required.

classification
  Train a custom ML.NET model for classification. Learn more about classification at aka.ms/cli-classification.

Usage:
  mlnet [options] classification

Options:
  --dataset <dataset> (REQUIRED)             File path to single dataset or training dataset for train/test approaches.
  --label-col <label-col> (REQUIRED)         Name or zero-based index of label (target) column to predict.
  --cache <Auto|Off|On>                      Specify [On|Off|Auto] for cache to be turned on, off, or auto-determined (default). [default: Auto]
  --cv-fold <cv-fold>                        Number of folds used for cross-validation. Don't specify if --split-ratio or --validation-dataset are set.
  --has-header                               Specify [true|false] depending if dataset file(s) have header row. Use auto-detect if this flag is not set.
  --ignore-cols <ignore-cols>                Specify columns to be ignored in given dataset. Use space-seperated column names or zero-based indexes.
  --log-file-path <log-file-path>            Path to log file.
  --name <name>                              Name for output project or solution to create. Default is SampleClassification. [default: SampleClassification]
  -o, --output <output>                      Location folder for generated output. Default is current directory.
  --split-ratio <split-ratio>                Percent of dataset to use for validation. Range must be between 0 and 1. Don't specify if --cv-fold or --validation-dataset are set.
  --train-time <train-time>                  Maximum time in seconds for exploring models with best configuration. Default time is 100 sec. [default: 100]
  --validation-dataset <validation-dataset>  File path for validation dataset in train/validation approaches.
  -v, --verbosity <verbosity>                Output verbosity choices: q[uiet], m[inimal] (default) and diag[nostic]. [default: m]

Required options: --dataset, --label-col

LittleLittleCloud commented 1 year ago

Action items

[x] add "allow Quote" in advanced data option
[ ] add "allow Quote" in CLI

LittleLittleCloud commented 1 year ago

@lgong-rms I failed to reproduce the error on a dataset with quotes. The dataset preview page looks correct when loading a dataset with quotes in it.

Do you just mean the generated code might fail to load the dataset because allowQuoting is missing?

luisquintanilla commented 1 year ago

Next steps:

Add allowQuoting to data loading code-behind

LittleLittleCloud commented 1 year ago

Also

add allowQuoting in model builder UI

v-Hailishi commented 1 year ago

@LittleLittleCloud Verified on the latest main build 17.17.0.2337501, "allowQuoting" has been added successfully in advanced data option.

But I didn't see any difference between choosing "Yes" and "No". Could you provide sample datasets and checkpoints for verifying this function?

LittleLittleCloud commented 7 months ago

@v-Hailishi That probably means the test dataset you used parses the same either way because it doesn't contain any quoted fields.

dotnet / machinelearning-modelbuilder

model builder might produce erroneous models due to not able to specify `allowQuoting`. #2648