Open lgong-rms opened 1 year ago
Thanks for the suggestion, we'll add the allowQuoting flag in the next release.
It seems that the Model Builder CLI does not provide an allowQuoting option either, so it needs to be added there as well:
>mlnet classification
Option '--dataset' is required.
Option '--label-col' is required.
classification
Train a custom ML.NET model for classification. Learn more about classification at aka.ms/cli-classification.
Usage:
mlnet [options] classification
Options:
--dataset <dataset> (REQUIRED) File path to single dataset or training dataset for train/test approaches.
--label-col <label-col> (REQUIRED) Name or zero-based index of label (target) column to predict.
--cache <Auto|Off|On> Specify [On|Off|Auto] for cache to be turned on, off, or auto-determined (default). [default: Auto]
--cv-fold <cv-fold> Number of folds used for cross-validation. Don't specify if --split-ratio or --validation-dataset are set.
--has-header Specify [true|false] depending if dataset file(s) have header row. Use auto-detect if this flag is not set.
--ignore-cols <ignore-cols> Specify columns to be ignored in given dataset. Use space-seperated column names or zero-based indexes.
--log-file-path <log-file-path> Path to log file.
--name <name> Name for output project or solution to create. Default is SampleClassification. [default: SampleClassification]
-o, --output <output> Location folder for generated output. Default is current directory.
--split-ratio <split-ratio> Percent of dataset to use for validation. Range must be between 0 and 1. Don't specify if --cv-fold or --validation-dataset are set.
--train-time <train-time> Maximum time in seconds for exploring models with best configuration. Default time is 100 sec. [default: 100]
--validation-dataset <validation-dataset> File path for validation dataset in train/validation approaches.
-v, --verbosity <verbosity> Output verbosity choices: q[uiet], m[inimal] (default) and diag[nostic]. [default: m]
Required options: --dataset, --label-col
Action items
@lgong-rms I failed to reproduce the error on a dataset with quotes. The dataset preview page looks correct when loading a dataset with quotes in it.
Do you just mean the generated code might fail to load the dataset because of the missing allowQuoting?
Next steps:
allowQuoting to data loading code-behind
Also
@LittleLittleCloud Verified on the latest main build 17.17.0.2337501: "allowQuoting" has been added successfully in advanced data option.
But I didn't see any difference between choosing "Yes" and "No". Could you provide detailed datasets and checkpoints to verify this function?
@v-Hailishi it probably means the parsing result for the test dataset you use doesn't differ because it doesn't contain fields with quotes.
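One way to check whether a test dataset can show any difference at all is to look for lines whose field count changes once double quotes are honoured. The following is only a rough sketch in Python using the standard csv module, not ML.NET's loader; the function name and the sample data are made up for illustration, and it assumes a comma separator and no quoted fields that span multiple lines.

```python
import csv
import io

def rows_affected_by_quoting(text, sep=","):
    """Return 1-based line numbers whose field count changes when
    double quotes are honoured versus ignored (hypothetical helper)."""
    affected = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        naive_fields = line.split(sep)  # quotes treated as ordinary characters
        quoted_fields = next(csv.reader(io.StringIO(line), delimiter=sep))
        if len(naive_fields) != len(quoted_fields):
            affected.append(number)
    return affected

# Made-up datasets: the first has no quoted separators, the second does.
plain_data = "Text,Label\nGood phone,positive\n"
quoted_data = 'Text,Label\n"Good phone, bad battery",negative\nFine,positive\n'

print(rows_affected_by_quoting(plain_data))   # no lines listed
print(rows_affected_by_quoting(quoted_data))  # line 2 is listed
```

If the helper lists no lines for a dataset, toggling the quoting option would not be expected to change how that dataset is split into columns.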
Describe the bug
allowQuoting when loading from a text file so the generated code is
The allowQuoting parameter for the LoadFromTextFile() method is false by default, so the text read in will be wrong if the input text file does include separators within double-quoted values. However, Model Builder issues no warning in such a case and completes the training with an erroneous model.
To Reproduce
Steps to reproduce the behavior: Just modify some of the text classification Model Builder tutorials by including separators within double-quoted text values and putting the label column after the feature columns. The label column will be messed up due to the separators within double-quoted text values.
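The column shift in the reproduction steps can be illustrated outside ML.NET. This is a minimal sketch in Python with the standard csv module, not a call to LoadFromTextFile(); the sample row and function names are hypothetical, and the naive split only approximates, as an assumption, how a loader behaves when quoting support is off.

```python
import csv
import io

# Hypothetical sample: a text feature containing a comma inside double
# quotes, with the label column placed after the feature column.
SAMPLE = 'Text,Label\n"Great phone, terrible battery",negative\n'

def split_naive(line, sep=","):
    """Split on every separator, treating quotes as ordinary characters."""
    return line.split(sep)

def split_quoted(line, sep=","):
    """Split with quote awareness, so separators inside double quotes
    stay part of the field."""
    return next(csv.reader(io.StringIO(line), delimiter=sep))

header, row = SAMPLE.strip().split("\n")
label_index = split_naive(header).index("Label")

naive = split_naive(row)
quoted = split_quoted(row)

# Without quote handling the label slot holds a piece of the text field.
print(repr(naive[label_index]))   # ' terrible battery"'
print(repr(quoted[label_index]))  # 'negative'
```

Because the label column sits after the text column, every separator inside a quoted value pushes the real label one position to the right, which matches the symptom reported here.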
Expected behavior
allowQuoting at the Data step and false but the input does include separators within double-quoted values.