dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License

Splitter/consolidator worker encountered exception while consuming source data in QA #6911

Closed zewditu closed 5 months ago

zewditu commented 6 months ago

System Information (please complete the following information):

Describe the bug: For this dataset, the exception is thrown after training completes; the issue seems to be in the validation step. I am able to reproduce it in an ML.NET repo test case, and the issue happens in the 'ComputeTopKSpansWithScore' method at https://github.com/dotnet/machinelearning/blob/main/src/Microsoft.ML.TorchSharp/Utils/MetricUtils.cs#L23

Here is the test code

        var ml = new Microsoft.ML.MLContext();

        ml.Log += (object sender, LoggingEventArgs e) =>
        {
            Console.WriteLine(e.Message);
        };

        ml.GpuDeviceId = 0;
        ml.FallbackToCpu = false;
        Console.WriteLine("Hello World!");

        var trainFile = GetDataPath("squad-train-clean.tsv");

        Microsoft.ML.Data.TextLoader textLoader =
            ml.Data.CreateTextLoader(new TextLoader.Options()
            {
                Columns = new[]
                {
                    new TextLoader.Column("Context", DataKind.String, 0),
                    new TextLoader.Column("Question", DataKind.String, 1),
                    new TextLoader.Column("TrainingAnswer", DataKind.String, 2),
                    new TextLoader.Column("AnswerIndex", DataKind.Int32, 3)
                },
                HasHeader = true,
                Separators = new[] { '\t' },
            }, new MultiFileSource(trainFile));

        Microsoft.ML.IDataView dataView = textLoader.Load(new MultiFileSource(trainFile));
        var testTrainSplit = ml.Data.TrainTestSplit(dataView, 0.95);

        var trainingDataset = testTrainSplit.TrainSet;
        var testDataset = ml.Data.TrainTestSplit(testTrainSplit.TestSet, 0.95).TrainSet;

        var estimator = ml.MulticlassClassification.Trainers.QuestionAnswer(maxEpochs: 1);
        // Fit on the test subset on purpose: the exception appears to come from this data.
        var model = estimator.Fit(testDataset);
        var transformedData = model.Transform(testDataset);
        var labelCol = transformedData.GetColumn<string[]>("Answer").ToArray();


Additional context: We trained on the test dataset because the exception appears to be thrown during validation, which uses the test dataset.

zewditu commented 5 months ago

@michaelgsharp investigated and found that the issue is caused by the dataset. The dataset was opened in Excel, and Excel adds extra quotes around text. However, ML.NET should have a way to handle this. @michaelgsharp, please add any further comments. @JakeRadMSFT @luisquintanilla thoughts?
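For anyone hitting the same thing, here is a minimal sketch of a pre-load check that flags rows where a field is wrapped in double quotes, which is the kind of change Excel can introduce when it re-saves a file. The class and method names (`TsvQuoteCheck`, `FindQuotedRows`) are hypothetical and not part of ML.NET, and the check assumes a simple tab-separated file with no embedded newlines.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of ML.NET.
public static class TsvQuoteCheck
{
    // Returns the 1-based line numbers of rows that contain at least one
    // field that both starts and ends with a double quote.
    public static List<int> FindQuotedRows(IEnumerable<string> lines, char separator = '\t')
    {
        var hits = new List<int>();
        int lineNumber = 0;
        foreach (var line in lines)
        {
            lineNumber++;
            bool hasQuotedField = line
                .Split(separator)
                .Any(f => f.Length >= 2 && f[0] == '"' && f[f.Length - 1] == '"');
            if (hasQuotedField)
            {
                hits.Add(lineNumber);
            }
        }
        return hits;
    }
}
```

It could be called as `TsvQuoteCheck.FindQuotedRows(System.IO.File.ReadLines(trainFile))` before building the loader. Separately, `TextLoader.Options` has an `AllowQuoting` setting for input that may include double-quoted values; whether enabling it avoids the exception for this dataset has not been verified here.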

michaelgsharp commented 5 months ago

Closing this issue as it's related to invalid data and not ML.NET.