dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License

Question: improve performance of reading files? #6024

Open torronen opened 2 years ago

torronen commented 2 years ago

I am running LightGbmBinaryTrainer through the AutoML API. The start of training is slow. Are there ways to make it faster? CPU usage is below one core, and HDD read throughput is also very low. Subsequent experiments seem much faster.

For example, should I read the data into memory before starting AutoML?

    var loadOptions = columnInference.TextLoaderOptions;
    loadOptions.UseThreads = true;
    TextLoader textLoader = mlContext.Data.CreateTextLoader(loadOptions);

This is where it seems to spend most of its time, every time I pause execution. I have not run a profiler, though (it has some issues). [image]
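
On the question of reading the data into memory first, here is a minimal sketch. The temporary file, the two-column schema, and the `Row` class are all invented for illustration; in the post above the loader options come from column inference instead. `mlContext.Data.Cache` wraps a loaded `IDataView` in an in-memory cache, so later passes can read from memory rather than re-parsing the file. The cache is filled on first access, so it would not by itself speed up the first pass, and whether it helps AutoML in this case is not confirmed in this thread.

```csharp
using System;
using System.IO;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Data;

// Hypothetical sample file so the sketch is self-contained.
string path = Path.Combine(Path.GetTempPath(), "cache-sketch.tsv");
File.WriteAllLines(path, new[]
{
    "Label\tFeature",
    "true\t0.5",
    "false\t1.5",
    "true\t2.5",
});

var mlContext = new MLContext(seed: 47);

// Assumed schema; the original post takes these options from column inference.
var loadOptions = new TextLoader.Options
{
    HasHeader = true,
    Separators = new[] { '\t' },
    UseThreads = true,
    Columns = new[]
    {
        new TextLoader.Column("Label", DataKind.Boolean, 0),
        new TextLoader.Column("Feature", DataKind.Single, 1),
    },
};

TextLoader textLoader = mlContext.Data.CreateTextLoader(loadOptions);
IDataView lazyView = textLoader.Load(path);            // rows are parsed on demand
IDataView cachedView = mlContext.Data.Cache(lazyView); // in-memory cache, filled on first access

// Touch every row once so the cache is populated before training starts.
int rowCount = mlContext.Data
    .CreateEnumerable<Row>(cachedView, reuseRowObject: true)
    .Count();
Console.WriteLine($"Rows cached: {rowCount}");

// Row type matching the assumed schema above.
public class Row
{
    public bool Label { get; set; }
    public float Feature { get; set; }
}
```

The cached view can then be passed on wherever the original `IDataView` was used. Caching trades memory for read time, so it only makes sense when the data set fits in RAM.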

torronen commented 2 years ago

I am shuffling the data, and that might be part of the cause of the low performance. Could that be the case?

    TrainDataView = mlContext.Data.ShuffleRows(TrainDataView, 47, 50000, true);

I might be mistaken, but LightGBM might not benefit from shuffling. Does anyone know if that is correct?
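
One way to check whether the shuffle is the bottleneck is to time a full pass over the data with and without it. The following is a minimal sketch on synthetic in-memory rows; the `Row` class, the row count, and the `TimePass` helper are invented for illustration. The `ShuffleRows` arguments mirror the call above, written with named parameters. Timings on in-memory data will not match a file-backed view, so the point here is only the measurement pattern.

```csharp
using System;
using System.Diagnostics;
using System.Linq;
using Microsoft.ML;

var mlContext = new MLContext(seed: 47);

// Synthetic data standing in for the real training set (assumption).
var rows = Enumerable.Range(0, 200_000).Select(i => new Row { Feature = i });
IDataView trainDataView = mlContext.Data.LoadFromEnumerable(rows);

// Same arguments as in the post: seed 47, pool of 50000 rows, source shuffling on.
IDataView shuffled = mlContext.Data.ShuffleRows(
    trainDataView,
    seed: 47,
    shufflePoolSize: 50000,
    shuffleSource: true);

long plainCount = TimePass("no shuffle", trainDataView);
long shuffledCount = TimePass("shuffled", shuffled);

// Enumerates every row once and reports the elapsed time (hypothetical helper).
long TimePass(string label, IDataView view)
{
    var stopwatch = Stopwatch.StartNew();
    long count = mlContext.Data
        .CreateEnumerable<Row>(view, reuseRowObject: true)
        .LongCount();
    stopwatch.Stop();
    Console.WriteLine($"{label}: {count} rows in {stopwatch.ElapsedMilliseconds} ms");
    return count;
}

// Single-column row type for the synthetic data.
public class Row
{
    public float Feature { get; set; }
}
```

Running the same two passes against the real `TrainDataView` would show how much of the startup time the shuffle accounts for.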