dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License
9.02k stars 1.88k forks source link

LightGbm Regression Trainer Multiple Unneeded Database Access #6897

Open superichmann opened 10 months ago

superichmann commented 10 months ago

System Information (please complete the following information):

Describe the bug Fit method accesses the database two times instead of one time. Cache was set. Maximum BatchSize was set a 15 seconds delay happens between the two queries. Database server is in the same server with the running code, there is no load on the server.

To Reproduce if you dont have time to reproduce please just look at my ipynb code Create a database loader for the data, see this ipynb (change from json to ipynb) download QDB and install my data is from here but you can use your own data

Expected behavior The LightGbm trainer should query the database once and not twice.

Screenshots, Code, Sample Projects Database log:

2023-11-30T20:35:42.275822Z I server-main os scheduled worker started [name=ilpwriter_0]
2023-11-30T20:35:42.823066Z A server-main enjoy
2023-11-30T20:35:51.589288Z I pg-server connected [ip=127.0.0.1, fd=3436]
2023-11-30T20:35:51.591142Z I i.q.g.SqlCompilerImpl parse [fd=3436, q=xHole where date<'2017-01-01T00:00:00.000000Z']
2023-11-30T20:35:51.600643Z I i.q.g.SqlCompilerImpl plan [q=`select-choose date, store_nbr, family, sfZero, id, sf, sales, onpromotion, sfOpen, sfPromotion, transactions, dcoilwtico, city, local_event, local_type, local_desc, local_transferred, state, regional_event, regional_type, regional_desc, regional_transferred, national_event, national_type, national_desc, national_transferred, type, cluster, doywoy, yearcount, monthCount, weekOfYear, DayOfWeek, DayOfMonth, daysCounter, monthProgress, dayOfYear, yearProgress, Weekend, quarter, RANAD, familysfOpen from (select [date, store_nbr, family, sfZero, id, sf, sales, onpromotion, sfOpen, sfPromotion, transactions, dcoilwtico, city, local_event, local_type, local_desc, local_transferred, state, regional_event, regional_type, regional_desc, regional_transferred, national_event, national_type, national_desc, national_transferred, type, cluster, doywoy, yearcount, monthCount, weekOfYear, DayOfWeek, DayOfMonth, daysCounter, monthProgress, dayOfYear, yearProgress, Weekend, quarter, RANAD, familysfOpen] from xHole timestamp (date) where date < '2017-01-01T00:00:00.000000Z')`, fd=3436]
2023-11-30T20:35:51.605458Z I i.q.c.TableReader open partition C:\qdbroot\db\xHole\2013 [rowCount=650430, partitionNameTxn=-1, transientRowCount=433026, partitionIndex=0, partitionCount=5]
2023-11-30T20:35:51.608417Z I i.q.c.TableReader open partition C:\qdbroot\db\xHole\2014 [rowCount=655776, partitionNameTxn=-1, transientRowCount=433026, partitionIndex=1, partitionCount=5]
2023-11-30T20:35:51.611011Z I i.q.c.TableReader open partition C:\qdbroot\db\xHole\2015 [rowCount=650430, partitionNameTxn=-1, transientRowCount=433026, partitionIndex=2, partitionCount=5]
2023-11-30T20:35:51.613647Z I i.q.c.TableReader open partition C:\qdbroot\db\xHole\2016 [rowCount=669042, partitionNameTxn=-1, transientRowCount=433026, partitionIndex=3, partitionCount=5]
2023-11-30T20:36:04.985713Z I i.q.g.SqlCompilerImpl parse [fd=-1, q=DISCARD ALL]
2023-11-30T20:36:04.985903Z I i.q.c.p.PGConnectionContext exec [fd=3436, q=xHole where date<'2017-01-01T00:00:00.000000Z']
2023-11-30T20:36:04.985926Z I i.q.c.p.PGConnectionContext query cache used [fd=3436]
2023-11-30T20:36:31.160548Z I i.q.g.SqlCompilerImpl parse [fd=-1, q=DISCARD ALL]
2023-11-30T20:36:31.160893Z I i.q.g.SqlCompilerImpl parse [fd=3436, q=xHole where date<'2017-08-16T00:00:00.000000Z' AND date>='2017-01-01T00:00:00.000000Z']
2023-11-30T20:36:31.163041Z I i.q.g.SqlCompilerImpl plan [q=`select-choose date, store_nbr, family, sfZero, id, sf, sales, onpromotion, sfOpen, sfPromotion, transactions, dcoilwtico, city, local_event, local_type, local_desc, local_transferred, state, regional_event, regional_type, regional_desc, regional_transferred, national_event, national_type, national_desc, national_transferred, type, cluster, doywoy, yearcount, monthCount, weekOfYear, DayOfWeek, DayOfMonth, daysCounter, monthProgress, dayOfYear, yearProgress, Weekend, quarter, RANAD, familysfOpen from (select [date, store_nbr, family, sfZero, id, sf, sales, onpromotion, sfOpen, sfPromotion, transactions, dcoilwtico, city, local_event, local_type, local_desc, local_transferred, state, regional_event, regional_type, regional_desc, regional_transferred, national_event, national_type, national_desc, national_transferred, type, cluster, doywoy, yearcount, monthCount, weekOfYear, DayOfWeek, DayOfMonth, daysCounter, monthProgress, dayOfYear, yearProgress, Weekend, quarter, RANAD, familysfOpen] from xHole timestamp (date) where date < '2017-08-16T00:00:00.000000Z' and date >= '2017-01-01T00:00:00.000000Z')`, fd=3436]
2023-11-30T20:36:31.164073Z I i.q.c.TableReader open partition C:\qdbroot\db\xHole\2017 [rowCount=433026, partitionNameTxn=-1, transientRowCount=433026, partitionIndex=4, partitionCount=5]

image

LittleLittleCloud commented 10 months ago

I only found one query from database log, where's the second query

superichmann commented 10 months ago

oh sorry @LittleLittleCloud first query:2023-11-30T20:35:51.591142Z I i.q.g.SqlCompilerImpl parse [fd=3436, q=xHole where date<'2017-01-01T00:00:00.000000Z'] second: 2023-11-30T20:36:04.985903Z I i.q.c.p.PGConnectionContext exec [fd=3436, q=xHole where date<'2017-01-01T00:00:00.000000Z']

superichmann commented 10 months ago

@LittleLittleCloud any hope?

LittleLittleCloud commented 10 months ago

@superichmann Are you sure the first log is a query log? It seems to be a parsing sql log?

superichmann commented 10 months ago

Hi again @LittleLittleCloud I asked the guys at QuestDB database (which implements postgresql interface) and they say it is two separate queries. Maybe ml.net is making a pre-fetch or something? As you can see for the validation set (called by Transform and Evaluate) there is only one line in the log: 2023-11-30T20:36:31.160893Z I i.q.g.SqlCompilerImpl parse [fd=3436, q=xHole where date<'2017-08-16T00:00:00.000000Z' AND date>='2017-01-01T00:00:00.000000Z'] what do you think?

superichmann commented 9 months ago

@LittleLittleCloud @luisquintanilla Clarification: This issue relates to any databaseloader used by any ml , automl experiment (with cache) (even with maxmodels=1) or standalone FastForest or LightGbm. the issue is multiple unneeded database access.

If any further information is needed from my end please let me know. code snippet, entire walkthrough on how to reproduce, whatever, just let me know

LittleLittleCloud commented 9 months ago

@superichmann Maybe set the columnsToPrefetch when caching dataset?

// from
train = mlContext.Data.Cache(train);

// to 
train = mlContext.Data.Cache(train, columnsToPrefetch: featuresArray);

Thought: Looking through your pipeline, there're two places where a query might be triggered

var pipeline = mlContext.Transforms.Categorical.OneHotEncoding(CATsArray.ToArray()) // this will trigger a query when transforming.

.Append((mlContext.Regression.Trainers.LightGbm(softOptions)))); // this also trigger a query.

When you Cache a dataset, the rows won't be cached until it's been asked, and it caches dataset column by column. When fitting OneHotEncoding, a set of category columns will be cached (query 1). And when fitting lightGBM trainer, Although category columns already been cached, numeric columns are still missing, which trigger another DB request (query 2)

How to validate my thought:

Let me know if my thought is correct in any way

superichmann commented 9 months ago

@LittleLittleCloud Thanks I am now checking.

I have removed the OneHotEncoding from the pipeline and now there is only one SELECT from the DB for Fit. I have re introduced the OneHotEncoding and indeed it is now making 2 queries. When incorporating columnsToPrefetch:featuresArray.ToArray() the query occur in the Cache state and one time again in the Fit. (still two queries).

Usually I use automl experiment which automatically OneHotEncodes the data..

You think there might be a way to make it with one data query from the db?

Here are some further statistics I have collected (multiple runs) (same data same columns) (db restarted on each run) Normal Cache + OneHotEncode + Fit + transform = 45 seconds Columns Cache + OneHotEncode + Fit + transform = 55 seconds (Cache and Fit takes extra 5 seconds each)

LittleLittleCloud commented 9 months ago

@superichmann

When incorporating columnsToPrefetch:featuresArray.ToArray() the query occur in the Cache state and one time again in the Fit. (still two queries).

Can you also add label column in columnsToPrefetch? The second Fit is probably because label column is missing when filling cached column. (Sorry that I forget to mention it in previous reply)

superichmann commented 9 months ago

@LittleLittleCloud mmm it is already included in the list. all of the columns during ml are in the list. isnt it happening since the Cache function is lazy cache?

LittleLittleCloud commented 9 months ago

@superichmann Do you mean the label column is already included in featuresArray? From the notebook you shared above the featuresArray seems only contains features though

var softOptions = new LightGbmRegressionTrainer.Options
{
LabelColumnName = "sales",
FeatureColumnName = "Features",
BatchSize = 999999999
};

List<string> featuresArray = new List<string>();
featuresArray.AddRange(new List<string>(){"store_nbr","family","sf","city","local_type","local_desc","state","regional_type","regional_desc","national_type","national_desc","type","cluster","DayOfWeek","quarter","familysfOpen","sfOpen","sfPromotion","local_event","local_transferred","regional_event","regional_transferred","national_event","national_transferred","Weekend"});
featuresArray.AddRange(new List<string>(){"sfZero","id","onpromotion","transactions","dcoilwtico","doywoy","yearcount","monthCount","weekOfYear","DayOfMonth","daysCounter","monthProgress","dayOfYear","yearProgress","RANAD"});
List<string> categoricals = new List<string>(){"store_nbr","family","sf","city","local_type","local_desc","state","regional_type","regional_desc","national_type","national_desc","type","cluster","DayOfWeek","quarter","familysfOpen","sfOpen","sfPromotion","local_event","local_transferred","regional_event","regional_transferred","national_event","national_transferred","Weekend"};
List<InputOutputColumnPair> CATsArray = new List<InputOutputColumnPair>();
superichmann commented 9 months ago

@LittleLittleCloud yes I know... I am using different automatic code that also incorporates the target column.. this notebook was just for the initial test. before we used columnsToPrefetch.

if you want I can create another notebook with comprehensive instructions and a database setup for you to check yourself.

superichmann commented 8 months ago

@LittleLittleCloud 🙈