superichmann opened 10 months ago
I only found one query in the database log. Where's the second query?
oh sorry @LittleLittleCloud
first query:2023-11-30T20:35:51.591142Z I i.q.g.SqlCompilerImpl parse [fd=3436, q=xHole where date<'2017-01-01T00:00:00.000000Z']
second: 2023-11-30T20:36:04.985903Z I i.q.c.p.PGConnectionContext exec [fd=3436, q=xHole where date<'2017-01-01T00:00:00.000000Z']
@LittleLittleCloud any hope?
@superichmann Are you sure the first log is a query log? It looks like an SQL parsing log.
Hi again @LittleLittleCloud
I asked the folks at QuestDB (which implements the PostgreSQL interface) and they say these are two separate queries. Maybe ML.NET is making a pre-fetch or something?
As you can see, for the validation set (called by Transform and Evaluate) there is only one line in the log:
2023-11-30T20:36:31.160893Z I i.q.g.SqlCompilerImpl parse [fd=3436, q=xHole where date<'2017-08-16T00:00:00.000000Z' AND date>='2017-01-01T00:00:00.000000Z']
what do you think?
@LittleLittleCloud @luisquintanilla Clarification: this issue relates to any DatabaseLoader used by any ML or AutoML experiment (with cache, even with maxModels=1), or by a standalone FastForest or LightGbm trainer. The issue is multiple unneeded database accesses.
If any further information is needed from my end (a code snippet, an entire walkthrough on how to reproduce, whatever), just let me know.
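For readers following along, here is a minimal, hypothetical sketch of the kind of setup being discussed. The `Row` class, the SqlClient provider, the connection string, and the SQL text are all my placeholders, not taken from the notebook; only the table name and date filter echo the logs above:

```csharp
using Microsoft.ML;
using Microsoft.ML.Data;
using System.Data.SqlClient;

var mlContext = new MLContext();

// DatabaseLoader streams rows straight from the DB; nothing is read
// until a downstream component actually pulls rows.
var loader = mlContext.Data.CreateDatabaseLoader<Row>();
var source = new DatabaseSource(
    SqlClientFactory.Instance,                        // placeholder provider
    "Server=localhost;Database=mydb;...",             // placeholder connection string
    "SELECT * FROM xHole WHERE date < '2017-01-01'"); // query shaped like the logs above
IDataView train = loader.Load(source);

// Lazy cache: columns are only materialized when first requested.
train = mlContext.Data.Cache(train);

public class Row
{
    [LoadColumn(0)] public float sales { get; set; }       // assumed column
    [LoadColumn(1)] public float onpromotion { get; set; } // assumed column
    // ... remaining feature columns
}
```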
@superichmann
Maybe set `columnsToPrefetch` when caching the dataset?
// from
train = mlContext.Data.Cache(train);
// to
train = mlContext.Data.Cache(train, columnsToPrefetch: featuresArray);
Thought: looking through your pipeline, there are two places where a query might be triggered:
var pipeline = mlContext.Transforms.Categorical.OneHotEncoding(CATsArray.ToArray()) // this will trigger a query when transforming.
    .Append(mlContext.Regression.Trainers.LightGbm(softOptions)); // this also triggers a query.
When you `Cache` a dataset, the rows won't be cached until they're asked for, and the dataset is cached column by column. When fitting `OneHotEncoding`, the set of category columns gets cached (query 1). And when fitting the LightGBM trainer, although the category columns are already cached, the numeric columns are still missing, which triggers another DB request (query 2).
How to validate my thought:
- set `columnsToPrefetch` to all columns; this will prefetch all columns into the cache prior to training your pipeline, or
- call `Fit` twice and see if you get four SQL query logs.
Let me know if my thought is correct in any way.
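If it helps, the prefetch-all-columns option could look roughly like this. A sketch only: it assumes `featuresArray` is the feature-name list from the notebook and that the label column is named `"sales"`, as in the trainer options shared later in the thread:

```csharp
using System.Linq;

// Prefetch every column the pipeline will touch, label included, so the
// lazy cache is filled in one pass instead of once per column group.
string[] allColumns = featuresArray
    .Concat(new[] { "sales" }) // "sales" = assumed label column name
    .Distinct()
    .ToArray();

train = mlContext.Data.Cache(train, columnsToPrefetch: allColumns);
```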
@LittleLittleCloud Thanks, I am checking now.
I have removed the OneHotEncoding from the pipeline, and now there is only one SELECT from the DB for Fit.
I have re-introduced the OneHotEncoding, and indeed it now makes two queries.
When incorporating `columnsToPrefetch: featuresArray.ToArray()`, one query occurs at the Cache stage and another again during Fit (still two queries).
Usually I use an AutoML experiment, which automatically OneHotEncodes the data.
Do you think there might be a way to make it work with one data query from the DB?
Here are some further statistics I have collected (multiple runs, same data, same columns, DB restarted on each run):
- Normal Cache + OneHotEncode + Fit + Transform = 45 seconds
- Columns Cache + OneHotEncode + Fit + Transform = 55 seconds (Cache and Fit each take an extra 5 seconds)
@superichmann

> When incorporating columnsToPrefetch:featuresArray.ToArray(), the query occurs at the Cache stage and once again during Fit (still two queries).

Can you also add the label column to `columnsToPrefetch`? The second query during `Fit` is probably because the label column is missing when filling the cached columns. (Sorry that I forgot to mention it in the previous reply.)
@LittleLittleCloud Hmm, it is already included in the list; all of the columns used during ML are in the list. Isn't this happening because the Cache function is a lazy cache?
@superichmann Do you mean the label column is already included in `featuresArray`? From the notebook you shared above, `featuresArray` seems to contain only features though:
var softOptions = new LightGbmRegressionTrainer.Options
{
LabelColumnName = "sales",
FeatureColumnName = "Features",
BatchSize = 999999999
};
List<string> featuresArray = new List<string>();
featuresArray.AddRange(new List<string>(){"store_nbr","family","sf","city","local_type","local_desc","state","regional_type","regional_desc","national_type","national_desc","type","cluster","DayOfWeek","quarter","familysfOpen","sfOpen","sfPromotion","local_event","local_transferred","regional_event","regional_transferred","national_event","national_transferred","Weekend"});
featuresArray.AddRange(new List<string>(){"sfZero","id","onpromotion","transactions","dcoilwtico","doywoy","yearcount","monthCount","weekOfYear","DayOfMonth","daysCounter","monthProgress","dayOfYear","yearProgress","RANAD"});
List<string> categoricals = new List<string>(){"store_nbr","family","sf","city","local_type","local_desc","state","regional_type","regional_desc","national_type","national_desc","type","cluster","DayOfWeek","quarter","familysfOpen","sfOpen","sfPromotion","local_event","local_transferred","regional_event","regional_transferred","national_event","national_transferred","Weekend"};
List<InputOutputColumnPair> CATsArray = new List<InputOutputColumnPair>();
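For completeness, `CATsArray` is declared empty above; presumably the notebook fills it from `categoricals` along these lines (my assumption, since the filling code isn't shown in this thread):

```csharp
// One input/output pair per categorical column; the output name
// reuses the input name, so each column is one-hot encoded in place.
foreach (var col in categoricals)
    CATsArray.Add(new InputOutputColumnPair(col, col));
```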
@LittleLittleCloud Yes, I know... I am using different automated code that also incorporates the target column. This notebook was just for the initial test, before we used columnsToPrefetch.
If you want, I can create another notebook with comprehensive instructions and a database setup so you can check it yourself.
@LittleLittleCloud 🙈
System Information (please complete the following information):
Describe the bug
The Fit method accesses the database two times instead of once. Cache was set. Maximum BatchSize was set. A 15-second delay occurs between the two queries. The database server is on the same machine as the running code, and there is no load on the server.
To Reproduce
If you don't have time to reproduce, please just look at my ipynb code:
- Create a database loader for the data; see this ipynb (change the extension from json to ipynb).
- Download and install QDB.
- My data is from here, but you can use your own data.
Expected behavior
The LightGbm trainer should query the database once, not twice.
Screenshots, Code, Sample Projects Database log: