dcs-sastra / Kosaksi-Pasapugazh-and-experiments

Apache License 2.0
R2 dips after tuning different ways in which the feature can be bucketed #1

Open anjineyulutv opened 3 months ago

anjineyulutv commented 3 months ago

We can plot the loss curve against different empirical bin sizes and interpolate along the curve to find the optimal bin size, @likespeanuts. Let me know if you need more clarification.

anjineyulutv commented 3 months ago

This approach involves the following steps:

1. Define a range of bin sizes to evaluate.
2. For each bin size:
   a. Bin the feature(s) using the specified bin size.
   b. Train a machine learning model (e.g., regression or classification) using the binned features.
   c. Compute and store the loss (e.g., mean squared error for regression, log loss for classification) on a validation set.
3. Plot the loss curve against the bin sizes.
4. Interpolate the loss curve to find the bin size that corresponds to the minimum loss.
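The steps above can be sketched roughly as follows. This is only a minimal illustration under several assumptions that are not from this thread: the data is synthetic, the bins are equal-width, the sweep is over the number of bins (bin size is then the feature range divided by that number), and the model is linear regression on one-hot bin indicators, which reduces to predicting the training mean of each bin. All function names are hypothetical. The interpolation step is left out; the sketch just picks the grid point with the lowest validation MSE.

```python
import math
import random
from statistics import mean


def bin_index(x, lo, hi, n_bins):
    """Map x to an equal-width bin index in 0..n_bins-1 (values outside [lo, hi] are clamped)."""
    if hi == lo:
        return 0
    i = int((x - lo) / (hi - lo) * n_bins)
    return min(max(i, 0), n_bins - 1)


def fit_bin_means(xs, ys, lo, hi, n_bins):
    """Least squares on one-hot bin indicators is equivalent to per-bin means."""
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for x, y in zip(xs, ys):
        b = bin_index(x, lo, hi, n_bins)
        sums[b] += y
        counts[b] += 1
    overall = mean(ys)
    # Empty bins fall back to the overall training mean.
    return [sums[b] / counts[b] if counts[b] else overall for b in range(n_bins)]


def mse(y_true, y_pred):
    return mean((t - p) ** 2 for t, p in zip(y_true, y_pred))


def r2(y_true, y_pred):
    y_bar = mean(y_true)
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return 1.0 - ss_res / ss_tot


def sweep_bin_counts(x_tr, y_tr, x_va, y_va, candidates):
    """Return (n_bins, validation MSE, validation R2) for each candidate bin count."""
    lo, hi = min(x_tr), max(x_tr)
    results = []
    for n_bins in candidates:
        means = fit_bin_means(x_tr, y_tr, lo, hi, n_bins)
        preds = [means[bin_index(x, lo, hi, n_bins)] for x in x_va]
        results.append((n_bins, mse(y_va, preds), r2(y_va, preds)))
    return results


# Synthetic data (an assumption for illustration only): y = sin(x) + noise.
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(600)]
ys = [math.sin(x) + rng.gauss(0, 0.3) for x in xs]
x_tr, y_tr, x_va, y_va = xs[:400], ys[:400], xs[400:], ys[400:]

candidates = [1, 2, 4, 8, 16, 32, 64, 128]
results = sweep_bin_counts(x_tr, y_tr, x_va, y_va, candidates)
best_n_bins, best_mse, best_r2 = min(results, key=lambda r: r[1])

for n_bins, loss, score in results:
    print(f"n_bins={n_bins:4d}  val_mse={loss:.4f}  val_r2={score:.4f}")
print("best n_bins on the grid:", best_n_bins)
```

Too few bins underfit and too many overfit the training noise, so the validation loss is typically U-shaped in the number of bins. Plotting `results` and interpolating near the minimum would refine the choice between grid points.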

likespeanuts commented 3 months ago

Okay @anjineyulutv, I will look into it. I'll let you know if I need more clarification.

likespeanuts commented 3 months ago

@anjineyulutv I think I'm done. (two images attached)

likespeanuts commented 3 months ago

I maximized R^2 and minimized MSE.
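For reference, the two criteria are tied together by the standard definitions. Here $y_i$ are the observed targets, $\hat{y}_i$ the predictions, $\bar{y}$ the mean of the observed targets, and $n$ the number of evaluation points (symbols introduced here, not taken from the thread):

```latex
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2,
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
    = 1 - \frac{\mathrm{MSE}}{\frac{1}{n}\sum_{i=1}^{n} (y_i - \bar{y})^2}
```

The denominator depends only on the evaluation targets, not on the model or the bin size. So, assuming both metrics are computed on the same evaluation set, maximizing R^2 and minimizing MSE select the same bin size.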

anjineyulutv commented 3 months ago

So does it improve R2, @likespeanuts?

likespeanuts commented 3 months ago

Somehow it doesn't. I used Decision Tree Regressor for this because Claude recommended it. I'll try switching models, maybe that will help.

anjineyulutv commented 3 months ago

Please try not to change the algorithm for now. Try getting the optimal bin size for linear regression only.

likespeanuts commented 3 months ago

@anjineyulutv I have updated it, but the R^2 value still hasn't improved. I need to review the code and will update further tomorrow.

anjineyulutv commented 3 months ago

@likespeanuts I want you to think along these lines: given an R2 value from the business, work backwards to a bin size, via a sort of backpropagation. Think about it. Please take your time :) You can also think about variable-size binning: cut the feature into N sections based on its curvature, bin each section into M bins, and assign global bin numbers from 1 to M*N. Let's see how it goes.
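One possible reading of the variable-size binning idea is sketched below. The thread does not say how the sections would be derived from the curvature, so the section edges are simply passed in as an assumed input. The function name `global_bin` and the example edges are hypothetical. Each section is split into M equal-width local bins, and the global bin number runs from 1 to M*N.

```python
from bisect import bisect_right


def global_bin(x, boundaries, m_bins):
    """Return a global bin number in 1..N*M for the value x.

    boundaries: sorted section edges [e0, e1, ..., eN] defining N sections
                (assumed to be chosen beforehand, e.g. from curvature).
    m_bins:     number of equal-width bins inside each section (M).
    """
    n_sections = len(boundaries) - 1
    # Locate the section containing x; values outside the edges are clamped.
    s = bisect_right(boundaries, x) - 1
    s = min(max(s, 0), n_sections - 1)
    lo, hi = boundaries[s], boundaries[s + 1]
    # Equal-width local bin inside the section, clamped to 0..M-1.
    local = int((x - lo) / (hi - lo) * m_bins)
    local = min(max(local, 0), m_bins - 1)
    return s * m_bins + local + 1


# Hypothetical example: N = 3 sections of unequal width, M = 4 bins each.
edges = [0.0, 1.0, 3.0, 10.0]
for value in [0.0, 0.99, 1.0, 2.9, 3.0, 10.0]:
    print(value, "->", global_bin(value, edges, 4))
```

Narrow sections get narrow bins, so regions where the target changes quickly are resolved more finely than flat regions, while the total number of bins stays fixed at M*N.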