Not sure how much the KKT check is adding: whenever the number of lambdas that pass the KKT check is neither 0 nor the full batch, we have to recompute the gradient. This recomputation can be avoided by storing the gradients computed during the KKT check. Based on a preliminary study, as lambda decreases, all lambdas in the batch pass the KKT check, so in that regime we might be able to skip caching and save memory.
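A minimal sketch of the caching idea, with hypothetical names (`kkt_check_with_cache`, the lasso-style stationarity test, and the surrounding solver loop are all assumptions, not the actual implementation): the KKT check already computes the gradient of the smooth loss, so returning it alongside the pass/fail mask lets the caller reuse it instead of recomputing when only part of the batch passes.

```python
import numpy as np

def kkt_check_with_cache(X, y, beta, lambdas, tol=1e-6):
    """Check a lasso-style KKT condition for each lambda in the batch
    and return the gradient computed along the way, so the caller can
    reuse it instead of recomputing. Hypothetical sketch, not the
    actual solver code.
    """
    # Gradient of the smooth part 0.5 * ||y - X @ beta||^2;
    # it does not depend on lambda, so one evaluation serves the batch.
    grad = X.T @ (X @ beta - y)
    # Loose stationarity test: the zero pattern of beta is KKT-optimal
    # for a given lambda when every |grad_j| is within lambda (+ tol).
    passed = np.array([np.all(np.abs(grad) <= lam + tol) for lam in lambdas])
    return passed, grad

# Usage: only when some but not all lambdas pass do we need the
# gradient again, and the cached copy avoids the recomputation.
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 5))
y = rng.standard_normal(20)
beta = np.zeros(5)
lambdas = np.array([0.1, 1.0, 100.0])

passed, grad = kkt_check_with_cache(X, y, beta, lambdas)
n_passed = int(passed.sum())
if 0 < n_passed < len(lambdas):
    # Previously the gradient was recomputed here; now `grad` is reused.
    pass
```

Since the gradient is lambda-independent, the cache costs one array per check rather than one per lambda; the trade-off the note raises is whether even that is worth storing once every lambda in the batch passes.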