cerlymarco / linear-tree

A python library to build Model Trees with Linear Models at the leaves.
MIT License
338 stars, 54 forks

[performance suggestions?] Parallelism between trees, and replacing the linear fit with batched SGD? #36

Closed: HaoLi111 closed this issue 10 months ago

HaoLi111 commented 10 months ago

It seems that any two tree models in a forest can be trained in parallel. Is there a way to pass n_jobs=-1 as a parameter, or to wrap the whole fit in a with block that sets a joblib multiprocessing backend with n_jobs=-1?

Is it possible to replace the linear fit with an SGD fit for large-scale data? Should we (in terms of speed and model equivalence)?

Also, is it possible to use the GPU to solve each linear fit (either the traditional way or with gradient-based optimizers)?

I am thinking that this type of model, applied to tabular data, can have traceable error sensitivity (because the derivatives, or linear slopes, are known and the jumps are finite). One thing to try is to use these models on a wide range of biostatistics tabular datasets (some of them are very small (<2k obs, <50 vars), but have good local correlations and need good interpretability). So I am planning to use it at scale.

cerlymarco commented 10 months ago

Hi,

1) LinearTree and LinearForest accept n_jobs, which is handled by joblib. As with any sklearn model, n_jobs=-1 means use all cores, while n_jobs=2 means use two cores. If your machine has only two cores, it will run only on those.

2) sklearn.linear_model.SGDClassifier/SGDRegressor can be used as base_estimator in LinearTree. Performance depends on your task.

3) GPU training is not supported (like in sklearn).

If you support the project don't forget to leave a star ;-)

HaoLi111 commented 7 months ago

Thanks. I did get point 3 (CUDA) working by rewriting the linear model class in PyTorch, with coef, intercept and score attributes.

[Two screenshots, dated 2023-11-29]

But I guess:

  1. For small amounts of data, this is indeed very slow. (Use TorchMatrixBasedLinearRegression(device='cuda') for small ones.)

  2. Maybe I should publish this "TorchLinearRegression" as a separate package?

Update: linear-tree does run with these classes if they are written in sklearn format: https://www.kaggle.com/code/hli111111/sklearn-in-pytorch-with-gpu
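
For readers wondering what a linear model "in sklearn format" might look like, here is a rough sketch, assuming PyTorch and scikit-learn are installed. It is not the code from the linked notebook: the class body, the closed-form torch.linalg.lstsq solver, and the fit_intercept and device parameters are assumptions made for illustration. It only shows the estimator interface (fit, predict, score, coef_, intercept_).

```python
import numpy as np
import torch
from sklearn.base import BaseEstimator, RegressorMixin


class TorchLinearRegression(RegressorMixin, BaseEstimator):
    """Hypothetical least-squares regressor solved on a torch device."""

    def __init__(self, fit_intercept=True, device="cpu"):
        self.fit_intercept = fit_intercept
        self.device = device

    def fit(self, X, y, sample_weight=None):
        X = np.asarray(X, dtype=np.float64)
        y = np.asarray(y, dtype=np.float64)
        A = torch.as_tensor(X, device=self.device)
        b = torch.as_tensor(y, device=self.device).reshape(len(X), -1)
        if self.fit_intercept:
            # Append a column of ones so the last coefficient is the intercept.
            ones = torch.ones((A.shape[0], 1), dtype=A.dtype, device=A.device)
            A = torch.cat([A, ones], dim=1)
        if sample_weight is not None:
            # Weighted least squares: scale each row by sqrt(weight).
            w = np.sqrt(np.asarray(sample_weight, dtype=np.float64))
            w = torch.as_tensor(w, device=self.device).reshape(-1, 1)
            A, b = A * w, b * w
        # Note: on CUDA, lstsq assumes the design matrix has full rank.
        beta = torch.linalg.lstsq(A, b).solution.cpu().numpy()
        if self.fit_intercept:
            coef, intercept = beta[:-1], beta[-1]
        else:
            coef, intercept = beta, np.zeros(beta.shape[1])
        # Mirror sklearn's shapes: 1-D coef_ and scalar intercept_ for 1-D y.
        if y.ndim == 1:
            self.coef_, self.intercept_ = coef[:, 0], float(intercept[0])
        else:
            self.coef_, self.intercept_ = coef.T, intercept
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=np.float64)
        return X @ self.coef_.T + self.intercept_


# Fall back to the CPU when no CUDA device is available.
device = "cuda" if torch.cuda.is_available() else "cpu"
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 4.0  # noise-free linear target

model = TorchLinearRegression(device=device).fit(X, y)
print(model.coef_, model.intercept_, model.score(X, y))
```

The score method comes from RegressorMixin and returns R^2. As noted above, copying small arrays to the GPU for every fit can easily cost more than the solve itself, so a class like this is more likely to pay off on large leaves.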