StarRocks / starrocks

StarRocks, a Linux Foundation project, is a next-generation sub-second MPP OLAP database for full analytics scenarios, including multi-dimensional analytics, real-time analytics, and ad-hoc queries.
https://starrocks.io
Apache License 2.0
8.26k stars 1.67k forks source link

[function] Request support for machine learning algorithms #37112

Open jixxiong opened 6 months ago

jixxiong commented 6 months ago

Feature request

Is your feature request related to a problem? Please describe.

It seems that StarRocks does not currently support basic machine learning methods such as linear regression and xgboost.

Describe the solution you'd like

Is there any plan to build a basic framework that supports machine learning methods and introduce some machine learning algorithms?

For example, using the linear regression method, we need two functions: a training function and an inference function.

Assuming that the table test_tbl has three double columns, Y, X1, and X2. We insert some values (3, 1, 2), (5, 2, 3), (7, 3, 4).

First, for the training function, we can think of it as an aggregate function that aggregates all rows and uses the least squares method to calculate the model parameters W. For example,

select linear_regression_train(Y, [X1, X2]) from test_tbl;

This command will result in a JSON output containing the model parameters, indicating that this linear function should be Y=0.67+0.67X1+1.33X2.

{"W": [0.67, 0.67, 1.33]}

Based on this model, we can make inferences. The inference process can be thought of as a special table function, which is different from a normal table function in that it requires the trained model as an input and calculates the data based on the model. Finally, for each row of data input, a result is calculated. For example, if we use

with model as ( 
    select linear_regression_train(Y, [X1, X2], true) as params 
    from test_tbl 
) 
select linear_regression_predict(model.params, [X1, X2]) from test_tbl;

This command uses the trained parameters for inference, so the calculation process is to substitute each row's X1, X2 into the linear function obtained earlier. We can get the following results: 3 5 7.

Some details:

Describe alternatives you've considered

Additional context

github-actions[bot] commented 1 week ago

We have marked this issue as stale because it has been inactive for 6 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to StarRocks!