asu-cactus / netsdb

A system that seamlessly integrates Big Data processing and machine learning model serving in a distributed relational database
Apache License 2.0

Read LightGBM models from CSV and support sparse data and regression tasks #78

Closed hguan6 closed 1 year ago

hguan6 commented 1 year ago

Main updates:

  1. Support converting a LightGBM pickle file to CSV files and loading a LightGBM model from those CSV files.
  2. Dispatch the predict() function according to user input (withMissing/withoutMissing, that is, whether the data contains missing values); the dispatching speeds up the 1600-tree random forest model on the Higgs dataset.
  3. Change the aggregation functions for all algorithms. I believe the previous aggregation function was wrong, so I changed it. The details are described in my previous document. The basic idea is simple: for XGBoost and LightGBM, the aggregated predicted value is the sum of all predicted leaf values; for random forest, it is the mean of all predicted leaf values. For a classification task, the forest outputs class one if the aggregated value is greater than 0.5 and class zero otherwise.
  4. Enable regression tasks and support the "Year" dataset.
  5. Support loading SVM-format input data into dense matrices.
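
The aggregation rule in item 3 can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `ModelType`, `aggregateLeafValues`, and `toClassLabel` are hypothetical names, and the input is assumed to be the list of leaf values that each tree predicted for a single sample.

```cpp
#include <numeric>
#include <vector>

// Hypothetical names for illustration; not taken from the netsDB source.
enum class ModelType { XGBoost, LightGBM, RandomForest };

// Combine the leaf values that the individual trees predicted for one sample.
// Boosted models (XGBoost, LightGBM) sum the leaf values;
// random forest takes their mean.
double aggregateLeafValues(const std::vector<double>& leafValues, ModelType type) {
    if (leafValues.empty()) {
        return 0.0;  // assumption: an empty forest predicts 0
    }
    const double sum = std::accumulate(leafValues.begin(), leafValues.end(), 0.0);
    if (type == ModelType::RandomForest) {
        return sum / static_cast<double>(leafValues.size());
    }
    return sum;
}

// Classification rule described above: class one if the aggregated
// value is greater than 0.5, class zero otherwise.
int toClassLabel(double aggregated) {
    return aggregated > 0.5 ? 1 : 0;
}
```

For a regression task, the value returned by `aggregateLeafValues` would be used directly, without the thresholding step.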

Some code refactoring:

  1. Replace returnClass in the data structures with a union of threshold (for inner nodes) and leafValue (for leaf nodes).
  2. Refactor the function that loads the matrix from an input file to make it easier to understand.
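
To illustrate the first refactoring item, here is a rough sketch of a tree node that stores threshold and leafValue in a union. The struct layout, the child-index convention, and the helpers `makeInner`, `makeLeaf`, and `predictTree` are assumptions made for this example; the actual netsDB structures may differ.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical node layout; field names other than threshold and
// leafValue are assumptions made for this sketch.
struct Node {
    int32_t featureIndex;  // feature tested at an inner node; unused for a leaf
    int32_t leftChild;     // index of the left child; -1 marks a leaf
    int32_t rightChild;    // index of the right child; -1 marks a leaf
    union {
        double threshold;  // valid only for inner nodes
        double leafValue;  // valid only for leaf nodes
    };
    bool isLeaf() const { return leftChild < 0; }
};

Node makeInner(int32_t feature, double threshold, int32_t left, int32_t right) {
    Node n{};
    n.featureIndex = feature;
    n.leftChild = left;
    n.rightChild = right;
    n.threshold = threshold;
    return n;
}

Node makeLeaf(double value) {
    Node n{};
    n.featureIndex = -1;
    n.leftChild = -1;
    n.rightChild = -1;
    n.leafValue = value;
    return n;
}

// Walk one tree from the root (index 0) to a leaf and return its value.
// Assumes dense features with no missing values.
double predictTree(const std::vector<Node>& tree, const std::vector<double>& features) {
    int32_t i = 0;
    while (!tree[i].isLeaf()) {
        const Node& n = tree[i];
        i = (features[n.featureIndex] <= n.threshold) ? n.leftChild : n.rightChild;
    }
    return tree[i].leafValue;
}
```

Because a node is either an inner node or a leaf, the two fields are never needed at the same time, so sharing storage keeps the node small without losing information.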