Datasets used for producing benchmarks in scikit-learn intelex

vineel96 commented 1 year ago

Hello, can I get information on the datasets used to produce the benchmark results (speedup values) for the different scikit-learn algorithms, as shown in the figure under the Acceleration subsection at https://github.com/intel/scikit-learn-intelex ? The image is also attached here: scikit-learn-acceleration-2021 2 3

Alexsandruss commented 1 year ago

Datasets are specified in this config: https://github.com/IntelPython/scikit-learn_bench/blob/master/configs/skl_config.json
Data generation/loading functions are defined here: https://github.com/IntelPython/scikit-learn_bench/tree/master/datasets

vineel96 commented 1 year ago

Hi, thank you for the links. So all experiments in the figure were done with synthetic datasets generated by sklearn's make_blobs (except for SVC and RF, where the dataset is named) using this script: https://github.com/IntelPython/scikit-learn_bench/blob/master/datasets/make_datasets.py , right?

Alexsandruss commented 1 year ago

Yes, that's right.
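
For reference, a minimal sketch of generating a synthetic dataset with sklearn's make_blobs is shown here. The sample count, feature count, number of centers, and seed are illustrative assumptions, not the values used in skl_config.json or make_datasets.py.

```python
from sklearn.datasets import make_blobs

# Illustrative sizes only (assumed); the benchmark's real sizes are
# defined in skl_config.json, not here.
X, y = make_blobs(n_samples=1000, n_features=20, centers=10, random_state=0)

# X holds the feature matrix, y the index of the blob each sample came from.
print(X.shape, y.shape)
```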

vineel96 commented 1 year ago

Thanks for the information

vineel96 commented 1 year ago

Hi @Alexsandruss, regarding inference:

1. Which data is used for kmeans? (There is no "testing" attribute for kmeans in skl_config.json.)
2. For knn, are training and testing samples generated separately, or are the training samples themselves used for testing?
3. For knn-kdt, linear regression, and ridge regression, no testing data info is provided, so which data is used for inference?
4. For random forest and SVC, no info is provided on the train/test split. Which data is used for inference?
5. In the inference speedup graph, the dbscan algorithm is not shown. Why?

Alexsandruss commented 1 year ago

1-4. If the 'testing' field is not provided, then the same data is used for training and inference. The train/test split for named datasets is defined in the data loaders.

5. sklearn's DBSCAN doesn't have a separate function for inference.
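
A rough sketch of the behaviour described above: when a config entry has no 'testing' field, the training data is reused for inference. The entry layout and the `resolve_datasets` helper are hypothetical simplifications, not the actual scikit-learn_bench code. The last lines check that sklearn's DBSCAN exposes fit_predict() but no predict().

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs


def resolve_datasets(entry):
    """Hypothetical helper: fall back to the training spec when
    the entry has no 'testing' field."""
    training = entry["training"]
    testing = entry.get("testing", training)
    return training, testing


# Simplified, assumed config entry with no 'testing' field.
entry = {"training": {"n_samples": 500, "n_features": 8}}
train_spec, test_spec = resolve_datasets(entry)
same_data = test_spec is train_spec

X_train, _ = make_blobs(
    n_samples=train_spec["n_samples"],
    n_features=train_spec["n_features"],
    centers=5,
    random_state=0,
)
X_test = X_train  # no 'testing' field, so inference reuses the training data

model = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X_train)
labels = model.predict(X_test)

# DBSCAN has no separate inference method, only fit() and fit_predict().
dbscan_has_predict = hasattr(DBSCAN(), "predict")
dbscan_has_fit_predict = hasattr(DBSCAN(), "fit_predict")
```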

vineel96 commented 1 year ago

Hi @Alexsandruss,

1-4. Generally we use different data for inference and training, right? Is it OK to use the same training data for inference as well? For named datasets, for example higgs_one_m for random forest, the speedup graph above shows a data size of 1M for both the inference graph and the training graph. But loader_classification.py (in the datasets folder) shows different splits: (1000000, 28) for training and (500000, 28) for inference. So which split is actually used in the inference speedup graph? (The same question applies to all named datasets.)

5. So which function is used for dbscan in the training speedup graph, fit() or fit_predict()?
6. For knn kdtree, there is no fit() function. So in the training speedup graph, is only the object creation KDTree() timed, or is something else used? Also, which function is used for inference? Is tree.query() used?
7. Can you also provide the parameters that were used for each algorithm when generating the speedup graph above, for example for SVC and RF? I see that parameters for the other algorithms are given in skl_config.json.
8. Also, what are "time_method" and "time_limit" for kmeans in skl_config.json? And does n_clusters there refer to the initial number of clusters?
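
For context on the KDTree question above, here is a small sketch of the two k-d tree routes that sklearn itself offers: the standalone KDTree class, which builds the tree in its constructor and answers lookups with query(), and KNeighborsClassifier with algorithm="kd_tree", which does have fit() and predict(). Which of these the benchmark actually times is not stated in this thread, and the data sizes below are arbitrary assumptions.

```python
import numpy as np
from sklearn.neighbors import KDTree, KNeighborsClassifier

# Arbitrary toy data (assumed sizes, not taken from the benchmark config).
rng = np.random.default_rng(0)
X = rng.random((200, 3))
y = (X[:, 0] > 0.5).astype(int)

# Route 1: standalone KDTree. Construction builds the tree; there is no fit().
tree = KDTree(X, leaf_size=40)
dist, ind = tree.query(X[:5], k=3)  # distances and indices of the 3 nearest points

# Route 2: estimator API backed by a k-d tree, with the usual fit()/predict().
clf = KNeighborsClassifier(n_neighbors=3, algorithm="kd_tree").fit(X, y)
pred = clf.predict(X[:5])
```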

IntelPython / scikit-learn_bench

Datasets used for producing benchmarks in scikit-learn intelex #135