SuperCowPowers / zat

Zeek Analysis Tools (ZAT): Processing and analysis of Zeek network data with Pandas, scikit-learn, Kafka and Spark
MIT License

How to make sure the dataframe_to_matrix function performs the same on data with the same structure? #135

Closed Wapiti08 closed 2 years ago

Wapiti08 commented 2 years ago

There are automatic normalization and encoding functions inside the function.

But if I want to use them separately on the train and test datasets, I need to make sure the same scaler and encoding method are applied to both. I found that there is no separate fit function that would let me save the scaler.

Is there any way to achieve my requirements based on ZAT?

Hope to get some advice on that.

Wapiti08 commented 2 years ago

I found a solution for this question: just reuse the same to_matrix.

brifordwylie commented 2 years ago

@Wapiti08 it's a good question. If you look at https://github.com/SuperCowPowers/zat/blob/master/zat/dataframe_to_matrix.py, the class has two methods, fit_transform and transform. When you're using training data, use fit_transform, and when you're doing evaluation/prediction you should just use transform. It might be good for me to make a notebook explaining this :)
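To make the fit_transform/transform split concrete, here is a minimal sketch of the pattern in plain Python. The `TinyMatrixEncoder` class, its attributes, and the sample rows are hypothetical stand-ins invented for illustration; this is not ZAT's implementation. The only thing taken from the comment above is the idea of two methods: one that learns the scaling and encoding from the training data, and one that reuses what was learned.

```python
# Hypothetical stand-in for illustration only; NOT ZAT's implementation.
class TinyMatrixEncoder:
    """Learns scaling ranges and category lists once, then reuses them."""

    def fit_transform(self, rows):
        # Numeric columns get min-max scaled, string columns get one-hot encoded.
        first = rows[0]
        self.numeric_cols = sorted(
            k for k, v in first.items() if isinstance(v, (int, float))
        )
        self.category_cols = sorted(
            k for k, v in first.items() if isinstance(v, str)
        )
        # Learn min/max and category values from the training rows only.
        self.ranges = {
            c: (min(r[c] for r in rows), max(r[c] for r in rows))
            for c in self.numeric_cols
        }
        self.categories = {
            c: sorted({r[c] for r in rows}) for c in self.category_cols
        }
        return self.transform(rows)

    def transform(self, rows):
        # Apply the stored ranges and categories; nothing is re-learned here.
        matrix = []
        for r in rows:
            vec = []
            for c in self.numeric_cols:
                lo, hi = self.ranges[c]
                span = (hi - lo) or 1.0  # avoid dividing by zero
                vec.append((r[c] - lo) / span)
            for c in self.category_cols:
                # A category unseen during training encodes as all zeros.
                vec.extend(1.0 if r[c] == cat else 0.0 for cat in self.categories[c])
            matrix.append(vec)
        return matrix


train = [
    {"bytes": 100, "proto": "tcp"},
    {"bytes": 300, "proto": "udp"},
]
test = [{"bytes": 200, "proto": "icmp"}]

encoder = TinyMatrixEncoder()
train_matrix = encoder.fit_transform(train)  # learn + apply on training data
test_matrix = encoder.transform(test)        # apply only, same columns and scale
```

Because the same object is reused, the test matrix has the same column layout and scaling as the training matrix, even when the test data contains a value (here `"icmp"`) that never appeared during training.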

Wapiti08 commented 2 years ago

I found that example. It helped me resolve my concern (it was kind of a mind shift). Thanks a lot again.