firefly-cpp / NiaAML

Python automated machine learning framework.
MIT License
29 stars 12 forks

Data squashing #89

Open firefly-cpp opened 3 months ago

firefly-cpp commented 3 months ago

Data squashing is also worth adding as a preprocessing method in the pipeline; it would probably be useful.

It is already implemented here: https://github.com/firefly-cpp/arm-preprocessing

LaurenzBeck commented 2 months ago

What does the squashing operation do? I found that arm-preprocessing just calls https://github.com/firefly-cpp/NiaARM/blob/main/niaarm/preprocessing.py#L34. Can this be implemented as a FeatureTransformAlgorithm?

firefly-cpp commented 2 months ago

"Data squashing is a preprocessing method that enables construction of smaller datasets from the original ones and provides approximately the same results of data analysis as the original."
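To make the idea concrete, here is a minimal sketch of one way squashing can work on purely numerical data: rows within a distance threshold of a seed row are merged into their mean. The function name `squash_rows`, the greedy grouping, and the Euclidean threshold are assumptions for illustration only; this is not the NiaARM or arm-preprocessing implementation.

```python
import numpy as np


def squash_rows(X, threshold):
    """Greedy squashing sketch (hypothetical helper, not the NiaARM code).

    Rows within `threshold` (Euclidean distance) of a seed row are merged
    into a single representative row, the mean of the group.
    """
    X = np.asarray(X, dtype=float)
    unassigned = np.ones(len(X), dtype=bool)
    representatives = []
    for i in range(len(X)):
        if not unassigned[i]:
            continue  # row already absorbed into an earlier group
        # Distance from the seed row i to every row in the dataset
        dist = np.linalg.norm(X - X[i], axis=1)
        # Group = still-unassigned rows close enough to the seed
        members = unassigned & (dist <= threshold)
        unassigned[members] = False
        # Replace the whole group by its mean
        representatives.append(X[members].mean(axis=0))
    return np.vstack(representatives)


X = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.0, 5.1]]
squashed = squash_rows(X, threshold=0.5)  # four rows collapse into two
```

The result has fewer rows than the input, which is why the operation changes the number of samples rather than the features.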

LaurenzBeck commented 1 month ago

I just revisited the ticket.

Based on my understanding of the method, it fits neither the feature_selection_algorithms category nor the feature_transform_algorithms category. I think a cleaner option would be to introduce a sample_selection or dataset_pruning component class with possible implementations:

Optionally, one could also repurpose feature_transform_algorithms into a general preprocessing component class.
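A rough sketch of what a separate sample_selection component could look like, assuming it returns a boolean mask over rows. Everything here is hypothetical: `SampleSelectionAlgorithm`, `select_samples`, and `RandomSubsampling` are names made up for illustration and are not part of NiaAML.

```python
from abc import ABC, abstractmethod

import numpy as np


class SampleSelectionAlgorithm(ABC):
    """Hypothetical base class for components that drop or keep rows."""

    @abstractmethod
    def select_samples(self, x, y):
        """Return a boolean mask over the rows of x marking samples to keep."""


class RandomSubsampling(SampleSelectionAlgorithm):
    """Toy implementation: keep a fixed random fraction of the rows."""

    def __init__(self, fraction=0.5, seed=0):
        self.fraction = fraction
        self.seed = seed

    def select_samples(self, x, y):
        rng = np.random.default_rng(self.seed)
        # Always keep at least one sample
        n_keep = max(1, int(round(self.fraction * len(x))))
        mask = np.zeros(len(x), dtype=bool)
        mask[rng.choice(len(x), size=n_keep, replace=False)] = True
        return mask


x = np.zeros((10, 3))
y = np.zeros(10)
mask = RandomSubsampling(fraction=0.5, seed=1).select_samples(x, y)
x_small, y_small = x[mask], y[mask]  # x and y are filtered with the same mask
```

Note that a mask only covers methods that keep a subset of the original rows. Squashing creates new representative rows, so it would need a method that returns a new (x, y) pair instead.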

Either way, given that most users probably work with rather small datasets (larger ones are, in my experience, the exception) and the current run times are acceptable, I think my time on this project is better spent on the other tickets.