P8-Team / App


Feature Selection til classification #24

Closed dkalaxdk closed 2 years ago

dkalaxdk commented 2 years ago

The goal is to select the features that improve the accuracy of the final model without overfitting it.

soenderby commented 2 years ago

Conclusion: tsfresh is a library for automatically extracting and selecting features from time series datasets. A very nice example of how the library is used can be seen here. Using this, all we need to do is create a pandas DataFrame with the format described here. As the example shows, it is also necessary to create a vector containing the class labels for the training data.

One thing to note is that I believe the details of what the library does are beyond the scope of our project. This means that we will likely be unable to explain how/why the specific features are created/selected in any meaningful detail.

The library is also described in this paper, which can be cited using the following BibTeX:

```bibtex
@article{CHRIST201872,
  title    = {Time Series FeatuRe Extraction on basis of Scalable Hypothesis tests (tsfresh – A Python package)},
  journal  = {Neurocomputing},
  volume   = {307},
  pages    = {72-77},
  year     = {2018},
  issn     = {0925-2312},
  doi      = {https://doi.org/10.1016/j.neucom.2018.03.067},
  url      = {https://www.sciencedirect.com/science/article/pii/S0925231218304843},
  author   = {Maximilian Christ and Nils Braun and Julius Neuffer and Andreas W. Kempa-Liehr},
  keywords = {Feature engineering, Time series, Feature extraction, Feature selection, Machine learning},
}
```

The algorithm used is also described in this paper, which can be cited using:

```bibtex
@article{DBLP:journals/corr/ChristKF16,
  author     = {Maximilian Christ and Andreas W. Kempa{-}Liehr and Michael Feindt},
  title      = {Distributed and parallel time series feature extraction for industrial big data applications},
  journal    = {CoRR},
  volume     = {abs/1610.07717},
  year       = {2016},
  url        = {http://arxiv.org/abs/1610.07717},
  eprinttype = {arXiv},
  eprint     = {1610.07717},
  timestamp  = {Sat, 23 Jan 2021 01:12:57 +0100},
  biburl     = {https://dblp.org/rec/journals/corr/ChristKF16.bib},
}
```

Should we wish to do something simpler, meaning easier to understand and explain, there is the option of scikit-learn. While this library would allow us greater control over the whole feature manipulation process, it would require a great deal more work to implement, and more research to understand.
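As a rough sketch of what the scikit-learn route could look like (the dataset and hyperparameters below are arbitrary placeholders, not choices from our project), a feature selection step can be wired into a Pipeline together with a classifier, so selection is re-fit inside each cross-validation fold:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X, y = load_iris(return_X_y=True)  # 150 samples, 4 features, 3 classes

pipeline = Pipeline([
    ("select", SelectKBest(score_func=f_classif, k=2)),  # keep 2 best features
    ("clf", LogisticRegression(max_iter=1000)),
])

# Selection happens inside each fold, avoiding selection bias in the estimate.
scores = cross_val_score(pipeline, X, y, cv=5)
print(scores.mean())
```

The benefit of the pipeline form is exactly the control mentioned above: every step (scoring function, number of features, classifier) is an explicit choice we can explain and tune.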

The entirety of the documentation/research can be found here (note: it may not be very well structured)

soenderby commented 2 years ago

What is feature selection?

Basic explanation courtesy of Wikipedia: “In machine learning and statistics, feature selection, also known as variable selection, attribute selection or variable subset selection, is the process of selecting a subset of relevant features (variables, predictors) for use in model construction.”

The article also mentions how it differs from feature extraction: “Feature selection techniques should be distinguished from feature extraction. Feature extraction creates new features from functions of the original features, whereas feature selection returns a subset of the features.”
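A small illustration of that distinction (my own example, using scikit-learn): a selection method returns columns of the original matrix unchanged, while an extraction method such as PCA computes entirely new columns from all of them.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Selection: each kept column is one of the original columns, verbatim.
subset = SelectKBest(f_classif, k=2).fit_transform(X, y)
selected_matches = any(np.allclose(subset[:, 0], X[:, j])
                       for j in range(X.shape[1]))

# Extraction: PCA components are linear combinations of (centered) columns,
# so they match no original column.
extracted = PCA(n_components=2).fit_transform(X)
extracted_matches = any(np.allclose(extracted[:, 0], X[:, j])
                        for j in range(X.shape[1]))

print(selected_matches, extracted_matches)  # True False
```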

How is it done? Mostly statistics (and a little magic). There are numerous algorithms and ways to analyze features to determine whether they are relevant.
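To make the statistics part a bit more concrete, here is one common filter approach: a univariate ANOVA F-test (scikit-learn's `f_classif`) scores each feature by how well its values separate the classes; a low p-value suggests the feature is relevant. The dataset is a stand-in for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import f_classif

X, y = load_iris(return_X_y=True)

# One F-statistic and p-value per feature column.
scores, pvalues = f_classif(X, y)
for i, (s, p) in enumerate(zip(scores, pvalues)):
    print(f"feature {i}: F = {s:.1f}, p = {p:.3g}")
```

A filter like this is only one family of methods; wrapper methods (e.g. recursive feature elimination) and embedded methods (e.g. L1 regularization) take different routes to the same question.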

This is a table taken from this paper. [table image]

dkalaxdk commented 2 years ago

Approved.

TrineHolmager commented 2 years ago

Approved