PCA: technique for reducing the dimensionality of a dataset, increasing interpretability while minimizing information loss. It does so by creating new uncorrelated variables (principal components) that successively maximize variance. For quantitative data only. Often used to simplify exploratory analyses or to prepare data for machine learning pipelines.
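A minimal sketch of the idea in plain NumPy (SVD on centered data; the dataset and variable names are illustrative, and in practice a library implementation such as scikit-learn's would be used):

```python
import numpy as np

# Toy quantitative dataset: 6 samples, 3 strongly correlated features
X = np.array([[2.5, 2.4, 1.0],
              [0.5, 0.7, 0.2],
              [2.2, 2.9, 1.1],
              [1.9, 2.2, 0.9],
              [3.1, 3.0, 1.3],
              [2.3, 2.7, 1.0]])

Xc = X - X.mean(axis=0)          # center each variable
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

explained = s**2 / np.sum(s**2)  # variance ratio per component
scores = Xc @ Vt[:2].T           # project samples onto the first 2 components
```

Because the features are correlated, the first component captures most of the variance, so the 2-D `scores` lose little information.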
FCA: the goal of this analysis is to define new latent variables that we can understand and interpret in a business / practical manner. We can rotate the solution until we find latent variables that have a clear interpretation and "make sense". It is more conceptual than PCA. Typically applied to numerical variables, with the aim of uncovering interpretable latent factors rather than predicting a target.
MCA: used to detect and represent underlying structures in a data set. In short, it is the analogue of PCA for categorical data. For nominal categorical data.
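A rough sketch of the idea (one-hot encoding of nominal variables followed by a PCA-style decomposition; real MCA adds chi-square-style weighting of rows and columns, so this is only an approximation, with illustrative data):

```python
import numpy as np

# Nominal data: each row is an observation, each column a categorical variable
colors = ["red", "blue", "red", "green", "blue", "red"]
sizes  = ["S", "M", "S", "L", "M", "S"]

def one_hot(values):
    """Indicator matrix: one 0/1 column per category."""
    cats = sorted(set(values))
    return np.array([[v == c for c in cats] for v in values], dtype=float)

Z = np.hstack([one_hot(colors), one_hot(sizes)])  # full indicator matrix
Zc = Z - Z.mean(axis=0)                           # center the indicators
U, s, Vt = np.linalg.svd(Zc, full_matrices=False)
coords = Zc @ Vt[:2].T   # 2-D representation of the observations
```

Observations with identical category profiles land on the same point, which is what makes the map useful for spotting structure.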
Clustering: unsupervised ML technique. Groups objects so that objects in the same cluster are more similar to each other than to objects in other clusters. The assignment to clusters is done using criteria such as smallest distances, density of data points, graphs, or various statistical distributions.
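As an illustration of the smallest-distance criterion, a tiny k-means clusterer in plain NumPy (a library implementation would normally be used; the two-blob data is made up for the example):

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Minimal k-means: assign each point to its nearest centroid,
    then move each centroid to the mean of its assigned points."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # distance of every point to every centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

# Two well-separated blobs -> k-means recovers the grouping
X = np.vstack([np.random.default_rng(1).normal(0, 0.3, (20, 2)),
               np.random.default_rng(2).normal(5, 0.3, (20, 2))])
labels, centroids = kmeans(X, k=2)
```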
Linear Regression: used to predict the value of a target variable from one or more explanatory variables. For datasets where the outcome is a numeric variable, when we want to predict the evolution of a continuous variable.
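A quick sketch with NumPy's least-squares solver (the data is synthetic, generated from a known line plus noise, so we can see the fit recover it):

```python
import numpy as np

# Illustrative data following y = 2x + 1 with a little noise
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2 * x + 1 + rng.normal(0, 0.1, size=x.size)

# Fit y = slope * x + intercept by ordinary least squares
A = np.column_stack([x, np.ones_like(x)])
(slope, intercept), *_ = np.linalg.lstsq(A, y, rcond=None)

prediction = slope * 12 + intercept  # extrapolate the continuous trend
```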
Logistic Regression: used to predict the value of a target variable from one or more explanatory variables, in this case a binary categorical outcome. Works well for datasets with a reduced number of features.
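The idea in a few lines of NumPy (plain gradient descent on the log-loss, with made-up one-feature data; in practice a library such as scikit-learn would be used):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Illustrative 1-feature data: class 1 when x is large
X = np.array([[0.5], [1.0], [1.5], [2.0], [3.0], [3.5], [4.0], [4.5]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

Xb = np.column_stack([np.ones(len(X)), X])   # add an intercept column
w = np.zeros(Xb.shape[1])
for _ in range(5000):                        # gradient descent on the log-loss
    p = sigmoid(Xb @ w)
    w -= 0.1 * Xb.T @ (p - y) / len(y)

def predict(x):
    """Predicted class: 1 if estimated probability exceeds 0.5."""
    return (sigmoid(w[0] + w[1] * x) > 0.5).astype(int)
```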
Support Vector Machine (SVM): same conditions as for logistic regression, but also handles non-linear datasets (which cannot be separated by a line) thanks to kernel functions. The soft-margin formulation makes it reasonably tolerant of outliers. Effective in high-dimensional spaces, even when the number of dimensions is greater than the number of samples (e.g. 1000 columns for 600 rows). For datasets with a lot of features but not a lot of data.
Artificial Neural Network (ANN): flexible models that can learn complex non-linear relationships. For datasets with a lot of features and a lot of data, since they need many samples to train well.
Deep Learning: ANN models with many hidden layers (and often millions of parameters). Requires a lot of computing power.
Recall value for classification models: true positives / all actual positives, i.e. TP / (TP + FN). Tells us what fraction of the actual positive cases the model manages to find; a low recall means many positives were missed.
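For example, in pure Python with made-up labels:

```python
# Recall = TP / (TP + FN): fraction of actual positives the model found
def recall(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn)

y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]  # model found 2 of the 4 actual positives
print(recall(y_true, y_pred))  # → 0.5
```

Note that the false positive in `y_pred` does not affect recall; it would lower precision instead.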