Andrea Boskovic (2021):
In statistics, we often want to predict a response variable based on data. Binary classification
is one example of this setting where the response variable takes on two possible values.
Classification techniques then aim to classify this response, also known as the class, based
on data in a way that maximizes accuracy. We are particularly interested in the classification
of imbalanced data, a common data type in medical settings and fraud detection,
where the number of instances in each class drastically differs. Canonical classification
methods, including neural networks, often perform poorly on these imbalanced
datasets. We show that higher imbalance levels, where the disparity between the
number of instances in each class is large, degrade performance more on harder
classification tasks than on easier ones. Through this investigation of how imbalance
levels in both synthetic and real-world datasets affect classification performance, we can
better understand how to mitigate this issue.
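A minimal sketch of why imbalance is a problem for standard evaluation (the 95:5 split and majority-class baseline below are illustrative assumptions, not results from the thesis): a classifier that ignores the minority class entirely can still score high plain accuracy.

```python
# Hypothetical 95:5 imbalanced labels and a degenerate majority-class predictor.
y_true = [0] * 95 + [1] * 5   # 95 majority instances, 5 minority instances
y_pred = [0] * 100            # always predict the majority class

# Plain accuracy rewards ignoring the minority class.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Balanced accuracy (mean of per-class recalls) exposes the failure.
recall_majority = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0) / 95
recall_minority = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1) / 5
balanced_accuracy = (recall_majority + recall_minority) / 2

print(accuracy)           # 0.95
print(balanced_accuracy)  # 0.5
```

The gap between the two scores grows with the imbalance level, which is why imbalance-aware metrics are standard in this setting.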
Jasper Flint (2021):
The past decade has seen a massive increase in the number of digital images created.
These images represent a large source of potential data, but in their original form,
they cannot be analyzed by statistical techniques such as supervised learning to
identify the subject represented by an image. The process of feature extraction serves as
an intermediate step of analysis, transforming the high-dimensional original images
into low-dimensional vectors, called features, for use in classification. In this thesis,
we begin by introducing and detailing several feature extraction algorithms: Local
Binary Patterns (LBP), Gabor Filters, Histogram of Oriented Gradients (HOG), and
Oriented FAST and Rotated BRIEF (ORB). We then present several supervised
learning algorithms, called classifiers, and statistical tests to compare the performance
of these classifiers. Finally, we apply combinations of feature extraction methods and
classifiers to three large image sets with multiple classes. Our findings indicate a
significant difference in prediction accuracy among different feature extraction methods,
but not among different classification methods.
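As an illustration of the first feature extraction method above, a minimal sketch of the basic 3x3 Local Binary Pattern code: each pixel's eight neighbors are thresholded against the center value and packed into a byte (the neighbor ordering and the toy image are illustrative assumptions; the LBP feature vector for a whole image is the histogram of these codes).

```python
def lbp_code(img, r, c):
    """Basic 3x3 Local Binary Pattern code for pixel (r, c) of a grayscale image."""
    center = img[r][c]
    # Eight neighbors in clockwise order starting from the top-left (a common choice).
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    code = 0
    for bit, (dr, dc) in enumerate(offsets):
        # Set the bit when the neighbor is at least as bright as the center.
        if img[r + dr][c + dc] >= center:
            code |= 1 << bit
    return code

# Toy 3x3 grayscale image; only the center pixel has a full neighborhood.
img = [
    [10, 20, 30],
    [40, 25, 50],
    [ 5, 60, 70],
]
print(lbp_code(img, 1, 1))  # 188
```

Because the code depends only on relative intensities, LBP features are invariant to monotonic illumination changes, one reason they work well as low-dimensional texture descriptors.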
Kenny Chen (2022):
Categorical data analysis with ordinal responses is important in fields such as the
social sciences because when we take into consideration the intrinsic ordering of ordinal
variables, we can often obtain more powerful inferences. One step in categorical
analysis is exploring the various dependence structures among the variables for exploratory
modeling. A dependence structure of particular interest is regression
dependence, for which many model-based approaches have been constructed.
However, there are comparatively fewer model-free approaches to examining dependence
structures in categorical data, and most of these do not focus on regression
dependence. To address this, Wei & Kim (2021) proposed a new model-free measure
based on the checkerboard copula and demonstrated its ability to identify and
quantify the regression dependence in multivariate categorical data with an ordinal
response variable and categorical (nominal or ordinal) explanatory variables in an
exploratory manner. This thesis explores their novel measure and the methodology
behind it. In addition, we extend their work by proposing a model-based estimator
of their measure. We conduct simulation studies to evaluate the performance of the
model-free and model-based estimators. Initial results demonstrated that model-based
estimates of the measure from well-fitted models were comparable to model-free
estimates, suggesting further exploration of the model-free estimator as a
goodness-of-fit measure.
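The Wei & Kim measure itself is beyond the scope of a sketch, but the checkerboard copula it builds on can be illustrated: for a two-way contingency table, the probability mass of cell (i, j) is spread uniformly over the rectangle cut out by the marginal CDFs, giving constant density p_ij / (p_i+ p_+j) on that rectangle (the example table is a hypothetical independence case, not data from the thesis).

```python
def checkerboard_density(table):
    """Checkerboard copula density on each cell of a two-way contingency table.

    Cell (i, j) of the copula has constant density p_ij / (p_i+ * p_+j),
    where p_i+ and p_+j are the row and column marginal probabilities.
    """
    total = sum(sum(row) for row in table)
    probs = [[count / total for count in row] for row in table]
    row_marg = [sum(row) for row in probs]
    col_marg = [sum(probs[i][j] for i in range(len(probs)))
                for j in range(len(probs[0]))]
    return [[probs[i][j] / (row_marg[i] * col_marg[j])
             for j in range(len(col_marg))]
            for i in range(len(probs))]

# Counts proportional to the product of the margins: the two variables are
# independent, so the checkerboard density is 1 on every cell.
dens = checkerboard_density([[10, 30], [20, 60]])
print(dens)
```

Departures of this density from 1, concentrated along the ordinal direction of the response, are the kind of structure a regression-dependence measure aims to quantify.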