dashaasienga / Statistics-Senior-Honors-Thesis


abstract examples #37

Closed katcorr closed 2 months ago

katcorr commented 3 months ago

Andrea Boskovic (2021):

In statistics, we often want to predict a response variable based on data. Binary classification is one example of this setting, where the response variable takes on two possible values. Classification techniques aim to classify this response, also known as the class, based on data in a way that maximizes accuracy. We are particularly interested in the classification of imbalanced data, common in medical settings and fraud detection, where the number of instances in each class differs drastically. Canonical classification methods, including neural networks, often perform poorly on these imbalanced datasets. We show that more severe imbalance levels, where the disparity between the number of instances in each class is large, affect performance on harder classification tasks more than on easier ones. Through this investigation of how imbalance levels in both synthetic and real-world datasets affect classification performance, we can better understand how to mitigate this issue.
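
As a minimal sketch of the kind of experiment this abstract describes (not the thesis code), the snippet below varies the imbalance level and task difficulty of a synthetic dataset and reports balanced accuracy for a baseline classifier. Using `class_sep` as a stand-in for task difficulty and logistic regression as the classifier are assumptions for illustration only.

```python
# Hedged sketch: how imbalance level and task difficulty might be varied
# in synthetic data, in the spirit of the abstract above.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

def evaluate(minority_frac, class_sep, seed=0):
    """Fit a baseline classifier on synthetic data with the given imbalance
    level (minority-class fraction) and difficulty (class separation)."""
    X, y = make_classification(
        n_samples=5000, n_features=10, n_informative=5,
        weights=[1 - minority_frac, minority_frac],
        class_sep=class_sep, flip_y=0.01, random_state=seed,
    )
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=seed
    )
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return balanced_accuracy_score(y_te, clf.predict(X_te))

for sep in (0.5, 2.0):             # harder vs. easier task
    for frac in (0.5, 0.1, 0.01):  # balanced -> severely imbalanced
        print(f"class_sep={sep}, minority={frac:.2f}: "
              f"balanced acc = {evaluate(frac, sep):.3f}")
```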

Jasper Flint (2021):

The past decade has seen a massive increase in the number of digital images created. These images represent a large source of potential data, but in their original form they cannot be analyzed by statistical techniques such as supervised learning to identify the subject an image represents. Feature extraction serves as an intermediate step of analysis, transforming the high-dimensional original images into low-dimensional vectors, called features, for use in classification. In this thesis, we begin by introducing and detailing several feature extraction algorithms: Local Binary Patterns (LBP), Gabor Filters, Histogram of Oriented Gradients (HOG), and Oriented FAST and Rotated BRIEF (ORB). We then present several supervised learning algorithms, called classifiers, along with statistical tests to compare their performance. Finally, we apply combinations of feature extraction methods and classifiers to three large image sets with multiple classes. Our findings indicate a significant difference in prediction accuracy among feature extraction methods, but not among classification methods.
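
To make the feature-extraction-then-classify pipeline concrete, here is a minimal sketch (not the thesis code) pairing one of the named methods, HOG, with a linear SVM. The dataset (scikit-learn's small 8x8 digits) and the HOG parameters are assumptions chosen only so the example runs quickly.

```python
# Hedged sketch: one feature extraction / classifier combination of the kind
# described above -- HOG features fed to a linear SVM.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC
from skimage.feature import hog

digits = load_digits()  # 8x8 grayscale digit images

def hog_features(img):
    """Reduce one image to a low-dimensional HOG descriptor."""
    return hog(img, orientations=8, pixels_per_cell=(4, 4),
               cells_per_block=(1, 1))

X = np.array([hog_features(img) for img in digits.images])
y = digits.target

# Compare raw pixels vs. HOG features with the same classifier.
for name, feats in [("raw pixels", digits.data), ("HOG", X)]:
    scores = cross_val_score(LinearSVC(max_iter=5000), feats, y, cv=5)
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```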

Kenny Chen (2022):

Categorical data analysis with ordinal responses is important in fields such as the social sciences because taking the intrinsic ordering of ordinal variables into consideration often yields more powerful inferences. One step in categorical data analysis is exploring the dependence structures among the variables for exploratory modeling. A dependence structure of particular interest is regression dependence, for which many model-based approaches have been constructed. However, there are comparatively few model-free approaches to examining dependence structures in categorical data, and most of these do not focus on regression dependence. To address this, Wei & Kim (2021) proposed a new model-free measure based on the checkerboard copula and demonstrated its ability to identify and quantify regression dependence in multivariate categorical data with an ordinal response variable and categorical (nominal or ordinal) explanatory variables in an exploratory manner. This thesis explores their novel measure and the methodology behind it. In addition, we extend their work by proposing a model-based estimator of their measure. We conduct simulation studies to evaluate the performance of the model-free and model-based estimators. Initial results demonstrate that model-based estimates of the measure from well-fitted models are comparable to the model-free estimates, suggesting further exploration of using the model-free estimator as a goodness-of-fit measure.
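
For intuition only, the sketch below shows the basic checkerboard copula construction that underlies the measure discussed in this abstract: a piecewise-constant copula density built from a two-way contingency table. It is not Wei & Kim's measure or either estimator, and the example table is made up for illustration.

```python
# Hedged sketch: checkerboard copula density of a 2-way contingency table
# (rows = explanatory categories, columns = ordinal response categories).
import numpy as np

def checkerboard_copula_density(table):
    """Return a function c(u, v) giving the checkerboard copula density of the
    joint distribution in `table`: constant on each rectangle determined by the
    cumulative marginal probabilities, with value p_ij / (p_i+ * p_+j)."""
    p = np.asarray(table, dtype=float)
    p /= p.sum()                               # joint cell probabilities p_ij
    row_marg, col_marg = p.sum(axis=1), p.sum(axis=0)
    row_cum = np.concatenate(([0.0], np.cumsum(row_marg)))  # breakpoints on u
    col_cum = np.concatenate(([0.0], np.cumsum(col_marg)))  # breakpoints on v

    def density(u, v):
        # locate the rectangle containing (u, v); the density is constant on it
        i = np.searchsorted(row_cum, u, side="right") - 1
        j = np.searchsorted(col_cum, v, side="right") - 1
        i, j = min(i, p.shape[0] - 1), min(j, p.shape[1] - 1)
        return p[i, j] / (row_marg[i] * col_marg[j])

    return density

# Hypothetical 3x3 table with a monotone (regression-like) pattern.
table = [[30, 15, 5],
         [10, 25, 10],
         [5, 15, 30]]
c = checkerboard_copula_density(table)
print(c(0.1, 0.1), c(0.9, 0.9), c(0.1, 0.9))  # larger on the "diagonal" cells
```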