Computational-Content-Analysis-2020 / Readings-Responses

Repository for organising "exemplary" readings and posting responses.

Classifying Meanings & Documents - Witten...& Pal 2017 #12

Open jamesallenevans opened 4 years ago

jamesallenevans commented 4 years ago

Post questions here about:

Witten, Ian H., Eibe Frank, Mark A. Hall, Christopher J. Pal. 2017. “Ensemble Learning”, Chapter 12 from Data Mining: Practical Machine Learning Tools and Techniques, 4th Edition: 351-371.

laurenjli commented 4 years ago

I like how this chapter introduced option trees and logistic model trees as ensemble methods that are more easily interpretable. Given that there are so many different types of ensemble methods that often serve the same purpose of making classification decisions more reliable and stable, how does one think about balancing interpretability with classification performance? Are there quantitative ways to set a threshold for when a better performing but more obscure classifier is worth the lack of interpretability?
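One way to at least put a number on that trade-off is to cross-validate an interpretable baseline next to the ensemble and look at the gap. A minimal sketch, assuming scikit-learn and a toy dataset standing in for a real document-term matrix (all parameter values here are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# toy data standing in for a real document-term matrix
X, y = make_classification(n_samples=1000, n_features=50, random_state=0)

# interpretable baseline: one shallow decision tree
tree = DecisionTreeClassifier(max_depth=4, random_state=0)
# harder-to-interpret ensemble: 200 randomized, bagged trees
forest = RandomForestClassifier(n_estimators=200, random_state=0)

tree_acc = cross_val_score(tree, X, y, cv=5).mean()
forest_acc = cross_val_score(forest, X, y, cv=5).mean()
print(f"single tree: {tree_acc:.3f}  random forest: {forest_acc:.3f}")
# the gap between the two scores is the accuracy being "bought"
# at the cost of interpretability on this particular dataset
```

Whether that gap is worth it still seems like a substantive judgment rather than something a fixed threshold can settle.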

katykoenig commented 4 years ago

In this reading, we learn that ensemble methods consistently perform better than single classifiers by reducing variance (and, for boosting, bias as well). Assuming that better performance means higher accuracy, how do ensemble methods affect precision and recall? Do both measures improve evenly as accuracy increases when using ensemble methods vs. single models, ceteris paribus?
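This seems easy to check empirically on any given dataset by reporting precision and recall alongside accuracy for a single model and an ensemble. A hedged scikit-learn sketch with made-up data and arbitrary settings:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score

# mildly imbalanced toy data, since imbalance is where precision and recall diverge
X, y = make_classification(n_samples=2000, n_features=30, weights=[0.7, 0.3],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for name, clf in [("single tree", DecisionTreeClassifier(random_state=0)),
                  ("bagged trees", BaggingClassifier(DecisionTreeClassifier(),
                                                     n_estimators=100,
                                                     random_state=0))]:
    pred = clf.fit(X_tr, y_tr).predict(X_te)
    print(name,
          f"acc={accuracy_score(y_te, pred):.3f}",
          f"prec={precision_score(y_te, pred):.3f}",
          f"rec={recall_score(y_te, pred):.3f}")
```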

ziwnchen commented 4 years ago

This reading about the ensemble modeling strategy is very useful. My question is about the number of models to be ensembled. For bagging and boosting, how can we know the optimal number t (the number of copies of the same model)? Or for stacking, which is even more complicated, how can we know whether a model should be included in the ensemble or not? Is it simply the more the better, or is there a threshold beyond which adding models does not yield much improvement (i.e., the information contained in the training set has its limit)?
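One pragmatic answer, at least for bagging, is to grow the ensemble until the cross-validated score stops improving. A rough scikit-learn sketch (toy data, arbitrary values of t):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1500, n_features=40, random_state=0)

# grow the ensemble and watch where the cross-validated score flattens out
for t in [1, 5, 10, 25, 50, 100, 200]:
    bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=t,
                            random_state=0)
    print(f"t={t:4d}  cv accuracy={cross_val_score(bag, X, y, cv=5).mean():.3f}")
# for bagging the curve typically plateaus; past that point extra models
# mostly cost computation rather than buy accuracy
```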

lkcao commented 4 years ago

My question is about method choice. From this chapter we know that bagging, boosting and stacking are all useful ensemble methods that can be applied to increase the performance of classifiers. However, what is the criterion for us to choose one method over the others? Under what circumstances do we make the decision about which one to use?

sanittawan commented 4 years ago

After having read the chapter, I am curious about two things.

(1) Is there a rule of thumb for how large the data set should be for ensemble learning models to perform much better than simpler learning models? (2) How does cross-validation for selecting the best-performing model fit into training and evaluating ensemble learning models? Are we using the training sets to tune the hyperparameters?
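On (2), the usual pattern (as I understand it, not something the chapter spells out in these terms) is to hold out a test set, tune hyperparameters by cross-validation inside the training set, and only touch the test set once at the end. A scikit-learn sketch with placeholder data and an arbitrary parameter grid:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=2000, n_features=30, random_state=0)
# hold out a test set that is never used for tuning
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# hyperparameters are tuned by 5-fold cross-validation on the training set only
grid = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid={"n_estimators": [100, 300], "learning_rate": [0.05, 0.1]},
    cv=5,
)
grid.fit(X_tr, y_tr)
print("best params:", grid.best_params_)
print("held-out accuracy:", grid.score(X_te, y_te))  # final check on untouched data
```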

wunicoleshuhui commented 4 years ago

I think the stacking model is very interesting and potentially very useful for making predictions. I'm wondering, however, in what specific cases the stacking model can reliably achieve verifiable success in prediction, and whether it is time-consuming to use or dependent on having sufficient computing resources.
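On the cost question, a quick way to get a feel for it is simply to time a stacked model; the expense is mostly CPU time (each base learner is refit once per fold plus once on the full training set), not memory. A scikit-learn sketch with toy data and arbitrarily chosen base learners:

```python
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=30, random_state=0)

stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
                ("nb", GaussianNB())],
    final_estimator=LogisticRegression(),  # the level-1 "meta" learner
    cv=5,  # each base learner is refit on 5 folds to build the meta-features
)

start = time.time()
score = cross_val_score(stack, X, y, cv=5).mean()
print(f"cv accuracy={score:.3f}, wall time={time.time() - start:.1f}s")
# cost is roughly (k + 1) x (sum of base-learner fits) per evaluation,
# so stacking mainly demands CPU time rather than much extra memory
```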

heathercchen commented 4 years ago

This chapter introduces several methods for ensembling different models in order to obtain a "perfect-at-hand" model with the fewest errors. I have a question about a specific sentence in section 8.3, Randomization. In this section, the authors argue that "More randomness generates more variety in the learner but makes less use of the data." Why is this true? We have not limited our choices to a certain subset of the data compared with models that do not introduce randomization, and we are still making inferences based on the overall data.
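My reading of that sentence (which may be off) is that each individual randomized learner only consults part of the available information, e.g. a random subset of features at each split, even though the ensemble as a whole still sees all the data. A small scikit-learn illustration with made-up data, using max_features to dial the per-split randomness up and down:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=50, n_informative=10,
                           random_state=0)

# max_features controls how much randomness goes into each split:
# every split considers only a random fraction of the 50 features,
# so any single tree "sees" less of the data's information per decision
for frac in [1.0, 0.5, 0.1]:
    rf = RandomForestClassifier(n_estimators=200, max_features=frac,
                                random_state=0)
    print(f"max_features={frac:>4}  "
          f"cv accuracy={cross_val_score(rf, X, y, cv=5).mean():.3f}")
```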

HaoxuanXu commented 4 years ago

The chapter on ensemble learning describes how aggregating weak learners can produce better predictions, both in accuracy and in generalization to new data. My question is how we should know the stopping point for boosting methods, since continued iteration can still improve performance on new data even when the training loss is already 0. I'd also love to know if there are additional ways to combine the votes of the individual trees besides using the mean.
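One common stopping rule (not specific to the chapter) is early stopping against an internal validation split: keep boosting only until that score stops improving. A scikit-learn sketch with toy data and arbitrary settings:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3000, n_features=30, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# keep boosting until the score on an internal validation split stops
# improving for 10 consecutive rounds, instead of fixing the number of
# rounds ahead of time
gb = GradientBoostingClassifier(
    n_estimators=1000,          # an upper bound, not the number actually used
    validation_fraction=0.1,
    n_iter_no_change=10,
    random_state=0,
)
gb.fit(X_tr, y_tr)
print("rounds actually used:", gb.n_estimators_)
print("held-out accuracy:", gb.score(X_te, y_te))
```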

tzkli commented 4 years ago

In the section about boosting, the authors mention that boosting often helps build classifiers that predict fresh data much more accurately than those generated by bagging. But they also state that "boosting sometimes fails in practical situations: it can generate a classifier that is significantly less accurate than a single classifier built from the same data," while bagging is not subject to this problem. This finding seems contradictory to me: how can boosting-induced classifiers be more accurate and less accurate at the same time? Are there any qualifications missing here?

bjcliang-uchi commented 4 years ago

I am interested in how to mathematically understand the bias-variance tradeoff for each model. Also, which of these methods can be generalized to unsupervised ML models, and how?

cindychu commented 4 years ago

In Witten et al.'s chapter, many ensemble techniques are introduced, which are all quite advanced to me, and I learned a lot. I have a question regarding the difference between bagging and cross-validation. Conceptually, they seem very similar: both train models on different sets of training data. So I am wondering, what is the main conceptual difference between these two 'methods' in application? How do they differ computationally? And is it possible to use both of them together when training a model?
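A tiny sketch of the mechanical difference, using ten toy instances: bagging draws bootstrap samples with replacement and combines the models trained on them, while cross-validation makes disjoint folds and only uses them to estimate performance (the scikit-learn utilities here are just for illustration):

```python
import numpy as np
from sklearn.utils import resample
from sklearn.model_selection import KFold

data = np.arange(10)  # ten toy instances

# bagging: each "bag" is a bootstrap sample, drawn WITH replacement,
# the same size as the original set; the models trained on the bags are combined
bag1 = resample(data, replace=True, n_samples=len(data), random_state=1)
bag2 = resample(data, replace=True, n_samples=len(data), random_state=2)
print("bags:", bag1, bag2)  # duplicates and omissions are expected

# cross-validation: disjoint folds, each instance held out exactly once;
# used to ESTIMATE performance, after which one final model is trained
for train_idx, test_idx in KFold(n_splits=5).split(data):
    print("train:", data[train_idx], "test:", data[test_idx])
```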

jsmono commented 4 years ago

This is a great introduction to the data mining field! I think the authors did a great job describing how each method functions, but I'm wondering if there are examples of how bagging, randomization, boosting, and additive regression are applied in real life. Are there any platforms or technologies we use that apply these approaches to data?

luxin-tian commented 4 years ago

Regarding the bias-variance decomposition part and the authors' comments on the version they choose, I did not quite understand what it means for the variance to be negative. It is intuitively plausible that aggregating results from different models can increase the overall bias, but since a variance term is always non-negative, how can we see this within the theoretical probability framework?
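For reference, the familiar squared-error version of the decomposition is written out below (this is not the exact 0-1-loss version the chapter discusses, just the textbook form). Every term here is non-negative by construction, which is presumably why a "negative variance" can only arise in the classification analogues, where the term is no longer a true variance of anything.

```latex
% squared-error bias-variance decomposition, for a fixed input x,
% over training sets D and noise eps (the chapter's 0-1-loss version
% replaces these terms with misclassification-based analogues)
\[
\mathbb{E}_{D,\varepsilon}\!\left[(y - \hat{f}_D(x))^2\right]
  = \underbrace{\left(f(x) - \mathbb{E}_D[\hat{f}_D(x)]\right)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}_D\!\left[\left(\hat{f}_D(x) - \mathbb{E}_D[\hat{f}_D(x)]\right)^2\right]}_{\text{variance}}
  + \underbrace{\sigma_\varepsilon^2}_{\text{noise}}
\]
```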

cytwill commented 4 years ago

I have some questions similar to @ziwnchen's: are there any benchmarks for choosing the parameters in these ensembling methods? Some of these methods essentially build a model on top of multiple eligible models, so do you think there are rules for setting the benchmark for those original models? Also, how can we decide whether to use bagging, stacking, or boosting if the number of eligible models is too large to try every possible combination (supposing we do not want to do so~)?

arun-131293 commented 4 years ago

Regarding interpretability, it has to be kept in mind that in modern statistical learning the goal is not to follow the Newtonian ideal of discovering simplicity in nature by isolating the multiple processes that generate the data and studying each one with the aim of finding the simplest possible principled explanation for it. Rather, the goal here is to build a model (which could consist of other models) that can be as complex as it needs to be in order to capture as much of the complexity in the data as possible. This split between the two schools of thought is best exemplified by the tensions between the linguist Noam Chomsky (an advocate of the Newtonian method in studying natural language) and Peter Norvig (director of research at Google, who works on Natural Language Processing using modern statistical methods). Norvig himself discusses the two schools of thought here: http://norvig.com/chomsky.html. Therefore, interpretability in this context should not be confused with principled explanation, where an explanation is given in accordance with certain first principles of the field of study, like conservation laws in physics. To be honest, I think interpretability in this context just means you can justify a Frankenstein model by being able to track its decision process better.

deblnia commented 4 years ago

I'd appreciate some more explanation on cross-validation. The authors propose it as a stopping condition to prevent overfitting in additive regression, but resampling procedures like cross-validation 1) require enough data that the split groups of data are meaningful and 2) do not produce any method of comparing one ensemble method against another.
How much data should we have? And how do we compare different models?

adarshmathew commented 4 years ago

The concept of Stacking and creating meta-learners in Witten et al. (2017) caught my attention. Primarily because it provides an alternative to averaging the results from the constituent models. From the chapter:

Although developed some years ago, it is less widely mentioned in the machine learning literature than bagging and boosting, partly because it is difficult to analyze theoretically and partly because there is no generally accepted best way of doing it—the basic idea can be applied in many different variations.

Could you provide us with an example or two of a successful and/or interpretable application of Stacking, and the advances that have been introduced in current literature that makes the approach easier to analyze theoretically?
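Not a published application, but the mechanics are easy to see by building the level-1 data by hand from out-of-fold predictions; with a logistic regression as the meta-learner, its coefficients give at least a partial handle on interpretability. A sketch with toy data and arbitrarily chosen base models (scikit-learn assumed):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, n_features=30, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

base = [RandomForestClassifier(n_estimators=100, random_state=0), GaussianNB()]

# level-1 features: out-of-fold predicted probabilities from each base model,
# so the meta-learner never sees predictions made on a model's own training data
meta_X = np.column_stack([
    cross_val_predict(m, X_tr, y_tr, cv=5, method="predict_proba")[:, 1]
    for m in base
])
meta = LogisticRegression().fit(meta_X, y_tr)

# at prediction time, the base models are refit on all the training data
meta_X_te = np.column_stack([
    m.fit(X_tr, y_tr).predict_proba(X_te)[:, 1] for m in base
])
print("stacked accuracy:", meta.score(meta_X_te, y_te))
print("meta-learner weights per base model:", meta.coef_)
```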

skanthan95 commented 4 years ago

I wanted to learn more about how these statistical models are used to classify categorical data that isn’t from text (instead, from audio, images, etc). How are topics for sorting determined and what are the overlaps and differences in the process for text data?

di-Tong commented 4 years ago

As @clk16 and @sanittawan have mentioned, I wonder if there's a rule of thumb for us to choose between models and tune the hyperparameters in real applications.

ccsuehara commented 4 years ago

I'd like to know whether it's a good strategy, in general, to estimate our parameters using all the proposed models and look at each one's individual performance, or, on the contrary, to choose the best one a priori.

alakira commented 4 years ago

In a content analysis study, how much model interpretability is needed? Ensemble learning is an essential way to improve performance, but when should we avoid this kind of incomprehensible method?

rachel-ker commented 4 years ago

I find the idea of interpretable ensembles very interesting. I was wondering whether the interpretability also sacrifices the predictive power of ensembles, or does this depend on the data? Also, as some others have asked above, when choosing models should we empirically test performance across a range of different models to determine the method, or should it be chosen based on theoretical assumptions?

kdaej commented 4 years ago

To use bagging, the dataset needs to be randomized so that each bag is equally representative of the original corpus. The reading states that some learning algorithms already have a built-in random component. However, in some cases, we may already have some idea of how the dataset might be categorized. If that is the case, should we take the prior knowledge on the categories into account when we generate randomized subsets?
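If the concern is that a rare, already-known category could disappear from some bags, one option (a guess on my part, not something the chapter prescribes) is a stratified bootstrap that preserves the category proportions in every bag. A small illustration with made-up labels:

```python
import numpy as np
from sklearn.utils import resample

# toy labels: an imbalanced, pre-known categorization of the corpus
y = np.array([0] * 8 + [1] * 2)
X = np.arange(10).reshape(-1, 1)

# plain bootstrap: a rare category can vanish from some bags entirely
Xb, yb = resample(X, y, replace=True, random_state=0)
print("plain bag labels:     ", yb)

# stratified bootstrap: each bag keeps the original class proportions,
# one way to fold prior knowledge of the categories into the resampling
Xs, ys = resample(X, y, replace=True, stratify=y, random_state=0)
print("stratified bag labels:", ys)
```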

YanjieZhou commented 4 years ago

When using bagging, we use several models collectively to produce a better result, which is fantastic, but how do we determine the boundary for choosing proper models? Or, in other words, do we need to filter the models by their error rates, or do we simply include as many models as we like?

Lizfeng commented 4 years ago

This chapter discusses the assumptions behind ensemble learning. Using the bias-variance decomposition, we can separate out the effects of combining multiple hypotheses. Bagging is designed to reduce the variance arising from the training set, while boosting focuses on seeking models that complement one another. One of the big disadvantages of ensemble learning is that it is often uninterpretable. However, using option trees and logistic model trees, we can still interpret the outcome. While ensemble learning improves a model's predictive performance, I think more statistical theory needs to be developed to support its usage, beyond the additive models.

VivianQian19 commented 4 years ago

The chapter on ensemble learning is very useful and gives an introduction to various ensemble schemes such as bagging, boosting, and stacking. While these schemes have obvious implications for computer science research, I wonder how useful these methods are for social science research. And is there a trade-off in the interpretability of the model when we increase the model's accuracy?

yaoxishi commented 4 years ago

I am wondering, for model validation, in addition to using the built-in validation methods, are there any other ways we could verify whether we used the right model or not?