When is it possible to learn from data?

As noted by David Hume in An Enquiry concerning Human Understanding, we do not have any rigorous justification of induction, in other words, the possibility of extrapolating a general rule from a small amount of data does not follows from first principles and seems not granted. Yet the triumph of science suggests that many phenomena of our world have a sort of explainability property which allows the induction process.

If we want to create machines capable of completing the induction process we should investigate the conditions under which this explainability property applies, and learning from a small dataset is not impossible.

In this regard consider the following central result:

In which sense this theorem is granting the possibility of the induction process?
Which are the crucial assumptions of this theorem?
In the learning framework described by this theorem, in which situation a learner is vulnerable to overfitting?
Can you figure out the proof for this theorem?
Can you make some examples of phenomena for which the explainability property does not apply? Consider all the phenomena for which the theorem applies. For all of these, if you manage to find a hypothesis in accordance with a set of reliable points, that hypothesis is guaranteed to generalize with high probability. Yet, for some of them finding this hypothesis may be practically impossible, for instance because of computational complexity. We are saying that for these phenomena the explainability property does not apply. Can you list one or more of these phenomena?

[We will provide a reference next week]

Since it's already been a day, I'll try to address some of the points, even though my answers are incomplete and might be trivial (and/or wrong). I think it's useful to spell out in simple words what the theorem is telling us: if we sample a large enough number of training data, and if we find any hypothesis which is good on the training data (ie: h(xi)=f(xi) for each i) then with high probability our hypothesis will generalize well (ie: P[h(x)=f(x)] >= 1-eps for a new sampled x). So the theorem is giving us a sufficient condition under which the learning process is possibile with theoretical guarantees.

However there are some assumptions made by the theorem which are not really realistic: the hypothesis class H is finite, while we normally deal with infinite classes; moreover it assumes that the true labelling f is in H (so we have some prior knowledge about f, which we usually don't). Some other assumptions are realistic: it assumes the training set comes from the same distribution of the test data, and there is one assumption that looks even pessimistic to me: H is fixed before sampling the training set, while in practice we first look at the data and then choose the class of models we think will work best (neural networks, random forests, etc..). I guess this last assumption (together with |H| finite) is particularly useful in proving the theorem (even though I haven't managed to do it yet).

At first glance it might look like the theorem isn't taking into account overfitting, in fact it says "given enough data and small training error, we get small generalization error", I think the catch here is that the cardinality of H represents in some sense the complexity of the models we are considering, and the amount of data the theorem needs increases with |H| (and adding more data is a trivial way to fight overfitting). Then in this setting, an overfitting scenario would be one where we have a large |H| (ie: complex models) and small m (few training data).

The induction process consists in generalizing the explanation of a phenomenon and in this sense universalizing it. In other words the induction process give us the possibility to explain new phenomena from the knowledge of some other experiences. This theorem is granting the possibility of the induction process because from the knowledge of a finite set of points it is able to derivate a general rule of a hypothesis class. I think that the fundamental assumptions of this theorem are that H is a finite hypothesis class, we know the probability distribution D and that our dataset is drawn indipendently from D. These are some assumptions that usually we don't get access to. We can have some overfitting problems because we don't have any relation between the number of points in the data set and the number of parameters of our hypothesis. Moreover in learning process we usually wish to have a representative dataset for our phenomenon. In this theorem we don't have the guarantee that D is a representative distribution of the phenomenon. I think that the human learning process follows a nonlinear route which get started from the relationship with the reality (a dataset). Then through an intuition we come to some fundamental assumptions that we can't prove with a rational reasoning. Eventually with a logical reasoning we come to a deduction and we generalize it through induction. I think that the explainability property and then induction can fail if we crazily pretend to have a priori knowledge of the reality. For example if we pretend to use only our a priori knowledge to explain the phenomenon "The sun rises every day" we won't be able to predict that tomorrow the sun will rise because we are pretending to explain something without looking at the reality.

In which sense this theorem is granting the possibility of the induction process?

We are told that if we can find a good hypothesis s.t. we get m reliable sample points (i.e. predicted correctly) and we have a "sufficient" number of these points, then it's likely that our hypothesis will be correct on further points in the future. The likelihood of the existence of such hypothesis, in my opinion, is "granted" by the fact that the number m of points needed to have a "solid" hypothesis is logarithmic wrt |H|/delta, so we can, let's say, have a relatively small m to tolerate more failures.

Which are the crucial assumptions of this theorem?

First, the theorem assumes that the samples are independently drawn from D and that also all the "to be predicted" samples will also be drawn from it. Also, our hypothesis class H is assumed to be the one that contains the function f, but if H is the set of "probably correct answers" and f is the set of "all the functions S->{0,1} s.t. some are correct, many more are not", it's unlikely that H will contain f.

I agree on (nice answer Mirko):

there is one assumption that looks even pessimistic to me: H is fixed before sampling the training set, while in practice we first look at the data and then choose the class of models we think will work best (neural networks, random forests, etc..)

3.In the learning framework described by this theorem, in which situation a learner is vulnerable to overfitting?

I think that having a large number m of reliable points might cause overfitting (i.e. m >> 1/eps ln |H|/delta) since we're told that, under the right assumptions, m = 1/eps ln |H|/delta is sufficient.

Can you figure out the proof for this theorem?

coming...

5.Can you make some examples of phenomena for which the explainability property does not apply?

Let's say 2 students are perfectly equivalent under all the possible evaluable characteristics when applying for an internship. The committee head will eventually make a decision based upon personal experiences whereas a different head would've made a different choice. That choice process is unexplainable. In general, if we're not able to grasp on the complexity of a process (even if that has a foundation e.g. deep NNs) that might seem unexplainable.

Some comments on the answers given so far:

To @Mirko222

I think it's useful to spell out in simple words what the theorem is telling us: if we sample a large enough number of training data, and if we find any hypothesis which is good on the training data (ie: h(xi)=f(xi) for each i) then with high probability our hypothesis will generalize well (ie: P[h(x)=f(x)] >= 1-eps for a new sampled x).

This is a good rephrasing which helps to understand the theorem.

So the theorem is giving us a sufficient condition under which the learning process is possibile with theoretical guarantees.

Are you sure that the theorem is giving us a sufficient condition under which the learning process is possible? Is this theorem giving us a way to find a good hypothesis h?

However there are some assumptions made by the theorem which are not really realistic: the hypothesis class H is finite, while we normally deal with infinite classes; moreover it assumes that the true labelling f is in H [...] and there is one assumption that looks even pessimistic to me: H is fixed before sampling the training set

Do we normally deal with infinite classes without fixing H a priori? Consider the above assumptions in the following two learning from data scenarios very common in 2020s. For instance, which is the cardinality of H in these cases?

Deep learning. You train a very big neural network to do image classification. Say this NN has 100 million float32 parameters.
Physics. You study and make experiments on a phenomenon, you are supposed to explain it in a 10-page length paper in English containing approximately 10,000 words.

Then in this setting, an overfitting scenario would be one where we have a large |H| (ie: complex models) and small m (few training data).

Good catch. What could happen if we have a dataset with size k < m ?

To @elisabal

In other words the induction process give us the possibility to explain new phenomena from the knowledge of some other experiences.

You emphasized an important concept, new knowledge builds from old knowledge. With respect to the theorem, the knowledge of old experiences guides us in the choice of a proper hypothesis class H for a new experience. This position is somewhat opposite to the one of @Mirko222, which would look at data to choose the proper hypothesis class H:

in practice we first look at the data and then choose the class of models we think will work best

This theorem is granting the possibility of the induction process because from the knowledge of a finite set of points it is able to derivate a general rule of a hypothesis class.

I think that the fundamental assumptions of this theorem are that H is a finite hypothesis class, we know the probability distribution D and that our dataset is drawn indipendently from D. These are some assumptions that usually we don't get access to.

Same points I raised to Mirko222:

is this theorem giving us a way to find a good hypothesis h? The result of this theorem is weaker.
I would agree with you that some of these assumptions are not reasonable, especially the one on D, but why? Nevertheless, don’t we always use these assumptions? Aren't we always assuming a finite H and an iid sampling from D?

We can have some overfitting problems because we don't have any relation between the number of points in the data set and the number of parameters of our hypothesis.

The result of this theorem is a relation between the number of points in the data set (m) and the number of parameters of our hypothesis (|H|).

I think that the explainability property and then induction can fail if we crazily pretend to have a priori knowledge of the reality. For example if we pretend to use only our a priori knowledge to explain the phenomenon "The sun rises every day" we won't be able to predict that tomorrow the sun will rise because we are pretending to explain something without looking at the reality.

With only a priori knowledge you cannot say anything new on reality, I agree. But are we sure that a priori knowledge is not useful at all? Can we learn anything without any prior knowledge? If not, which is the prior knowledge we need?

To @Korov-ev

The likelihood of the existence of such hypothesis, in my opinion, is "granted" by the fact that the number m of points needed to have a "solid" hypothesis is logarithmic wrt |H|/delta

Nice catch! That logarithm is justifying the effective usage of small datasets to verify that a hypothesis generalizes.

First, the theorem assumes that the samples are independently drawn from D and that also all the "to be predicted" samples will also be drawn from it.

I agree with you that this assumption is crucial, do you consider it reasonable to model the learning problem? Do this framework leaves out important learning processes where the training distribution does not coincide with the test distribution?

Also, our hypothesis class H is assumed to be the one that contains the function f, but if H is the set of "probably correct answers" and f is the set of "all the functions S->{0,1} s.t. some are correct, many more are not", it's unlikely that H will contain f.

I did not quite understand, f is not a set, it is a single function.

I think that having a large number m of reliable points might cause overfitting (i.e. m >> 1/eps ln |H|/delta)

Actually, by definition the points are reliable precisely when every hypothesis in accordance with them generalizes (with a probability as high as you want). While overfitting occurs when you have a hypothesis in accordance with your training data but which do not generalize.

Let's say 2 students...

The question wanted a more precise answer, I will expand the question to make it more clear, giving also a hint.

Can you make some examples of phenomena for which the explainability property does not apply? Consider all the phenomena for which the theorem applies. For all of these, if you manage to find a hypothesis in accordance with a set of reliable points, that hypothesis is guaranteed to generalize with high probability. Yet, for some of them finding this hypothesis may be practically impossible, for instance because of computational complexity. We are saying that for these phenomena the explainability property does not apply. Can you list one or more of these phenomena?

To @noranta4

I did not quite understand, f is not a set, it is a single function.

I'll try to rephrase it, as I've written it confusingly: the cardinality of the set of all the functions S to {0,1} over the sample distribution is big. Thus we have to make the assumption that H (which "has" to be finite and "reasonably sized") will contain f:S->{0,1}.

Actually, by definition the points are reliable precisely when every hypothesis in accordance with them generalizes (with a probability as high as you want). While overfitting occurs when you have a hypothesis in accordance with your training data but which do not generalize.

Yeas, of course you are right, I totally missed out that one. True, if H is very complex and |m| does not grasp that, the chances of it being a good generalization are low.

Can you make some examples of phenomena for which the explainability property does not apply? Consider....

We discussed about the assumptions under whose this theorem holds. Thus if we rule out these assumptions we lose this property: if, for instance, want to predict the brownian motion of a particles system, further observation will not follow the distribution of the previous ones.

To @noranta4

Are you sure that the theorem is giving us a sufficient condition under which the learning process is possible? Is this theorem giving us a way to find a good hypothesis h?

No the theorem is not giving a way to find a good hypothesis: it just says that, given enough data, all the hypothesis perfect on the training set will also generalize well. So yes, it's not correct to say that it is a sufficient condition under which the learning process is possible because h might not even exists or be very hard to find. I guess what we can say is that if we have enough training data and we happen to find (in some unspecified way) an h in accordance with all the training data, then this is sufficient for h to have a small generalization error.

Do we normally deal with infinite classes without fixing H a priori? Consider the above assumptions in the following two learning from data scenarios very common in 2020s. For instance, which is the cardinality of H in these cases? [...]

Yes I didn't think about it, in practice the number of parameters is finite so also H is finite: in the first example |H|=2^(32 * 100million), in the second: 26^#characters (without considering special symbols which I guess are needed)

The reference promised: section 7 of Why Philosophers Should Care About Computational Complexity by Scott Aaronson.

The theorem proposed is a fundamental result of PAC learning, the discussion of the theorem in this paper contains the answers to all the posed questions, including the demonstration.

erodola / DLAI-s2-2021

When is it possible to learn from data? #8