blei-lab / causal-text-embeddings

Software and data for "Using Text Embeddings for Causal Inference"
MIT License

Question #3

Closed · srn284 closed 4 years ago

srn284 commented 4 years ago

Dear Blei Lab, I had a quick question. I was wondering why, in the compute-ATE section, you use the training data, i.e., this line of code:

    all_estimates = ate_estimates(q_t0[in_train], q_t1[in_train], g[in_train], t[in_train], y[in_train],
                                  truncate_level=0.03)

It looks like the training points are used rather than the test points (from the k-fold split of the dataset)? I was just wondering about the effects of overfitting on the training data (if it happens) in this case. The corresponding line of code for the test data has been commented out; I was wondering if that's by design? I might have misunderstood something, so I apologise if that's the case. Thank you very much.

vveitch commented 4 years ago

That's a great question :)

First, notice that it's not clear we should do any data splitting at all in this case---we're trying to estimate a parameter of the world, not assess the predictive accuracy of a model. The conventional statistics approach would just be to use all the data to train the model, and then use all the data again in the estimation step. That's effectively what we're doing.

So why is there even code to data split? It turns out that certain theoretical guarantees about efficient estimation rely on a notion of 'not overfitting'. This isn't required for asymptotic guarantees (consistency), just for the 'efficient' use of data. It further turns out that splitting the data and using the held-out data in the estimation step is sufficient (but not necessary) to guarantee this 'not overfitting' condition. See https://academic.oup.com/ectj/article/21/1/C1/5056401 for an extensive discussion. The next obvious thing to do is then to split the data into k folds and estimate the parameter as the average of the k split-data estimates (that's the Chernozhukov cross-fitting procedure). The semi-parametric estimation code was written to allow us to do that easily.
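For concreteness, here is a minimal sketch (not the repo's actual code) of that cross-fitting procedure. It assumes the `ate_estimates(q_t0, q_t1, g, t, y, truncate_level)` signature from the snippet in the question, and that the function returns a dict mapping estimator names to point estimates; `cross_fit_ate` and `fold_ids` are hypothetical names introduced here for illustration:

    import numpy as np

    def cross_fit_ate(ate_estimates, q_t0, q_t1, g, t, y, fold_ids, truncate_level=0.03):
        """Average ATE estimates over held-out folds (Chernozhukov-style cross-fitting).

        fold_ids[i] is the fold that example i belongs to; q_t0, q_t1, g at those
        indices are assumed to come from nuisance models fit *without* that fold,
        which is the 'not overfitting' condition discussed above.
        """
        per_fold = []
        for k in np.unique(fold_ids):
            held_out = fold_ids == k
            per_fold.append(ate_estimates(q_t0[held_out], q_t1[held_out], g[held_out],
                                          t[held_out], y[held_out],
                                          truncate_level=truncate_level))
        # The cross-fit estimate is the average of the k held-out-fold estimates.
        return {name: np.mean([est[name] for est in per_fold]) for name in per_fold[0]}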

So why do we not actually do it? Because in every situation where I've tried the cross-fitting approach, across several projects and several data domains, it has worked worse than just using all of the data for both training and estimation. The key is that the data-splitting approach is theoretically sufficient but not necessary. It seems that in deep-learning settings it's not necessary in practice, and indeed it ends up hurting.
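Concretely, and still under the hypothetical sketch above, the two options being compared look like this; the second call mirrors the all-data approach described here:

    # Cross-fitting: average the estimates over the held-out folds.
    cross_fit = cross_fit_ate(ate_estimates, q_t0, q_t1, g, t, y, fold_ids,
                              truncate_level=0.03)

    # All data: the nuisance models were trained on everything, and everything
    # is reused in the estimation step.
    full_data = ate_estimates(q_t0, q_t1, g, t, y, truncate_level=0.03)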

So, in short, it's by design :)

srn284 commented 4 years ago

Dear Victor, Amazing answer. :) Thank you. Seriously, GitHub is filled with lackluster answers to queries, so I appreciate it.

Yes indeed, I was vaguely familiar with these guarantees and with the fact that they are important. So in this case we don't need k folds and can just work on the whole dataset?

Thanks again, Shishir


vveitch commented 4 years ago

" So in this case we don’t need k folds and just work on the whole dataset?"

Yes, exactly.