UChicago-Computational-Content-Analysis / Frequently-Asked-Questions


extrapolated data #15

Open facundosuenzo opened 2 years ago

facundosuenzo commented 2 years ago

Hi all, I'm sorry, maybe this is too silly, but what does "extrapolated (uncoded data)" in the exercises for Week 5 mean? The information in the notebook covers training and test sets, but I'm not sure how we are supposed to work with uncoded data. Thanks!

jacyanthis commented 2 years ago

It just means data that is in neither your training set nor your test set. The typical approach is to have some coded data (say, 100 documents), then do a training/test split (say, 80 and 20) where the model learns from the training data and is then evaluated on the held-out coded test data. However, in many classification projects, our goal is to get a model-produced label for a much larger set of data than we could code by hand, so after training and evaluating the model on the coded data, you apply ("extrapolate") it to the remaining uncoded data.
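To make that workflow concrete, here is a minimal sketch, not the notebook's actual code. It assumes a toy corpus of 100 hand-coded documents generated from made-up word lists, a small naive Bayes classifier written with only the standard library, and two hypothetical uncoded documents. All names here (`train_naive_bayes`, `predict`, `coded`, `uncoded`) are illustrative assumptions. In practice you would more likely use a library such as scikit-learn.

```python
import math
import random
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


def train_naive_bayes(docs, labels):
    """Count words per label for a multinomial naive Bayes model."""
    word_counts = defaultdict(Counter)
    vocab = set()
    for doc, label in zip(docs, labels):
        tokens = tokenize(doc)
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return {"word_counts": word_counts,
            "label_counts": Counter(labels),
            "vocab": vocab}


def predict(model, doc):
    """Return the label with the highest log-probability (add-one smoothing)."""
    total_docs = sum(model["label_counts"].values())
    vocab_size = len(model["vocab"])
    best_label, best_score = None, -math.inf
    for label, n_docs in model["label_counts"].items():
        counts = model["word_counts"][label]
        total_words = sum(counts.values())
        score = math.log(n_docs / total_docs)
        for token in tokenize(doc):
            if token in model["vocab"]:  # ignore words never seen in training
                score += math.log((counts[token] + 1) / (total_words + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy "hand-coded" data: 100 documents with labels (purely illustrative).
rng = random.Random(0)
POSITIVE = ["great", "excellent", "wonderful", "enjoyable"]
NEGATIVE = ["terrible", "boring", "awful", "disappointing"]


def make_doc(words):
    return "the film was " + rng.choice(words) + " and " + rng.choice(words)


coded = ([(make_doc(POSITIVE), "pos") for _ in range(50)]
         + [(make_doc(NEGATIVE), "neg") for _ in range(50)])
rng.shuffle(coded)

# 80/20 training/test split of the coded data.
train, test = coded[:80], coded[80:]
model = train_naive_bayes([d for d, _ in train], [y for _, y in train])

# Evaluate on the coded test set, where the true labels are known.
test_accuracy = sum(predict(model, d) == y for d, y in test) / len(test)
print("test accuracy:", test_accuracy)

# "Extrapolate": label documents that nobody coded by hand.
uncoded = ["an excellent and enjoyable story", "a boring and awful script"]
uncoded_labels = [predict(model, d) for d in uncoded]
print(list(zip(uncoded, uncoded_labels)))
```

The key point is that accuracy can only be measured on the coded test set. For the uncoded documents there are no true labels to compare against, so the test-set performance is your only estimate of how trustworthy those labels are.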

The following is not important but may be of interest: "Extrapolation" is kind of a weird word here, so I'm going to edit the notebook for next year. Usually it refers to applying the model to data that does not come from the same distribution or data-generating process as the data the model was trained on. For example, we train a model on 2020 data but then extrapolate to 2021 data. Maybe we have some sense of how the documents are changing over time, or maybe we just hope that the model still works for data from a different distribution. In the Week 6 materials, you will see that this is sometimes called "Prediction" as opposed to the within-distribution "Classification" of Week 5.