Closed: techrah closed this issue 4 years ago
I can use the `createDataPartition` function (from the caret package) in R to split the data so that the same number of observations is selected from each region, I think. I will play around with it and get back to this issue.
After our final discussion regarding this, I think we're good on the split.
@katieb1 @andrealee011
RE: #5
I just took a look at the raw data and noticed that the rows are grouped by region. Each region has 52 observations, numbered 0 to 51. I think we therefore need to do stratified sampling by region in order to get a representative test set; otherwise some regions could end up over- or under-represented in it.
Also, since we will now be hosting the data in our own repository, we can keep the training and test data separated, each set in its own file. This will prevent us from accidentally using the test data.
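To make the proposal concrete, here is a minimal sketch of a per-region split followed by writing each set to its own file. It uses base R only, and everything specific in it is an assumption for illustration: the toy data frame, the column names `region`, `obs` and `value`, the 20% test fraction, and the file names `train.csv` and `test.csv`. A package helper could replace the sampling step.

```r
# Toy stand-in for the raw data: 3 hypothetical regions, 52 rows each (0-51).
set.seed(42)  # fixed seed so the split is reproducible
regions <- c("RegionA", "RegionB", "RegionC")
raw <- data.frame(
  region = rep(regions, each = 52),
  obs    = rep(0:51, times = length(regions)),
  value  = rnorm(52 * length(regions))
)

test_frac <- 0.2  # assumed test proportion, not a value agreed on in this thread

# Group row indices by region, then draw the same fraction from each group.
rows_by_region <- split(seq_len(nrow(raw)), raw$region)
test_idx <- unlist(lapply(rows_by_region, function(idx) {
  n_test <- round(length(idx) * test_frac)
  idx[sample.int(length(idx), size = n_test)]
}), use.names = FALSE)

train <- raw[-test_idx, ]
test  <- raw[test_idx, ]

# Keep the two sets in separate files (hypothetical names, temp dir for the demo).
out_dir <- tempdir()
train_path <- file.path(out_dir, "train.csv")
test_path  <- file.path(out_dir, "test.csv")
write.csv(train, train_path, row.names = FALSE)
write.csv(test,  test_path,  row.names = FALSE)
```

Running `table(test$region)` afterwards is a quick check that every region contributes the same number of test rows. With the real data, `raw` would come from reading the raw file instead of being generated, and the output paths would point into the repository.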
Thoughts on this?