UBC-MDS / DSCI_522_Group411

Avocado Price Predictors
MIT License

Regarding train/test split #7

Closed by techrah 4 years ago

techrah commented 4 years ago

@katieb1 @andrealee011

RE: #5

I just took a look at the raw data and noticed that the rows are grouped by region. Each region has 52 observations, numbered 0 to 51. I think we therefore need to do stratified sampling by region so that every region is represented in the test set.
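To make the idea concrete, here is a minimal sketch of a per-region stratified split in plain Python. It is only an illustration, not the split we settled on: the function name `stratified_split`, the 20% test fraction, the seed, and the placeholder region names are all assumptions, and the toy rows just mimic the "52 observations per region, numbered 0 to 51" layout.

```python
import random
from collections import defaultdict


def stratified_split(rows, key, test_frac=0.2, seed=522):
    """Split rows into (train, test), drawing test_frac of every group.

    `rows` is a list of dicts; `key` names the column to stratify on.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible

    # Bucket the rows by the value of the stratification column.
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    train, test = [], []
    for name in sorted(groups):        # sorted for a deterministic order
        members = list(groups[name])   # copy so the input is not mutated
        rng.shuffle(members)
        n_test = round(len(members) * test_frac)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


# Toy data: three placeholder regions with 52 rows each, numbered 0 to 51.
regions = ("region_a", "region_b", "region_c")
rows = [{"region": r, "obs": i} for r in regions for i in range(52)]

train, test = stratified_split(rows, key="region")
```

Because the sampling happens inside each region, every region contributes the same share of its rows to the test set, which a plain head/tail split of region-grouped rows would not guarantee.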

Also, since we will now be hosting the data in our own repository, we can keep the training and test data separate, each set in its own file. This will help prevent us from accidentally using the test data.
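As a rough sketch of the "one file per set" idea, assuming CSV output: the helper name `write_split`, the file names `train.csv` and `test.csv`, and the toy rows are hypothetical, and a temporary directory stands in for wherever the data would live in the repository.

```python
import csv
import tempfile
from pathlib import Path


def write_split(train, test, out_dir, fieldnames):
    """Write the two sets to out_dir/train.csv and out_dir/test.csv."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    paths = {}
    for name, rows in (("train", train), ("test", test)):
        path = out_dir / f"{name}.csv"
        # newline="" lets the csv module control line endings itself.
        with path.open("w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(rows)
        paths[name] = path
    return paths


# Toy rows for one placeholder region, already split 42 / 10.
train = [{"region": "region_a", "obs": i} for i in range(42)]
test = [{"region": "region_a", "obs": i} for i in range(42, 52)]

out_dir = tempfile.mkdtemp()  # stand-in for a data directory in the repo
paths = write_split(train, test, out_dir, fieldnames=["region", "obs"])
```

With this layout, analysis scripts only ever need to open the training file, and the test file is read once at evaluation time.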

Thoughts on this?

katieb1 commented 4 years ago

I think I can use the `createDataPartition` function from the caret package in R to split the data so that the number of observations selected from each region is even. I will play around with it and report back on this issue.

techrah commented 4 years ago

After our final discussion regarding this, I think we're good on the split.