Split data before EDA

bradentam commented 4 years ago

We need to change our EDA so that we split the data beforehand (preferably within the script). Hopefully we can write the script such that the data get put into the data folder in correct train/test sets. We could do a 90/10 split since we have a lot of data.

AndresPitta commented 4 years ago

I think the problem with doing the split in the EDA is that the model is going to have a dependency on the EDA notebook. I'm not sure if we are required to leave all the dependencies for the scripts.

I'm going to change it in the EDA, but I think the correct way is to do it in the upload/download script, @pokrovskyy. Then we can discuss that next week (and ask Firas/Tiffany).

AndresPitta commented 4 years ago

Done, I changed the notebook to have these lines

ksedivyhaley commented 4 years ago

> I think the problem with doing the split in the EDA is that the model is going to have a dependency on the EDA notebook. I'm not sure if we are required to leave all the dependencies for the scripts.
>
> I'm going to change it in the EDA, but I think the correct way is to do it in the upload/download script, @pokrovskyy. Then we can discuss that next week (and ask Firas/Tiffany).

In case you haven't seen it yet, this would go in the second script if we follow the recommended organization for Milestone 2:

A second script that reads the data from the first script and performs any data cleaning/pre-processing, transforming, and/or partitioning that needs to happen before exploratory data analysis or modeling takes place. This should take at least two arguments:

a path/filename pointing to the data to be read in

a path/filename pointing to where the cleaned/processed/transformed/partitioned data should live
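
For illustration only, here is a rough sketch of what such a second script might look like for the 90/10 split discussed above. It is not the project's actual code. The function names, the output file names `train.csv` and `test.csv`, and the seed are all made up for the example, and it sticks to the Python standard library (in practice something like scikit-learn's `train_test_split` could do the shuffling and splitting instead).

```python
import argparse
import csv
import random
import sys
from pathlib import Path


def split_rows(rows, test_fraction=0.1, seed=522):
    """Shuffle a copy of the rows and return (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def write_csv(path, header, rows):
    """Write a header line followed by the given rows."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)


def main(input_path, output_dir, test_fraction=0.1, seed=522):
    """Read the raw CSV, split it, and write train.csv / test.csv (hypothetical names)."""
    with open(input_path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        rows = list(reader)

    train, test = split_rows(rows, test_fraction, seed)

    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    write_csv(out / "train.csv", header, train)
    write_csv(out / "test.csv", header, test)
    return len(train), len(test)


def cli(argv=None):
    """Parse the two path arguments and run the split."""
    parser = argparse.ArgumentParser(description="Split raw data into train/test sets.")
    parser.add_argument("input_path", help="path/filename of the data to read in")
    parser.add_argument("output_dir", help="folder where the partitioned data should live")
    parser.add_argument("--test-fraction", type=float, default=0.1)
    args = parser.parse_args(argv)
    return main(args.input_path, args.output_dir, args.test_fraction)


# Only run when arguments were actually passed on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    n_train, n_test = cli()
    print(f"wrote {n_train} training rows and {n_test} test rows")
```

Saved under a hypothetical name such as `split_data.py`, this would be called with the two paths described above (input file first, output folder second). The EDA notebook and the model would then both read `train.csv` from the data folder, so neither depends on the other.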

UBC-MDS / DSCI_522_Group-308_Used-Cars

Split data before EDA #6