UBC-MDS / DSCI_522_Group-308_Used-Cars

This project attempts to build a regression model to predict the price of used cars based on numerous features of the car.
MIT License

Split data before EDA #6

Closed bradentam closed 4 years ago

bradentam commented 4 years ago

We need to change our EDA so that we split the data beforehand (preferably within the script). Hopefully we can write the script such that the data get put into the data folder in correct train/test sets. We could do a 90/10 split since we have a lot of data.
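For illustration, a 90/10 split along these lines could look like the sketch below. It uses only the Python standard library; the function name `split_rows`, the seed value, and the toy rows are hypothetical and not taken from the project, and in practice something like scikit-learn's `train_test_split` would likely be used instead.

```python
import random


def split_rows(rows, test_fraction=0.1, seed=522):
    """Shuffle a copy of `rows` and return (train, test) lists.

    Hypothetical helper: name, default fraction and seed are assumptions.
    """
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(rows)                      # leave the caller's data untouched
    random.Random(seed).shuffle(shuffled)      # fixed seed keeps the split reproducible
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Toy data standing in for the used-cars records
rows = [{"id": i, "price": 1000 + i} for i in range(100)]
train, test = split_rows(rows)
print(len(train), len(test))  # 90 10
```

Fixing the seed means the EDA and the model would see the same train set on every run, as long as both read the files written from this split.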

AndresPitta commented 4 years ago

I think the problem with doing the split in the EDA is that the model is going to have a dependency on the EDA notebook. I'm not sure if we are required to leave all the dependencies for the scripts.

I'm going to change it in the EDA, but I think the correct way is to do it in the upload/download script, @pokrovskyy. Then we can discuss that next week (and ask Firas/Tiffany).

AndresPitta commented 4 years ago

Done, I changed the notebook to have these lines

ksedivyhaley commented 4 years ago

> I think the problem with doing the split in the EDA is that the model is going to have a dependency on the EDA notebook. I'm not sure if we are required to leave all the dependencies for the scripts.
>
> I'm going to change it in the EDA, but I think the correct way is to do it in the upload/download script, @pokrovskyy. Then we can discuss that next week (and ask Firas/Tiffany).

In case you haven't seen it yet, this would go in the second script if we follow the recommended organization for Milestone 2:

A second script that reads the data from the first script and performs any data cleaning/pre-processing, transforming, and/or partitioning that needs to happen before exploratory data analysis or modeling takes place. This should take at least two arguments:

  • a path/filename pointing to the data to be read in
  • a path/filename pointing to where the cleaned/processed/transformed/partitioned data should live
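
As a rough sketch of what such a second script might look like (not the project's actual script), the example below takes the two arguments as positional command-line parameters. The argument names `input_path` and `output_dir`, the output file names `train.csv` and `test.csv`, the optional flags, and the assumption that the raw data is a CSV with a header row are all hypothetical.

```python
import argparse
import csv
import random
from pathlib import Path


def main(argv=None):
    parser = argparse.ArgumentParser(
        description="Split a raw CSV into train/test files before EDA."
    )
    # The two required arguments described above (names are assumptions)
    parser.add_argument("input_path", help="path to the data to be read in")
    parser.add_argument("output_dir", help="folder where train.csv and test.csv are written")
    parser.add_argument("--test-fraction", type=float, default=0.1)
    parser.add_argument("--seed", type=int, default=522)
    args = parser.parse_args(argv)

    # Read every record; assumes the file has a header row
    with open(args.input_path, newline="") as f:
        reader = csv.DictReader(f)
        fieldnames = reader.fieldnames
        rows = list(reader)

    # Seeded shuffle, then slice off the test portion
    random.Random(args.seed).shuffle(rows)
    n_test = round(len(rows) * args.test_fraction)
    parts = {"test.csv": rows[:n_test], "train.csv": rows[n_test:]}

    out_dir = Path(args.output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for name, part in parts.items():
        with open(out_dir / name, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(part)
    return out_dir


# When saved as a script, end the file with:
#     if __name__ == "__main__":
#         main()
```

With this layout, both the EDA notebook and the modeling script would read `train.csv` from the data folder, so neither depends on the other.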