UBC-MDS / DSCI_522_group_401

Medical Expenses Prediction
MIT License

Pre_processing #8

Closed sirine-chahma closed 4 years ago

sirine-chahma commented 4 years ago

As our dataset is already very clean (no missing values, and no outliers due to measurement errors, for instance), I think we should just split the data into a train set and a validation set in the pre_processing file, and create a Jupyter notebook where we explore the data and show that there is not much to do in the "cleaning" step. What do you think?

sreejithmunthikodu commented 4 years ago

I think that is a good idea. To add to that, maybe in the script we can check for missing values and print a message showing how many samples have missing values. Same for outliers too.
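A minimal sketch of what such a check could look like in the preprocessing script, assuming the data is loaded with pandas (the column names and the IQR rule for outliers are illustrative assumptions, not the team's actual choices):

```python
import pandas as pd


def report_missing(df: pd.DataFrame) -> int:
    """Print how many samples contain at least one missing value."""
    n_missing = int(df.isna().any(axis=1).sum())
    print(f"{n_missing} of {len(df)} samples have missing values")
    return n_missing


def report_outliers(df: pd.DataFrame, column: str, k: float = 1.5) -> int:
    """Count values outside k * IQR of the quartiles (a common heuristic)."""
    q1, q3 = df[column].quantile([0.25, 0.75])
    iqr = q3 - q1
    mask = (df[column] < q1 - k * iqr) | (df[column] > q3 + k * iqr)
    n_outliers = int(mask.sum())
    print(f"{n_outliers} potential outliers in '{column}'")
    return n_outliers
```

Running these on the medical expenses data and printing the counts would document, inside the script itself, that no cleaning was needed.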

sirine-chahma commented 4 years ago

But we could check for missing values and outliers in a separate Jupyter notebook, then put a link to that notebook at the beginning of our final report (script 5), where we say that we didn't do any cleaning because there were no outliers and no missing values. Do you think that's enough? I just feel this would be easier and easier to visualize, and I don't see the point of looking for missing data and outliers in two different files.

sreejithmunthikodu commented 4 years ago

I asked on Slack... Tiffany replied that all our code should be in scripts. We should use Jupyter notebook / Rmd only for script 5.

sreejithmunthikodu commented 4 years ago

Could you please use the following setting while preprocessing the data? The hyper-parameters were tuned with the same split: `train_test_split(X, y, test_size=0.3, random_state=123)`. Thanks!
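For reference, a self-contained sketch of the requested split, with dummy arrays standing in for the real features `X` and target `y` (placeholders, not the project's actual data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy data: 50 samples with 2 features (stand-ins for the real dataset)
X = np.arange(100).reshape(50, 2)
y = np.arange(50)

# The exact setting requested, so the split is reproducible across scripts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=123
)
print(X_train.shape, X_test.shape)  # 70% train, 30% test
```

With `test_size=0.3`, scikit-learn takes the ceiling of 30% of the samples for the test set, so 50 samples split into 35 train and 15 test.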

sirine-chahma commented 4 years ago

I don't know if it's going to work even if we use the same random state, because I am splitting the data with R... but I will still do it! :)