sirine-chahma closed 4 years ago
I think that is a good idea. To add to that, maybe the script can check for missing values and print a message showing how many samples have them. The same goes for outliers.
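A minimal sketch of what such a check could look like in a Python script, assuming the data is loaded into a pandas DataFrame. The function name report_data_quality and the toy columns are hypothetical, and the 1.5 * IQR rule is only one possible definition of an outlier; the thread does not say which one the project uses.

```python
import pandas as pd


def report_data_quality(df):
    """Print and return how many rows have missing values or outliers.

    Assumption: an outlier is a numeric value outside
    [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR] for its column.
    """
    # Rows with at least one missing value in any column
    n_missing_rows = int(df.isna().any(axis=1).sum())

    # Apply the IQR rule column by column on the numeric columns only
    numeric = df.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    outside = (numeric < q1 - 1.5 * iqr) | (numeric > q3 + 1.5 * iqr)
    n_outlier_rows = int(outside.any(axis=1).sum())

    print(f"{n_missing_rows} sample(s) have missing values")
    print(f"{n_outlier_rows} sample(s) have outliers (1.5 * IQR rule)")
    return n_missing_rows, n_outlier_rows


if __name__ == "__main__":
    # Toy data for illustration only, not the project dataset
    toy = pd.DataFrame({
        "x": [1.0, 2.0, None, 4.0, 100.0, 3.0],
        "y": [1.0, 1.0, 2.0, 2.0, 1.0, None],
    })
    report_data_quality(toy)
```

Returning the counts as well as printing them lets the script decide later whether any cleaning step is needed.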
But we could check for missing values and outliers in a separate Jupyter notebook file. Then we can put a link to this notebook at the beginning of our final report (script 5), where we say that we didn't do any cleaning because there were no outliers and no missing values. Do you think that's enough? I just feel that this would be easier to do and easier to visualize, and I don't see the point of looking for missing data and outliers in two different files.
I asked on Slack... Tiffany replied that all our code should be in the scripts. We should use Jupyter notebook / Rmd only for script 5.
Could you please use the following setting when preprocessing the data? The hyperparameters were tuned with this same split: train_test_split(X, y, test_size=0.3, random_state=123). Thanks!
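For reference, a small sketch of what that call does in scikit-learn. X and y here are toy placeholders, not the project data; only test_size=0.3 and random_state=123 come from the request above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy placeholders: 10 samples with 2 features each
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# 30% of the rows go to the test set; the fixed seed makes the
# shuffle repeatable across runs of this Python code
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=123
)

# Calling it again with the same seed gives the same partition
X_train2, X_test2, y_train2, y_test2 = train_test_split(
    X, y, test_size=0.3, random_state=123
)

print(len(X_train), len(X_test))
print(np.array_equal(y_test, y_test2))
```

Note that random_state only guarantees a repeatable split inside scikit-learn; it says nothing about which rows a different tool would pick with the same seed value.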
I don't know if it's going to work even with the same random state, because I am splitting the data with R... but I will still do it! :)
As our dataset is already very clean (no missing values, and no outliers caused by measurement error, for instance), I think we should just split the data into a train set and a validation set in the pre_processing file, and create a Jupyter notebook file where we explore the data and show that there is not much to do in the "cleaning" step. What do you think?