ajzwu8 opened 8 years ago
Dear ajzwu8: Thank you for your sincere advice.
For the preliminary analysis, our plots show how much the proportion of Y=1 varies across the categories of each feature Xj. In this way, we can find the features Xj that have a significant influence on the target Y. We made 16 plots, one per feature, but due to the 3-page limit we included only two typical variables: one with significant variance and one with insignificant variance. Does that make sense?
Since all the features Xj and the target variable Y are categorical, we can hardly think of plots other than bar plots of counts or frequencies within each category. Could you recommend other plots for this kind of data?
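The per-category proportion described above can be sketched as follows (a minimal sketch assuming pandas; the column names and toy data are hypothetical):

```python
import pandas as pd

# Hypothetical categorical feature Xj and binary target Y
df = pd.DataFrame({
    "Xj": ["a", "a", "b", "b", "b", "c", "c", "c"],
    "Y":  [1,   0,   1,   1,   0,   0,   0,   0],
})

# Proportion of Y=1 within each category of Xj
prop = df.groupby("Xj")["Y"].mean()

# A feature whose categories show widely varying proportions is a
# candidate for "significant variance" in the sense used above;
# prop.plot.bar() would give the corresponding bar plot
spread = prop.max() - prop.min()
```

A stacked or normalized bar plot of these proportions (e.g. via `prop.plot.bar()`) is one common way to visualize categorical-vs-categorical relationships.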
Of course, if needed, we can provide more plots as our analysis deepens in November.
This mid-term report requires only preliminary analysis, so we ran only a baseline model and left the remaining work for November; otherwise, the project would be close to finished.
In addition, on Oct. 28th we found that our original goal did not make sense, so we changed it and had very limited time to restart. Since Naive Bayes is one of the simplest classification algorithms that can use categorical features directly, without one-hot encoding, we adopted it as a starting point. Certainly, we will adopt and compare other classification methods in November.
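A minimal sketch of Naive Bayes on categorical features, assuming scikit-learn's `CategoricalNB` (the toy feature values and labels are hypothetical, not the project's actual data):

```python
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical categorical features; CategoricalNB expects
# non-negative integer codes rather than raw strings, so the
# categories are integer-encoded first (no one-hot needed)
X_raw = [["red", "small"], ["red", "large"], ["blue", "small"],
         ["blue", "large"], ["red", "small"], ["blue", "large"]]
y = [1, 1, 0, 0, 1, 0]

enc = OrdinalEncoder()
X = enc.fit_transform(X_raw)

# Fit the categorical Naive Bayes baseline and predict one example
clf = CategoricalNB()
clf.fit(X, y)
pred = clf.predict(enc.transform([["red", "small"]]))
```

Each feature's category counts are modeled directly, which is why no one-hot encoding is required.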
BTW, the target variable Y is categorical, so this is a classification problem, whereas linear regression is for a continuous-valued target Y.
Thank you. Best, Ziyi
Overall, the steps of the project so far and how the team is doing are clearly conveyed in the report. Everything is to the point, and the visuals are very helpful for understanding the different components of your data. However, there may be a bit more to show here, perhaps plots that tie more directly into the analysis.
Also, why Naive Bayes? I would have liked to see more justification for this type of model. I think a linear model would also produce some interesting results, perhaps with a regularization that promotes sparsity in order to single out features. Also, does your cross-validation properly distribute the small number of positive labels when resampling again and again? If not, this may skew your model. It does, however, seem from your future work that you are looking into ways to address it.
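The cross-validation concern above can be sketched with stratified splitting (a minimal sketch assuming scikit-learn; the label counts are hypothetical):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical imbalanced labels: 4 positives out of 20 samples
y = np.array([1, 1, 1, 1] + [0] * 16)
X = np.zeros((20, 1))  # feature values are irrelevant to the split itself

# StratifiedKFold preserves the class ratio in every fold, so the
# rare positives are spread evenly instead of landing in one fold
skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
pos_per_fold = [int(y[test_idx].sum()) for _, test_idx in skf.split(X, y)]
```

With plain (unstratified) K-fold, a fold could easily contain zero positives, which is the skew described above.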
Best of luck!