NathanRxl / bnp-cardif-challenge

After deadline Kaggle Competition, Master's Data Science, 2017

Cluster the train and test data into 2 different clusters #10

Open NathanRxl opened 7 years ago

NathanRxl commented 7 years ago

As we discussed last Thursday, it could be interesting to train and make predictions separately on two subsets of the data: one with many NaN values per row, and one with almost complete rows.
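To make the idea concrete, here is a minimal sketch of such a split with pandas. The function name `split_by_nan_count`, the `max_nan_per_row` threshold and the toy columns are assumptions for illustration only; nothing here reflects existing code in the repository, and the right threshold would have to be picked by looking at the actual NaN distribution.

```python
import numpy as np
import pandas as pd


def split_by_nan_count(df, max_nan_per_row):
    """Return (dense, sparse): rows with at most / more than max_nan_per_row NaN."""
    nan_per_row = df.isna().sum(axis=1)  # number of missing values in each row
    is_sparse = nan_per_row > max_nan_per_row
    return df[~is_sparse], df[is_sparse]


# Toy data standing in for the real train/test sets (hypothetical columns).
toy = pd.DataFrame({
    "v1": [1.0, np.nan, 3.0, np.nan],
    "v2": [0.5, np.nan, 0.1, np.nan],
    "v3": [2.0, np.nan, np.nan, 4.0],
})

dense, sparse = split_by_nan_count(toy, max_nan_per_row=1)
```

The same function would be applied to both the train and the test set with the same threshold, so that each test row is predicted by the model trained on the matching subset.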

This operation could take place directly in the Preprocessor (which would split the data into two separate preprocessed files), or in a DataLoader, which would do the clustering internally and serve the data in two clusters, indicating which cluster each chunk of data comes from (the DataLoader could be an iterator).
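A rough sketch of the DataLoader option, assuming a pandas DataFrame as input. The class name `ClusteredDataLoader`, its constructor arguments and the cluster labels `"dense"` / `"sparse"` are hypothetical and only illustrate the iterator idea; the project's real DataLoader interface may look quite different.

```python
import numpy as np
import pandas as pd


class ClusteredDataLoader:
    """Hypothetical loader that serves the data as (cluster_name, rows) pairs."""

    def __init__(self, df, max_nan_per_row):
        self.df = df
        self.max_nan_per_row = max_nan_per_row

    def __iter__(self):
        # Cluster internally on the number of NaN values per row.
        is_sparse = self.df.isna().sum(axis=1) > self.max_nan_per_row
        yield "dense", self.df[~is_sparse]
        yield "sparse", self.df[is_sparse]


# Toy data standing in for the real dataset (hypothetical columns).
toy_frame = pd.DataFrame({
    "v1": [1.0, np.nan, 3.0],
    "v2": [0.5, np.nan, np.nan],
})

loader = ClusteredDataLoader(toy_frame, max_nan_per_row=0)
served = {name: len(rows) for name, rows in loader}
```

With this shape, the pipeline can loop over the loader and pick the model to train or apply from the cluster name, without writing two preprocessed files to disk.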

The assignee should choose whichever option seems best from their point of view. This issue also covers adapting the model and the pipeline to this new view of the data. It is not a priority as long as #2, #3 and #6 are still open.

NathanRxl commented 7 years ago

bbattino worked on this issue but did not take the time to integrate it into master. Since the challenge ends this week, we probably won't do it.