As we discussed last Thursday, it could be interesting to train and make predictions on two different subsets of the data: one with many NaN values per row, and one with almost complete rows.
This split could happen directly in the Preprocessor (which would separate the data into two preprocessed files), or in a DataLoader that does the clustering internally and serves the data in two clusters, indicating which cluster each served item comes from (the DataLoader could be an iterator).
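To make the DataLoader option more concrete, here is a minimal sketch of the idea, not a proposed implementation. It assumes rows are plain sequences of floats and that "a lot of NaN" means a NaN fraction above some threshold. The names `nan_fraction`, `split_by_nan_fraction`, `iter_clusters`, the labels `"sparse"`/`"dense"` and the `0.5` threshold are all hypothetical and not taken from the existing code.

```python
import math
from typing import Iterator, List, Sequence, Tuple

Row = Sequence[float]


def nan_fraction(row: Row) -> float:
    """Return the fraction of entries in `row` that are NaN."""
    if not row:
        return 0.0
    return sum(math.isnan(v) for v in row) / len(row)


def split_by_nan_fraction(
    rows: Sequence[Row], threshold: float = 0.5
) -> Tuple[List[Row], List[Row]]:
    """Preprocessor-style split: return (sparse_rows, dense_rows).

    A row is "sparse" when its NaN fraction is strictly above `threshold`
    (hypothetical rule, the real criterion is still to be decided).
    """
    sparse: List[Row] = []
    dense: List[Row] = []
    for row in rows:
        (sparse if nan_fraction(row) > threshold else dense).append(row)
    return sparse, dense


def iter_clusters(
    rows: Sequence[Row], threshold: float = 0.5
) -> Iterator[Tuple[str, Row]]:
    """DataLoader-style iterator: yield (cluster_label, row) pairs in input order."""
    for row in rows:
        label = "sparse" if nan_fraction(row) > threshold else "dense"
        yield label, row


nan = float("nan")
rows = [
    [1.0, nan, 7.0],   # 1/3 NaN
    [nan, nan, nan],   # all NaN
    [3.0, 6.0, 9.0],   # complete
    [nan, 8.0, nan],   # 2/3 NaN
]
sparse_rows, dense_rows = split_by_nan_fraction(rows)
labels = [label for label, _ in iter_clusters(rows)]
```

The first function corresponds to the Preprocessor option (two separate outputs that could be written to two files), the iterator to the DataLoader option (one stream, each item tagged with its cluster). Both rely on the same per-row criterion, so the choice between them is mostly about where the split lives in the pipeline.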
The assignee should choose whichever option they think is best. This issue also covers adapting the model and the pipeline to this new view of the data.
This issue is not a priority as long as #2, #3 and #6 are still open.