Closed: trangdata closed this 4 years ago
Good catch! But I think we should keep those duplicated rows, to stay consistent with the source.
@weixuanfu I disagree. I think we should stay close to the source, but we should also do our best to eliminate potential issues for users when they train models on the data (just like how we decided to remove the 7th column of bupa). We should note in our metadata description that we removed duplicated rows. Also, we have over 1 million rows for these datasets; removing ~2000 should still leave us with plenty of records, right?
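For reference, a minimal pandas sketch of the deduplication I have in mind (the file name below is just a placeholder):

```python
import pandas as pd

# Load one of the affected datasets (placeholder file name).
df = pd.read_csv("dataset.tsv.gz", sep="\t", compression="gzip")

# Count fully duplicated rows (every column identical, including the target).
n_dups = df.duplicated().sum()
print(f"{n_dups} duplicated rows out of {len(df)} total")

# Keep the first occurrence of each duplicated row.
deduped = df.drop_duplicates().reset_index(drop=True)
```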
I think this case is different from the 7th column of bupa. I agree that ~2000 rows are duplicated records, but handling them could be a meaningful challenge for some ML algorithms. We could add an option to drop duplicated rows, like the drop_na option we added.
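Something like this sketch, maybe (`fetch_data` here is just a stand-in for however we load the cached frame, and the flag name is up for discussion):

```python
import pandas as pd

def fetch_data(dataset_name: str, drop_na: bool = False,
               drop_duplicates: bool = False) -> pd.DataFrame:
    """Hypothetical loader with an opt-in drop_duplicates flag,
    mirroring the existing drop_na option."""
    # Placeholder for however the cached dataset is actually read.
    df = pd.read_csv(f"{dataset_name}.tsv.gz", sep="\t", compression="gzip")
    if drop_na:
        df = df.dropna()
    if drop_duplicates:
        df = df.drop_duplicates().reset_index(drop=True)
    return df
```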
> handling them could be a meaningful challenge for some ML algorithms
I don't think this is a "challenge" for algorithms. If some of these duplicates happen to land in both the training and test sets, we get data leakage: the model is evaluated on rows it has already seen, so the test score overstates how well it generalizes.
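A quick toy sketch of the failure mode:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame: the first four rows are exact duplicates of each other,
# so any 50/50 split must place copies of that row on both sides.
df = pd.DataFrame({"x": [1, 1, 1, 1, 2, 3],
                   "y": [0, 0, 0, 0, 1, 1]})

train, test = train_test_split(df, test_size=0.5, random_state=0)

# Distinct rows that appear in both splits: the model is tested on
# examples it was trained on, so the test score is inflated.
shared = pd.merge(train.drop_duplicates(), test.drop_duplicates(), how="inner")
print(f"{len(shared)} duplicated row(s) straddle the train/test split")
```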
After our discussion today, we agree that we cannot verify the nature of the duplicated rows. Therefore, we will keep them in the dataset. However, I will make a note in the description field of the metadata.
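Roughly what I plan to do for each affected dataset (assuming our usual `datasets/<name>/metadata.yaml` layout; adjust the path and field names as needed):

```python
import yaml

path = "datasets/some_dataset/metadata.yaml"  # placeholder path

with open(path) as f:
    meta = yaml.safe_load(f)

# Append a note about the duplicated rows to the description field.
note = ("Note: this dataset contains fully duplicated rows, "
        "kept as in the original source.")
meta["description"] = meta.get("description", "") + "\n\n" + note

with open(path, "w") as f:
    yaml.safe_dump(meta, f, sort_keys=False)
```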
@weixuanfu Could you help resolve the conflicts and add one missing row, please?
OK, I will fix that.
Hmm, I think we should remove the deploy step in this PR; it's the main reason for those conflicts.
> Hmm, I think we should remove the deploy step in this PR; it's the main reason for those conflicts.
Oh I see. OK.
Google Colab notebook
Notes from the notebook: