EpistasisLab / pmlb

PMLB: A large, curated repository of benchmark datasets for evaluating supervised machine learning algorithms.
https://epistasislab.github.io/pmlb/
MIT License
805 stars 135 forks

Add metadata for poker #60

Closed: trangdata closed 4 years ago

trangdata commented 4 years ago

Google Colab notebook

Notes from the notebook:

weixuanfu commented 4 years ago

Good catch! But I think we should keep those duplicated rows, as the source does.

trangdata commented 4 years ago

@weixuanfu I disagree. I think we should stay close to the source but also try our best to eliminate potential issues for users when they train models on the data (just as we decided to remove the 7th column of bupa). We should note in our metadata description that we removed duplicated rows. Also, we have over 1 million rows for these datasets. Removing ~2000 should still leave us with plenty of records, right?

weixuanfu commented 4 years ago

I think this case is different from the 7th column of bupa. I agree that ~2000 rows are duplicated records, but I think they should challenge some ML algorithms to handle them. I think we can add an option to drop duplicated rows, like the drop_na option we added.
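As a rough illustration of what such an opt-in flag could look like, here is a minimal sketch in plain Python. The function name `prepare_rows` and the `drop_duplicates` parameter are hypothetical and are not part of PMLB; `drop_na` only mirrors the option name mentioned above. Rows are modeled as tuples, with `None` standing in for a missing value.

```python
from typing import List, Optional, Sequence, Tuple

Row = Tuple[Optional[float], ...]


def prepare_rows(
    rows: Sequence[Row],
    drop_na: bool = True,
    drop_duplicates: bool = False,  # hypothetical flag, off by default
) -> List[Row]:
    """Return a cleaned copy of `rows` according to the two flags."""
    result = list(rows)
    if drop_na:
        # Keep only rows with no missing values.
        result = [r for r in result if all(v is not None for v in r)]
    if drop_duplicates:
        # dict.fromkeys keeps the first occurrence of each row, in order.
        result = list(dict.fromkeys(result))
    return result


# Toy data: the first two rows are exact duplicates.
toy = [(1, 10, 0), (1, 10, 0), (2, 5, 1), (3, None, 1)]

kept = prepare_rows(toy, drop_na=False)
deduped = prepare_rows(toy, drop_na=False, drop_duplicates=True)
print(len(kept), len(deduped))  # 4 3
```

Leaving the flag off by default keeps the data identical to the source unless the user asks otherwise. With a pandas DataFrame, the equivalent step would be `DataFrame.drop_duplicates()`.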

trangdata commented 4 years ago

> they should challenge some ML algorithms to handle them

I don't think this is a "challenge" for algorithms. If some of these duplicates happen to be in both training and testing sets, we have a case of overfitting.
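To make the concern concrete, here is a small sketch that counts rows appearing in both halves of a random split. Everything in it is illustrative: the toy data, the `split` and `leaked_rows` helpers, and the 80/20 ratio are assumptions, not anything taken from the poker dataset or from PMLB.

```python
import random
from typing import List, Sequence, Set, Tuple

Row = Tuple[int, ...]


def split(rows: Sequence[Row], test_fraction: float, seed: int) -> Tuple[List[Row], List[Row]]:
    """Shuffle a copy of `rows` and return (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def leaked_rows(train: Sequence[Row], test: Sequence[Row]) -> Set[Row]:
    """Rows that occur, identically, in both the training and the test set."""
    return set(train) & set(test)


# Toy data: 100 distinct rows plus 10 exact copies of the first 10.
unique = [(i, i % 4) for i in range(100)]
rows = unique + unique[:10]

train, test = split(rows, 0.2, seed=0)
print("rows shared by train and test:", len(leaked_rows(train, test)))

# De-duplicating before the split rules out any shared rows.
train_d, test_d = split(list(dict.fromkeys(rows)), 0.2, seed=0)
print("after de-duplicating first:", len(leaked_rows(train_d, test_d)))  # 0
```

Any row that the first count reports was seen during training and then scored again at test time, which can make test performance look better than it is.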

trangdata commented 4 years ago

After our discussion today, we agreed that we cannot verify the nature of the duplicated rows. Therefore, we will keep them in the dataset. However, I will make a note of them in the description field of the metadata.

trangdata commented 4 years ago

@weixuanfu Could you help resolve the conflicts and add one missing row, please?

weixuanfu commented 4 years ago

OK, I will fix that.

weixuanfu commented 4 years ago

Hmm, I think we should remove the deploy step from the PR, which is the main cause of those conflicts.

trangdata commented 4 years ago

> Hmm, I think we should remove the deploy step from the PR, which is the main cause of those conflicts.

Oh I see. OK.