autogluon / tabrepo

`dataset` or `tid` for public API #12

Closed Innixma closed 8 months ago

Innixma commented 1 year ago

A general question: do we want to use dataset or tid as the public API, or support both?

dataset = 'abalone'
tid     = 359946

I think either can be used as the primary key, since the repo will crash if there are duplicates (i.e. multiple tids mapping to the same dataset, or multiple datasets mapping to the same tid).
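
As a rough illustration of the duplicate check described above, here is a minimal sketch that rejects any mapping that is not one-to-one. The function name `check_one_to_one` and the list-of-pairs input are assumptions made for this example, not tabrepo's actual API.

```python
from collections import defaultdict


def check_one_to_one(pairs):
    """Raise ValueError unless every dataset maps to exactly one tid and vice versa.

    `pairs` is an iterable of (dataset, tid) tuples (hypothetical input format).
    """
    tids_by_dataset = defaultdict(set)
    datasets_by_tid = defaultdict(set)
    for dataset, tid in pairs:
        tids_by_dataset[dataset].add(tid)
        datasets_by_tid[tid].add(dataset)

    # Collect every key that is associated with more than one partner.
    dup_datasets = {d: sorted(t, key=str) for d, t in tids_by_dataset.items() if len(t) > 1}
    dup_tids = {t: sorted(d, key=str) for t, d in datasets_by_tid.items() if len(d) > 1}

    if dup_datasets or dup_tids:
        raise ValueError(
            f"mapping is not one-to-one: datasets with several tids={dup_datasets}, "
            f"tids with several datasets={dup_tids}"
        )


# The pair from this issue passes the check.
check_one_to_one([("abalone", 359946)])
```

With a check like this in place, a lookup by either key is unambiguous, which is why either one could serve as the primary key.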

I’m leaning towards logic where tid is the primary key, and if dataset doesn’t exist, we just map them so that dataset == tid. tid could be a str or an int, so as not to tie us directly to OpenML.
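
A minimal sketch of that fallback, assuming the name mapping is a plain dict. The helper `resolve_dataset` and the key `"my_kaggle_data"` are hypothetical and only illustrate the idea.

```python
def resolve_dataset(tid, tid_to_dataset):
    """Return the dataset name registered for `tid`, or `tid` itself if none exists.

    `tid` may be a str or an int. Note that the fallback returns it unchanged,
    so the resulting "dataset" has whatever type the tid had.
    """
    return tid_to_dataset.get(tid, tid)


tid_to_dataset = {359946: "abalone"}

# A tid with a registered name resolves to that name.
assert resolve_dataset(359946, tid_to_dataset) == "abalone"

# A tid without a name falls back to dataset == tid.
assert resolve_dataset("my_kaggle_data", tid_to_dataset) == "my_kaggle_data"
```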

geoalgo commented 1 year ago

Supporting both makes sense to me, given that tid is a concept tied to OpenML (we may have datasets with unique names that do not have a tid, for instance some coming from Kaggle).

Given that tid only works for OpenML, I think it would make more sense to have dataset as the primary key. I'm also not sure about having dataset == tid, given that one is a string and the other is an integer.
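
To make the alternative concrete, here is a sketch of an index keyed by dataset name with tid as an optional attribute. `DatasetEntry`, `build_index`, and the name `"some_kaggle_dataset"` are invented for this example and are not taken from tabrepo.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DatasetEntry:
    dataset: str                # primary key: always a string
    tid: Optional[int] = None   # OpenML task id, absent for non-OpenML data


def build_index(entries):
    """Index entries by dataset name, rejecting duplicate names."""
    index = {}
    for entry in entries:
        if entry.dataset in index:
            raise ValueError(f"duplicate dataset name: {entry.dataset!r}")
        index[entry.dataset] = entry
    return index


index = build_index([
    DatasetEntry("abalone", tid=359946),
    DatasetEntry("some_kaggle_dataset"),  # hypothetical dataset with no tid
])
```

Keeping dataset as a string and tid as a separate optional field avoids mixing types in the key.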

Innixma commented 8 months ago

Updated to use dataset as the primary key.