EpistasisLab / pmlb

PMLB: A large, curated repository of benchmark datasets for evaluating supervised machine learning algorithms.
https://epistasislab.github.io/pmlb/
MIT License

molecular_biology_promoters issues #91

Closed lacava closed 4 years ago

lacava commented 4 years ago
trangdata commented 4 years ago

instance still appears in the pandas profiling report

You can see the updated report currently on gh-pages here. Even though everything was deployed correctly, the site is not being built. I think we have reached the repo size limits.

[Image: site-build-error]

I'll troubleshoot this, but I think we need to think long term about tracking with Git LFS and other ways to reduce the repo size.

Target description is wrong; 0 corresponds to promoters. These labels might need to be flipped to match the source.

Did you mean this? Isn't this correct?

lacava commented 4 years ago

description: Positive class indicates a promoter. code: -:1, +:0 (promoter)

In PMLB, 0 indicates a promoter. In the original data, class labels were (+,-), with + indicating promoter. It seems more in line with that encoding to have "1" indicate a promoter. In either case the description in the metadata is wrong: the positive class in PMLB indicates NOT a promoter.

trangdata commented 4 years ago

Hmm not sure if I'm missing something completely here...

Currently, in PMLB, 0 = + = promoter (both metadata and data). In the original data, + = promoter.

I agree it's more conventional to encode 1 as promoter. We can flip that. But I don't think what we have currently is wrong.

lacava commented 4 years ago

The description in metadata.yaml reads "Positive class indicates promoter". While this is true for the source data, for the PMLB data, 0 indicates promoter. So I'm suggesting we update the description.

trangdata commented 4 years ago

Ah, I read this "positive" as literal "+". Let's go ahead and recode the target then.
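
For reference, a minimal sketch of what recoding the target could look like, assuming the target is a plain binary 0/1 column with the current encoding described above (+:0, -:1). The names `SOURCE_TO_CURRENT`, `SOURCE_TO_PROPOSED`, and `recode_target` are hypothetical and only for illustration; this is not the actual PMLB tooling.

```python
# Hypothetical illustration of the recoding discussed in this thread.
# Current PMLB encoding per the discussion: "+" (promoter) -> 0, "-" -> 1.
SOURCE_TO_CURRENT = {"+": 0, "-": 1}
# Proposed encoding: "+" (promoter) -> 1, "-" -> 0.
SOURCE_TO_PROPOSED = {"+": 1, "-": 0}


def recode_target(labels):
    """Flip a binary 0/1 target so that 0 becomes 1 and 1 becomes 0."""
    if any(y not in (0, 1) for y in labels):
        raise ValueError("expected a binary 0/1 target")
    return [1 - y for y in labels]


# Toy source labels (made up for the example, not taken from the dataset).
source = ["+", "-", "+", "-"]
current = [SOURCE_TO_CURRENT[s] for s in source]
recoded = recode_target(current)

# After flipping, the target agrees with the proposed "1 = promoter" encoding.
assert recoded == [SOURCE_TO_PROPOSED[s] for s in source]
```

With the target flipped this way, "Positive class indicates promoter" in metadata.yaml would hold for the PMLB data as well as the source data.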