Ji-Zhang / datacleanbot


AssertionError: one hot encoding bug on Ames housing dataset #3

Open · amueller opened this issue 5 years ago

amueller commented 5 years ago

I tried to run discover_types on the numeric part of the Ames housing dataset and got an assertion error:

import pandas as pd
import dataclean as dc

# Load the Ames housing data and keep only the non-object (numeric) columns.
ames = pd.read_excel("http://www.amstat.org/publications/jse/v19n3/decock/AmesHousing.xls")
numbers = ames.loc[:, ames.dtypes != "object"]

# Run statistical type discovery on the numeric columns.
dc.discover_types(numbers)

AssertionError: one hot encoding bug [0 1 2 3] [1. 0. 0. ... 0. 1. 0.] [1. 1. 1. ... 1. 1. 1.]

I'm not sure I understand what that means or how to fix it.

Ji-Zhang commented 5 years ago

The Bayesian model is not very stable. The bug is caused by missing values in the data; however, it persists even after I impute them. Strangely, when I detect the statistical data types column by column, the assertion does not appear. So, for now, I have fixed this bug with the column-by-column workaround.
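Roughly, the workaround looks like this (a hypothetical sketch; the actual code inside datacleanbot differs):

import dataclean as dc

# Hypothetical wrapper: feed the Bayesian model one column at a time
# instead of the whole frame, which avoided the assertion in my tests.
def discover_types_per_column(df):
    results = {}
    for col in df.columns:
        results[col] = dc.discover_types(df[[col]])
    return results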

amueller commented 5 years ago

Thanks. That's indeed a bit strange. But will the model do the same thing if you run it column by column? It uses a latent factor model, and the point is to include all columns, right?

Ji-Zhang commented 5 years ago

> Thanks. That's indeed a bit strange. But will the model do the same thing if you run it column by column? It uses a latent factor model, and the point is to include all columns, right?

I think it is OK to do so since, according to the paper by Isabel, 'the attributes describing the objects in a dataset are assumed to be independent'.

The key idea of the Bayesian model is that each attribute (feature/column) can be expressed as a mixture of likelihood functions, one for each considered data type, where the inferred weight associated with a likelihood function captures the probability of the attribute belonging to the corresponding data type. For example, the attribute 'gender' can be 'categorical' or 'real-valued', so the likelihood model of 'gender' can be expressed as a combination of the likelihood function of the 'categorical' type and the likelihood function of the 'real-valued' type. Since the attributes in the dataset are assumed independent, the probabilistic model of the whole dataset can be represented by the product of the likelihood models of the individual attributes.
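A minimal sketch of that idea (my own illustration, not datacleanbot's actual code), using a Gaussian as the 'real-valued' likelihood and empirical frequencies as the 'categorical' likelihood:

import numpy as np
from scipy import stats

def column_log_likelihood(x, w_real, w_cat):
    # 'real-valued' model: a Gaussian fit to the column.
    real_ll = stats.norm.logpdf(x, loc=x.mean(), scale=x.std() + 1e-9).sum()
    # 'categorical' model: empirical frequencies of the distinct values.
    _, counts = np.unique(x, return_counts=True)
    cat_ll = (counts * np.log(counts / counts.sum())).sum()
    # Mixture: log(w_real * L_real + w_cat * L_cat), computed stably in log space.
    return np.logaddexp(np.log(w_real) + real_ll, np.log(w_cat) + cat_ll)

Under the independence assumption, the log-likelihood of the whole dataset would then be the sum of these per-column terms.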

I hope it helps.

amueller commented 5 years ago

I think you're misreading this; it says:

> Then, given the latent low-rank representation of the data, the attributes describing the objects in a dataset are assumed to be independent, i.e.,

That's very different from what you said. It means that there is a common low-rank representation and that, conditioned on it, the variables are independent; basically, it says the noise is independent. From my understanding, the whole point of the paper is to find this latent representation. cc @joaquinvanschoren
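To spell it out (my paraphrase, not a quote from the paper): with a low-rank latent vector z_n for each object n, the model assumes

p(x_n1, ..., x_nD | z_n) = p(x_n1 | z_n) * ... * p(x_nD | z_n)

so the attributes are independent only conditionally on z_n; marginally they remain dependent through the shared z_n.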

Ji-Zhang commented 5 years ago

There are two key ideas in the Bayesian model. The first is the low-rank representation and the second is what I said previously. Quoting the paper:

> The proposed method is based on probabilistic modeling and exploits the following key ideas: i) There exists a latent structure in the data that captures the statistical dependencies among the different objects and attributes in the dataset. Here, as in standard latent feature modeling, we assume that we can capture this structure by a low-rank representation, such that conditioning on it, the likelihood model factorizes for both number of objects and attributes. ii) The observation model for each attribute can be expressed as a mixture of likelihood models, one per each considered data type, where the inferred weight associated to a likelihood model captures the probability of the attribute belonging to the corresponding data type. We derive an efficient MCMC inference algorithm to jointly infer both the low-rank representation and the weight of each likelihood model for each attribute in the observed data.
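A hypothetical generative sketch of the two key ideas quoted above (illustrative only; the paper's actual likelihood models and MCMC inference differ):

import numpy as np

rng = np.random.default_rng(0)
N, D, K, T = 100, 5, 3, 2           # objects, attributes, latent rank, candidate types

Z = rng.normal(size=(N, K))         # i) low-rank latent representation of the objects
B = rng.normal(size=(K, D, T))      # per-attribute, per-type factor loadings
W = rng.dirichlet(np.ones(T), D)    # ii) per-attribute type weights (inferred in the paper)

# Given Z, the likelihood factorizes over attributes; each attribute draws its
# data type from its weight vector and generates data from that type's model.
types = np.array([rng.choice(T, p=W[d]) for d in range(D)])
X = np.stack([Z @ B[:, d, types[d]] + rng.normal(scale=0.1, size=N)
              for d in range(D)], axis=1)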

Both the low-rank representation and the weight of each likelihood model need to be inferred. But indeed, I am not 100% sure whether the model will do the same column by column; maybe we can forward this question to Isabel? @joaquinvanschoren

amueller commented 5 years ago

I think it's pretty clear that it won't do the same if you remove i), because it's a key component. We could try how much it changes in practice, since whether it matters in practice is less clear. If it doesn't change much, the whole model becomes much easier: you could basically remove all the dependencies and replace it with a much, much simpler model. I would assume Isabel tried that first, or did some comparison against it, since it's a pretty trivial baseline.
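Concretely, the trivial baseline I have in mind would be something like this (hypothetical sketch): type each column independently, with no shared latent representation, by comparing per-type likelihoods.

import numpy as np
from scipy import stats

def baseline_type(x):
    # Gaussian fit for the 'real-valued' hypothesis.
    real_ll = stats.norm.logpdf(x, loc=x.mean(), scale=x.std() + 1e-9).sum()
    # Empirical frequencies for the 'categorical' hypothesis.
    _, counts = np.unique(x, return_counts=True)
    cat_ll = (counts * np.log(counts / counts.sum())).sum()
    return "real-valued" if real_ll > cat_ll else "categorical"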

Ji-Zhang commented 5 years ago

> I think it's pretty clear that it won't do the same if you remove i), because it's a key component. We could try how much it changes in practice, since whether it matters in practice is less clear. If it doesn't change much, the whole model becomes much easier: you could basically remove all the dependencies and replace it with a much, much simpler model. I would assume Isabel tried that first, or did some comparison against it, since it's a pretty trivial baseline.

That makes sense. You are right, sorry I was mistaken. I will change the Bayesian model input back to the whole dataset.