HoloClean / holoclean

A Machine Learning System for Data Enrichment.
http://www.holoclean.io
Apache License 2.0

Repairs are no longer being found as the size of a dataset is increased #41

Closed by j-r77 4 years ago

j-r77 commented 5 years ago

Hi,

I am trying to clean a few dirty rows with respect to 2 denial constraints:

t1&EQ(t1.Sex,"Female")&EQ(t1.Relationship,"Husband")
t1&EQ(t1.Sex,"Male")&EQ(t1.Relationship,"Wife")
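For readers who want to check which rows violate these two constraints independently of HoloClean, here is a minimal sketch in plain Python. It does not use HoloClean's own constraint parser; the `CONSTRAINTS` structure, the `violations` helper, the lowercase column names, and the sample rows are all made up for illustration.

```python
# Hypothetical, HoloClean-independent check of the two denial constraints.
# Each constraint lists (attribute, value) pairs that must not all hold in one row.
CONSTRAINTS = [
    [("sex", "Female"), ("relationship", "Husband")],
    [("sex", "Male"), ("relationship", "Wife")],
]


def violations(rows):
    """Return (row_index, constraint_index) for every violated constraint."""
    found = []
    for i, row in enumerate(rows):
        for j, constraint in enumerate(CONSTRAINTS):
            # A row violates a constraint when every predicate in it is true.
            if all(row.get(attr) == value for attr, value in constraint):
                found.append((i, j))
    return found


# Illustrative rows only; these are not taken from the Adult dataset.
rows = [
    {"sex": "Female", "relationship": "Husband"},
    {"sex": "Male", "relationship": "Husband"},
    {"sex": "Male", "relationship": "Wife"},
]
print(violations(rows))
```

A check like this only flags the conflicting rows. It cannot say which of the two cells is wrong, which is the part the repair model has to infer.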

When I use a small sample of my data (20 rows), I get the following correct repairs:

(11, u'relationship', 'Husband', u'Husband')

#(11, u'sex', 'Female', u'Male')

(4, u'relationship', 'Husband', u'Husband')

#(4, u'sex', 'Female', u'Male')

When I add some more rows, for example 1100, the values no longer change:

(4, u'relationship', 'Husband', u'Husband')

(4, u'sex', 'Female', u'Female')

(11, u'relationship', 'Husband', u'Husband')

(11, u'sex', 'Female', u'Female')

(652, u'relationship', 'Husband', u'Husband')

(652, u'sex', 'Female', u'Female')

(689, u'relationship', 'Wife', u'Wife')

(689, u'sex', 'Male', u'Male')

(504, u'relationship', 'Husband', u'Husband')

(504, u'sex', 'Female', u'Female')

(26, u'relationship', 'Husband', u'Husband')

(26, u'sex', 'Female', u'Female')

(1084, u'relationship', 'Husband', u'Husband')

(1084, u'sex', 'Female', u'Female')

(703, u'relationship', 'Wife', u'Wife')

(703, u'sex', 'Male', u'Male')

The same happens when I run on the full dataset, 48842 rows. The data is the same; e.g., the version with 20 rows just contains the first 20 rows from the full set.

My script is based on the example in tests/, and I use default settings (I tried some tweaking, but this did not solve the issue). Noisy cells are detected correctly, and the generated domains contain multiple possible values.

I do not understand why this happens: which problem or restriction prevents the repairs from happening on the larger versions of the dataset?

Code and data can be found in my fork: https://github.com/j-r77/holoclean

thodrek commented 5 years ago

Hi, I would recommend using the HC version in Dev, as we have fixed many issues there. Can you please let us know if the error persists with the dev version? If so, we would be happy to investigate.

j-r77 commented 5 years ago

Hello,

I have just tried it in the latest dev version, but the issue persists. Values are repaired in the 20-tuple dataset, but the same values no longer change as a larger portion of the data is considered (and the new errors in the larger datasets are not repaired either).

minafarid commented 5 years ago

Hi @j-r77 I am currently working on reproducing and debugging this issue and will get back to you.

fgeerts commented 5 years ago

Hi @minafarid Just wondering whether you managed to reproduce and debug the issue already? Thanks.

thodrek commented 5 years ago

Hi @fgeerts, we are actively working on this issue. It is a bit more intricate than it seems. The issue comes up because the only attributes that are strongly correlated in the Adult dataset are "relationship" and "sex", i.e., the ones present in your constraints (see attached image).
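One way to gauge how strongly two categorical attributes are associated is Cramér's V, computed from their contingency table. The sketch below is a plain-Python illustration of that statistic. It is not necessarily how the attached image was produced, and the `cramers_v` helper and the toy columns are assumptions made for this example.

```python
import math
from collections import Counter


def cramers_v(xs, ys):
    """Cramér's V between two equal-length lists of categorical values."""
    n = len(xs)
    joint = Counter(zip(xs, ys))  # observed count for each (x, y) pair
    count_x = Counter(xs)         # marginal counts
    count_y = Counter(ys)
    chi2 = 0.0
    for x, cx in count_x.items():
        for y, cy in count_y.items():
            expected = cx * cy / n  # count expected under independence
            observed = joint.get((x, y), 0)
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(count_x), len(count_y)) - 1
    # With a single category on either side, association is undefined; report 0.
    return math.sqrt(chi2 / (n * k)) if k > 0 else 0.0


# Toy columns, not the Adult dataset.
sex = ["Male", "Male", "Female", "Female"]
relationship = ["Husband", "Husband", "Wife", "Wife"]
unrelated = ["a", "b", "a", "b"]

print(cramers_v(sex, relationship))  # fully determined by each other
print(cramers_v(sex, unrelated))     # no association
```

Values near 1 indicate that one attribute nearly determines the other, and values near 0 indicate little usable signal between them.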

[Attached image: screenshot, 2018-12-17 11:42:45 AM]

We are actively working on this issue and we will be getting back to you ASAP.

richardwu commented 5 years ago

Hi @j-r77:

We did some digging around, and it seems that the issue lies in the use of InitAttFeaturizer. Because of how we currently do weak supervision, the InitAttFeaturizer feature weights blow up and place too much emphasis on the initial values, which causes no repairs to occur.
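The effect can be illustrated with a toy softmax over a cell's candidate values. This is only a sketch of the general mechanism and not HoloClean's actual model: the `domain_probs` helper, the evidence scores, and the weights are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def domain_probs(domain, init_value, init_weight, other_scores):
    """Probability of each candidate value for one cell.

    The initial-value feature adds init_weight to the score of the value
    currently in the cell; other_scores stands in for all other features.
    """
    scores = [
        other_scores[v] + (init_weight if v == init_value else 0.0)
        for v in domain
    ]
    return dict(zip(domain, softmax(scores)))


domain = ["Female", "Male"]
# Hypothetical evidence from other features that favours repairing to "Male".
evidence = {"Female": 0.0, "Male": 2.0}

for weight in (0.0, 1.0, 10.0):
    probs = domain_probs(domain, "Female", weight, evidence)
    predicted = max(probs, key=probs.get)
    print(weight, predicted, round(probs["Female"], 3))
```

With a small weight, the other evidence decides and the cell is repaired to "Male". Once the weight on the initial value is large enough, the prediction stays at "Female" regardless of the other evidence, which matches the "no repairs" behaviour described above.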

If you pass in the keyword argument learnable=False, you should be able to see better results. We've recently tweaked how we do weak supervision in #43 such that InitAttFeaturizer behaves as intended.

That being said, with this specific dataset, as @thodrek pointed out, there are so few correlated attributes that weak supervision fails to assign confident weak labels, which results in the behaviour you saw before.

In this case HoloClean prefers not to repair any cell, as demonstrated, because it is not confident that any repair is correct given the lack of correlations.

Hope that helps.

fgeerts commented 5 years ago

Hi Richard,

Thanks for looking further into this. We’ll take a closer look at this InitAttFeaturizer. Of course, it makes sense not to repair if no strong signals are present. Thanks again,

-Floris
