Open lordpretzel opened 3 years ago
This notebook shows the issue: only attribute B should be imputed, but C's NULLs get imputed too. Attachment: showcaveatlist.vizier.zip
Still trying to confirm this, but I don't think that the non-selected attributes are getting imputed... I think they're getting cast.
A little more digging: it looks like this isn't imputation, but a replacement of NULL values with 0s to keep SparkML from barfing on the training data. The problem is that the 0s are getting written back into the original data as well.
In terms of attribution, the nullReplacers transformer linked above is 100% for sure the culprit. The way the null value replacement currently happens is broken: we shouldn't be touching the original data. Unfortunately, the null value replacement is buried pretty deeply in the code, and I'm not really sure how to achieve a comparable effect in any other way.
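To make the intended separation concrete, here's a minimal sketch (plain Python, not Vizier's actual code; `fill_nulls` is a hypothetical helper) of the non-destructive pattern: fill nulls only on a copy handed to the trainer, so the original rows stay untouched.

```python
# Hypothetical illustration of the fix pattern, NOT Vizier's actual code.
# The bug is replacing NULLs in-place, which leaks 0s back into the
# original data; the alternative is to fill only a training-time copy.

def fill_nulls(rows, columns, fill=0):
    """Return new row dicts with None replaced by `fill` in the given
    columns; the input rows are never modified."""
    return [
        {k: (fill if k in columns and v is None else v) for k, v in row.items()}
        for row in rows
    ]

data = [
    {"A": 1, "B": None, "C": None},
    {"A": 2, "B": 5,    "C": 7},
]

# Train on a copy with nulls filled in every feature column...
training = fill_nulls(data, columns={"B", "C"})

# ...while the original data (and thus the displayed result) keeps its
# NULLs in non-selected columns.
assert data[0]["C"] is None      # original untouched
assert training[0]["C"] == 0     # training copy filled
```

In Spark terms this would amount to applying the replacement to a derived DataFrame used only for model fitting, rather than to the DataFrame that flows onward in the pipeline.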
I'm surprised that SparkML is not resilient to NULLs, so my instinct is to just delete these lines... but the fact that they're there worries me a bit. What are your thoughts @mrb24 ?
Per discussion w/ @mrb24, this is going to be a deeper fix. Bumping to 1.2.
Migrating Version 1.2 issues to 2.1
To reproduce: create a dataset with at least two attributes containing nulls. Then create a missing value imputation lens and select only one of the attributes to impute. In the result, both attributes are imputed, but only the selected one is caveated.