VizierDB / vizier-scala

The Vizier kernel-free notebook programming environment
Other
34 stars 11 forks source link

Missing value lens always imputes all attributes #110

Open lordpretzel opened 3 years ago

lordpretzel commented 3 years ago

To reproduce create a dataset with two attributes with nulls at least. Then create a missing value imputation lens and select to impute only one of the attributes. In the result both attributes are imputed, but only the selected one is caveated.

lordpretzel commented 3 years ago

This notebook shows the issue, only attribute B should be imputed, but C's NULLs got imputed too showcaveatlist.vizier.zip

okennedy commented 3 years ago

Still trying to confirm this, but I don't think that the non-selected attributes are getting imputed... I think they're getting cast.

okennedy commented 3 years ago

A little more digging: It looks like this isn't an imputation, but a replacement of NULL values with 0s to keep SparkML from barfing on the training data. The problem is that the 0s are getting back into the original data as well.

okennedy commented 3 years ago

https://github.com/UBOdin/mimir-api/blob/master/src/main/scala/org/mimirdb/lenses/implementation/MissingValueLens.scala#L221 f'ref

okennedy commented 3 years ago

In terms of attribution, the nullReplacers transformer linked above is 100% for sure the culprit. Unfortunately, the way that the null value replacement currently happens is broken: We shouldn't be touching the original data. Unfortunately, the null value replacement is buried pretty deeply in the code, and I'm not really sure how to achieve a comparable effect in any other way.

I'm surprised that SparkML is not resilient to NULLs, so my instinct is to just delete these lines... but the fact that they're there worries be a bit. What are your thoughts @mrb24 ?

okennedy commented 3 years ago

Per discussion w/ @mrb24 , this is going to be a deeper fix. Bumping to 1.2

okennedy commented 2 years ago

Migrating Version 1.2 issues to 2.1