InferenceQL / lpm.fidelity

Apache License 2.0

Deal with non overlap more gracefully #4

Closed Schaechtle closed 2 months ago

Schaechtle commented 4 months ago

What does this do?

Assessing fidelity for a pair of columns only really makes sense if the two columns overlap in the reference data, that is, if at least one row has a value for both. For example, in the following, we don't know what the distance between pair_1 and pair_2 should be because we never observe the two columns foo and bar together in the first data frame.

import polars as pl

# Pair from the first data frame: foo and bar are never observed in the same row
pair_1 = (pl.Series("foo", ["a", "a", None, None]),
          pl.Series("bar", [None, None, "y", "y"]))
# Pair from the second data frame: foo and bar are observed together in every row
pair_2 = (pl.Series("foo", ["a", "a", "a", "b"]),
          pl.Series("bar", ["x", "x", "x", "y"]))

By default, comparing such pairs raises an error, which is arguably the safest behavior.

This PR lets users change this default so that a comparison fitting this pattern returns None instead.

Why do we want this?

Empirically, we found that raising an error is not always the right behavior. In particular, non-overlapping column pairs do occur in large, sparse data matrices such as the ones we currently use to train LPMs.

How was this tested?

The second commit in this PR adds tests.