stringdist joins fail with NAs

sfirke commented 8 years ago

df1 <- data.frame(color = c("yellow", "blue", "green", "purple", NA),
                  blah = 1:5, stringsAsFactors = FALSE)

df2 <- data.frame(color = c("yallow", "bloooooo", "gren", "purple", "0range", NA),
                  test = 6:11, stringsAsFactors = FALSE)

inner_join(df1, df2) # works
stringdist_inner_join(df1, df2) # fails

I'm not sure how problematic this is, as in practice it is typically undesirable to have NAs join with NAs. But I'd expected it to behave like dplyr's joins. If this errors intentionally on NAs in the join variables, maybe clarify the error message?

dgrtwo commented 8 years ago

Interesting, I hadn't tried or considered this use case. I agree that it shouldn't be an error, but I'm not 100% comfortable having it imitate dplyr's behavior in this case either. I'd really rather NAs never match anything (since you can't be confident a missing value is any given distance from another missing value).

How would you feel about it skipping all rows with NAs in the joining column? Separately:

What should happen on anti_join? Presumably drop them as well because you can't be confident missing values aren't close. But I often use anti_join to say "what wouldn't join" so it may be a little counterintuitive
Do you think it should show a message, warning, or nothing in cases with missing values?

sfirke commented 8 years ago

I think skipping all rows with NA in the joining column(s) is the best approach, with a message or warning (I'm not sure which) along the lines of "Skipping n rows with NA values in join column(s)".

I suppose there could be NAs records dropped from both tables in say, an inner_join, so maybe returning the count of rows excluded (total from both tables) would be excessively complicated and it could just note "Warning: NA values in join column(s), those rows will be excluded".

The anti_join case is an interesting one. I could see a case for keeping them (since they won't join - I use this function like you do), or dropping them. If dropping them, I'd print a message/warning similar to the above.

rgayler commented 8 years ago

I have just been working with a data set where skipping rows with NAs would be a very bad thing. For my problem domain an NA is better treated as a wildcard. An NA on either side of the join indicates no evidence for or against the match.

I would rather have something like a "useNA =" argument to select between na.omit, treat NA as just another value (like dplyr), or treating NA as a wildcard (if you can come up with a good way to do that).

DataStrategist commented 8 years ago

But how are na's matched at all? I think they should be dropped from all joins with a warning indicating number of dropped nas. For the case where nas shouldnt be dropped... maybe the user can try to fill the values prior to the join?

rgayler commented 8 years ago

@mexindian for my specific problem the right thing to do was treat NA as a wildcard that can match anything but doesn't increase your confidence in the match. My point is that there are three reasonable ways to treat NA in a join and it would be helpful if fuzzyjoin either supported them, or at least let the user choose and warned about what it didn't support. So the three alternatives are: (1) NA is just an ordinary value and only matches NA, (2) NA indicates a problem and the row is skipped, (3) NA is a wildcard and matches anything.

The wildcard interpretation of NA is consistent with this. It's an optimistic match, because it's saying we don't have any information to contradict it being a match.

rubenarslan commented 6 years ago

I'm currently try to merge two datasets with messy texts in several columns. The datasets are supposed to be identical except for some additional columns on either side (what I want to merge).

I want to merge on said texts to rescue some information lost through broken encoding and similar problems in the first dataset. Some of the join columns are NA (in both datasets) whereas others aren't. But a row can still be uniquely and usefully identified through its combination of real values with missing values across multiple columns.

Yet, fuzzyjoin always complains here and I have to first set NAs to some dummy values, which has other undesirable side effects (i.e. the natural dummy value for strings seemed an empty string, but then the stringdist to short strings like No is, well, short).

dgrtwo / fuzzyjoin

stringdist joins fail with NAs #14