Removing incorrect IDs in IndividualID and LocationID columns

SPI-Birds / pipelines

Pipelines for generating a standard data format for bird data

Removing incorrect IDs in IndividualID and LocationID columns #172

Open cwtyson opened 3 years ago

cwtyson commented 3 years ago

In multiple populations, the IndivID and LocationID columns contain IDs that are clearly incorrect. These should be converted to NA and filtered out.

One way to solve this in the case of IndivID would be to remove IDs that are too different from the normal ringing sequence. For example, if IDs are typically of the form 'XX11111', then any ID that does not match this sequence can be treated as NA.

For LocationID, the format may not be as consistent. One possibility would be to look for strings that are much longer than the others and remove these (or check with the data owner).
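A minimal base R sketch of the length check for LocationID. The example values, the variable names, and the "more than twice the median length" threshold are all assumptions for illustration; a real cutoff would need to be chosen per population, possibly with the data owner.

```r
# Hypothetical LocationID values; the last entry is a free-text note, not an ID
location_id <- c("A12", "B07", "C133", "A09", "nest box moved to new site 2015")

# Use the median as a robust estimate of the typical ID length
id_length <- nchar(location_id)
typical_length <- median(id_length)

# Assumed rule: anything more than twice the typical length is suspect
suspect <- id_length > 2 * typical_length

# Keep plausible IDs and set suspect ones to NA
location_clean <- ifelse(suspect, NA_character_, location_id)
location_clean
```

Printing `location_id[suspect]` before overwriting anything gives a short list that could be sent to the data owner for checking.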

LiamDBailey commented 3 years ago

See possible example below:

library(dplyr)
# case_when() has no fallback here, so IDs that fail the regex become NA
data.frame(ID = c("ABC1234", "CCC3456", "no_ident_abc123", "AB-5432", "ABC1234/ABC4567")) %>% 
    mutate(correctID = dplyr::case_when(stringr::str_detect(ID, pattern = "^[A-Z]{3}[0-9]{4}$") ~ ID))

This is quite a strict example that would also turn IDs with small typos (e.g. row 4) into NA; however, we don't want to spend our time writing a script to cover every possible mistake the data owner might make, so this conservative approach seems best.

cwtyson commented 3 years ago

That looks like a good general approach. We could also do some fuzzy matching to make it a bit less strict, for instance allowing one typo. str_find() from 'sjmisc' can do this, though that is not a package we are already using. Perhaps you know of other options?

LiamDBailey commented 3 years ago

Hmm I know a few packages that can do fuzzy string matching/calculate string distance, but I don't think these take regex input. Can you write out a reprex for sjmisc?
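For context, a small sketch of what a plain string-distance calculation looks like, using base R's adist() on literal strings. The reference ID "ABC5432" is made up for illustration. The point is that a distance like this needs a concrete string to compare against, whereas here we only know the expected format.

```r
# Edit distance between a mistyped ID and a hypothetical correct ID
typo_distance <- utils::adist("AB-5432", "ABC5432")[1, 1]

# Edit distance between a double entry and a single ID
double_entry_distance <- utils::adist("ABC1234/ABC4567", "ABC1234")[1, 1]

c(typo = typo_distance, double_entry = double_entry_distance)
```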

cwtyson commented 3 years ago

Looks like base R's agrep() might be an option. I didn't realize it could take regular expressions as patterns. I thought it was only character strings.

cwtyson commented 3 years ago

I can have a go at a reprex.

LiamDBailey commented 3 years ago

agrep() looks perfect. I also thought this didn't take regex, but looks like this is something that has been added (see old help doc here). That would be much preferable as we can avoid added dependencies.

cwtyson commented 3 years ago

Here is something that would return rows 1, 2, and 4 from your earlier example:

library(dplyr)
data.frame(ID = c("ABC1234", "CCC3456", "no_ident_abc123", "AB-5432", "ABC1234/ABC4567")) %>%
    dplyr::mutate(correctID = dplyr::case_when(
        # agrep() returns the indices of IDs that approximately match the regex
        ID %in% ID[agrep(pattern = "^[A-Z]{3}[0-9]{4}$", x = ID, fixed = FALSE, value = FALSE)] ~ ID))
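A possible variant in base R only, shown side by side with the strict check. It uses agrepl(), the logical counterpart of agrep(), and sets max.distance explicitly to one edit instead of relying on the default tolerance. The variable names are illustrative, and the choice of one edit is an assumption taken from the "allowing one typo" suggestion earlier in the thread. The fuzzy result is worth inspecting on real data before adopting it.

```r
ids <- c("ABC1234", "CCC3456", "no_ident_abc123", "AB-5432", "ABC1234/ABC4567")
id_pattern <- "^[A-Z]{3}[0-9]{4}$"

# Strict check: the ID must match the expected format exactly
exact <- grepl(id_pattern, ids)

# Approximate check: allow at most one insertion, deletion or substitution
fuzzy <- agrepl(id_pattern, ids, max.distance = 1, fixed = FALSE)

# Compare the two checks row by row
data.frame(ID = ids, exact = exact, fuzzy = fuzzy)
```

Because agrepl() returns a logical vector, it can be used directly as a condition, without the `ID %in% ID[...]` step.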

LiamDBailey commented 3 years ago

@cwtyson it would be good to implement this for one pipeline as a proof of concept (perhaps PEW?).

- Convert IDs that don't match the expected syntax into NA
- Update test-XXX.R so that we test it is working
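The two tasks above could be sketched roughly as follows. clean_id() is a hypothetical helper, not an existing function in the pipelines, and the default pattern is the example format from this thread rather than the real format for any population. The testthat call is guarded so the snippet still runs where testthat is not installed.

```r
# Hypothetical helper: keep IDs that match `pattern`, replace the rest with NA
clean_id <- function(id, pattern = "^[A-Z]{3}[0-9]{4}$") {
  id <- as.character(id)
  # grepl() returns FALSE for NA input, so missing IDs stay NA
  id[!grepl(pattern, id)] <- NA_character_
  id
}

example_ids <- c("ABC1234", "CCC3456", "no_ident_abc123", "AB-5432",
                 "ABC1234/ABC4567", NA)
cleaned <- clean_id(example_ids)

# The kind of check that could go into test-XXX.R
if (requireNamespace("testthat", quietly = TRUE)) {
  testthat::test_that("IDs that do not match the expected syntax become NA", {
    testthat::expect_equal(cleaned, c("ABC1234", "CCC3456", NA, NA, NA, NA))
    testthat::expect_type(cleaned, "character")
  })
}
```

Keeping the pattern as an argument would let each pipeline pass its own ring-number format.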