Open cwtyson opened 3 years ago
See possible example below:
library(dplyr)
data.frame(ID = c("ABC1234", "CCC3456", "no_ident_abc123", "AB-5432", "ABC1234/ABC4567")) %>%
  dplyr::mutate(correctID = dplyr::case_when(stringr::str_detect(ID, pattern = "^[A-Z]{3}[0-9]{4}$") ~ ID))
This is quite a strict example that would exclude small typos (e.g. row 4). However, we don't want to spend our time writing a script to cover every possible mistake the data owner might make, so this conservative approach seems best.
That looks like a good general approach. We could also do some fuzzy matching to make it a bit less strict, for instance by allowing one typo. str_find() from 'sjmisc' can do this, though that is not a package we are already using. Perhaps you know of other options?
Hmm, I know a few packages that can do fuzzy string matching/calculate string distance, but I don't think these take regex input. Can you write out a reprex for sjmisc?
Looks like base R's agrep() might be an option. I didn't realize it could take regular expressions as patterns. I thought it was only character strings.
I can have a go at a reprex.
agrep() looks perfect. I also thought this didn't take regex, but it looks like this has been added (see old help doc here). That would be much preferable, as we can avoid added dependencies.
Here is something that would return rows 1, 2, and 4 from your earlier example:
library(dplyr)
data.frame(ID = c("ABC1234", "CCC3456", "no_ident_abc123", "AB-5432", "ABC1234/ABC4567")) %>%
  dplyr::mutate(correctID = dplyr::case_when(ID %in% ID[agrep(pattern = "^[A-Z]{3}[0-9]{4}$", x = ID, fixed = FALSE, value = FALSE)] ~ ID))
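A possible variation on the snippet above, as a rough sketch only: the call relies on agrep()'s default max.distance, so it may be worth making the tolerance explicit. The version below uses base R's agrepl() and passes max.distance as an integer, on the assumption that at most one edit should be tolerated. The names ids, id_pattern, is_close and correct_id are made up for illustration, and the right tolerance would need checking against real data for each pipeline.

```r
# Example IDs from earlier in the thread
ids <- c("ABC1234", "CCC3456", "no_ident_abc123", "AB-5432", "ABC1234/ABC4567")
id_pattern <- "^[A-Z]{3}[0-9]{4}$"

# Approximate regex match; max.distance = 1L is an assumed tolerance
# (one insertion, deletion or substitution in total), not a tested default.
is_close <- agrepl(id_pattern, ids, max.distance = 1L, fixed = FALSE)

# Keep IDs that are close enough to the pattern, set the rest to NA
correct_id <- ifelse(is_close, ids, NA_character_)
```

IDs that match the pattern exactly are always kept. Which near misses get through depends on the tolerance, so the result is worth inspecting before applying this to a whole pipeline.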
@cwtyson it would be good to implement this for one pipeline as a proof of concept (perhaps PEW?).
In multiple populations, the IndivID and LocationID columns contain IDs that are clearly incorrect. These should be set to NA and filtered out.
One way to solve this in the case of IndivID would be to remove IDs that are too different from the normal ringing sequence. For example, if IDs are typically of the form 'XX11111', then any ID that does not match this sequence can be treated as NA.
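As a minimal sketch of this idea, assuming the 'XX11111' form means two capital letters followed by five digits: the helper clean_indiv_id() and its default pattern are hypothetical, and each population would need its own pattern.

```r
# Hypothetical helper: keep IDs that match the expected ring pattern,
# replace everything else with NA.
clean_indiv_id <- function(id, pattern = "^[A-Z]{2}[0-9]{5}$") {
  id <- as.character(id)
  # grepl() returns FALSE for NA input, so missing IDs stay NA
  ifelse(grepl(pattern, id), id, NA_character_)
}

# Made-up example values, not real ring numbers
example_ids <- c("XX11111", "AB12345", "unknown", "AB1234", NA)
cleaned <- clean_indiv_id(example_ids)
```

Here the last three values become NA: one is free text, one is a digit short, and one was already missing.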
For LocationID, there may not be as consistent a sequence. One possibility would be to look for strings that are much longer than usual and remove these (or check with the data owner).
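One rough way to flag unusually long LocationID values is to compare each string's length with the median length. This is only a sketch: the helper flag_long_location_id() and the max_extra_chars threshold are hypothetical and would need tuning per population.

```r
# Hypothetical helper: flag location IDs that are much longer than typical.
flag_long_location_id <- function(location_id, max_extra_chars = 5) {
  n_chars <- nchar(as.character(location_id))
  typical <- stats::median(n_chars, na.rm = TRUE)
  # Missing IDs are never flagged; they are already NA
  !is.na(n_chars) & n_chars > typical + max_extra_chars
}

# Made-up example values
example_locations <- c("A12", "B07", "C110", "nest box moved to north plot in 2015", NA)
flagged <- flag_long_location_id(example_locations)
```

Flagged values could then be set to NA or collected into a list to check with the data owner, rather than being dropped silently.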