Huh / collar

Utilities for exploring telemetry data

Find duplicates #3

Open Huh opened 6 years ago

Huh commented 6 years ago

We need a function to find duplicate entries in the GPS data. This could be very simple, but some thought should be devoted to the implementation and to what action is taken when duplicates are found.
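
For example (purely a sketch, with gps, id, and dt as hypothetical names for the fix data frame, animal-ID column, and timestamp column), the finding step might be as simple as flagging rows that repeat an earlier animal/timestamp pair:

#  Sketch only: gps, id, and dt are stand-ins for the real names
dups <- duplicated(gps[, c("id", "dt")])
sum(dups)     # how many duplicates there are to report
gps[dups, ]   # inspect the duplicate rows themselves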

foresthayes commented 6 years ago

I'm struggling to think of reasons to retain duplicate rows. In my current workflow I always download all collar data (so you don't miss collars with no new points), combine it with a local database, and remove duplicate rows with something like unique(data[1:length(data[1, ])]), which is effectively just unique(data) since 1:length(data[1, ]) selects every column.
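
Roughly, that workflow looks like this (read_collar_data() and local_db are placeholders, not real functions or objects):

#  Sketch of the workflow; read_collar_data() and local_db stand in
#  for the real download and storage steps
new_pts <- read_collar_data()          # everything currently on the server
all_pts <- rbind(local_db, new_pts)    # combine with what we already have
all_pts <- unique(all_pts)             # drop rows we have already stored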

Thoughts?

Huh commented 6 years ago

I agree that they are not really useful. The point of wrapping this in our own function rather than calling a unique-style function directly is that we can report to the user how many duplicates were found, and, in the event someone is studying or otherwise cares about duplicates, we can choose not to remove them without having to modify other code.
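
As a rough sketch of what I mean (the function name and the remove argument are just illustrative, not a proposed API):

#  Illustrative only: count and report duplicates,
#  and let the caller decide whether to keep them
cllr_find_duplicates <- function(x, remove = TRUE) {
  dups <- duplicated(x)
  message(sum(dups), " duplicate rows found")
  if (remove) x[!dups, , drop = FALSE] else x
}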

I like the tidyverse and would argue that relying on those packages is a decent bet (i.e., they will be maintained, etc.). If we go tidy, then I would use:

cllr_rm_duplicates <- function(x, ...) {
  #  Thin wrapper around distinct(); assumes dplyr is attached (for %>%)
  x %>%
    dplyr::distinct(...)
}

#  Example calls
cllr_rm_duplicates(df, val)  # dedupe on the val column
cllr_rm_duplicates(df)       # dedupe across all columns
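
One caveat with this version: distinct() keeps only the columns named in ..., so cllr_rm_duplicates(df, val) returns just the val column. If the contract is data frame in, data frame out, we could pass dplyr's .keep_all = TRUE through, e.g. in a variant like this (the _keep name is just for illustration):

#  Variant sketch: dedupe on the named columns but keep every column
cllr_rm_duplicates_keep <- function(x, ...) {
  x %>%
    dplyr::distinct(..., .keep_all = TRUE)
}

cllr_rm_duplicates_keep(df, val)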

If we want to go base R, then I would like to consider the differences between the various unique()-style calls and base::duplicated():

cllr_rm_duplicates <- function(x, ...) {
  #  Keep only the rows that are not duplicates;
  #  ... is passed through as the column index of [
  x[!duplicated(x), ...]
}

#  Example call
cllr_rm_duplicates(df, "val")  # a single selected column comes back as a vector

I don't like that this approach returns a vector. What goes in should come out: I don't want to have to guess at what the output type is, as I believe it should be perfectly consistent.
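
For what it's worth, a small change to the base version would guarantee a data frame comes back; the cols argument here is just one way to sketch it:

#  Sketch: always returns a data frame, never a bare vector
cllr_rm_duplicates <- function(x, cols = names(x)) {
  x[!duplicated(x), cols, drop = FALSE]
}

cllr_rm_duplicates(df, "val")  # one-column data frame, not a vector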

Those are my thoughts.