HenrikBengtsson / Wishlist-for-R

Features and tweaks to R that I and others would love to see - feel free to add yours!
https://github.com/HenrikBengtsson/Wishlist-for-R/issues
GNU Lesser General Public License v3.0

Fast check for discreteness #145

Open mayer79 opened 1 year ago

mayer79 commented 1 year ago

In statistics, constructions like this are quite common:

if (length(unique(x)) > 27) {
  # Some binning, e.g. x <- cut(x, breaks = 27)
}

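If x is long and continuous, calling unique() seems inefficient, even though it uses hashing: it still has to process every element and return all distinct values, while only a comparison against a small threshold is needed.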

It would therefore be fantastic to have a function nunique(x, nmax=length(x)). It would stop early and return nmax as soon as the number of distinct values found reaches nmax.

Example: If x is continuous with 1e10 distinct values, nunique(x, 27) would return 27 after looking at only the first 27 elements, so the operation would have complexity O(nmax) instead of O(length(x)).
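
To make the idea concrete, here is a minimal sketch of such a function in plain R. It is only an illustration of the requested behaviour, not an existing base R function: the name nunique() and its nmax argument come from the proposal above, while the chunk_size argument and the chunk-wise strategy are assumptions added here. A real implementation would presumably live in C and stop inside the hashing loop.

```r
# Hypothetical helper (not part of base R): counts the distinct values of x,
# but stops scanning as soon as nmax distinct values have been seen.
# chunk_size is an assumed tuning parameter, not part of the proposal.
nunique <- function(x, nmax = length(x), chunk_size = 1024L) {
  n <- length(x)
  seen <- x[0L]  # empty vector of the same type as x
  start <- 1L
  while (start <= n && length(seen) < nmax) {
    end <- min(start + chunk_size - 1L, n)
    # Merge the distinct values of the next chunk into those seen so far
    seen <- unique(c(seen, x[start:end]))
    start <- end + 1L
  }
  # Cap the result at nmax, since the last chunk may overshoot it
  min(length(seen), nmax)
}

set.seed(1)
x <- runif(1e6)
nunique(x, 27)          # 27, after scanning only the first chunk
nunique(c(1, 2, 2, 3))  # 3
```

With this sketch the early exit costs roughly O(chunk_size + nmax) for continuous data, rather than exactly O(nmax). A vector with fewer than nmax distinct values is still scanned completely, which is unavoidable.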

Note: unique() has an argument nmax, but it seems to be meant for sizing the memory allocation of the hash table, so it is probably not a safe way to achieve this.