dcomtois / summarytools

R Package to Quickly and Neatly Summarize Data

Indicate column where "Number of Distinct values" = "Total Number of rows" #170

Open dangus-aktivbo opened 2 years ago

dangus-aktivbo commented 2 years ago

When scanning numeric columns, I want to quickly find out which columns have a distinct value on every row.

dfSummary is useful for scanning columns quickly and figuring out the structural and statistical properties of each column. Normally, when I dig into datasets, I try to quickly find out whether natural keys, like social security number, housing address, customer id etc., are duplicated. The simplest way right now is to do a count-distinct (e.g. n_distinct(x) in dplyr) and compare the number of distinct values to the number of rows in the data frame. I'm using dfSummary a lot, and think this would be a super enhancement.
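The manual check described above can be sketched roughly as follows. The data frame and its column names are made up for illustration, and base R's `length(unique(x))` is used in place of `dplyr::n_distinct(x)` so the snippet has no dependencies; both count `NA` as a distinct value by default.

```r
# Hypothetical example data: customer_id is a candidate key, the others are not.
df <- data.frame(
  customer_id = c(101, 102, 103, 104),
  city        = c("Oslo", "Bergen", "Oslo", "Bergen"),
  amount      = c(10.5, 20.0, 10.5, 7.25)
)

# Count distinct values per column; length(unique(x)) is the base R
# counterpart of dplyr::n_distinct(x).
distinct_counts <- vapply(df, function(x) length(unique(x)), integer(1))

# A column is "all distinct" when its distinct count equals the row count.
all_distinct <- distinct_counts == nrow(df)

# Names of the columns that could serve as a key.
names(all_distinct)[all_distinct]
```

Doing this for every data frame is the step the requested dfSummary indicator would make unnecessary.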

One possible solution is to add a "% distinct" value on the marked columns, since you already have a (% of valid) in the column header. Another is a flag, such as a string saying "Unique" or "(all unique)". For now I have to check the Freqs against the row count, which of course is just a minor inconvenience... Anyway.
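To make the two options concrete, here is a minimal sketch of what such a label could look like. `distinct_label` is a hypothetical helper, not part of summarytools, and the example data is invented; the wording of the strings is only a placeholder.

```r
# Hypothetical helper, not part of summarytools: returns either a flag
# string or a "% distinct" string for a single column.
distinct_label <- function(x) {
  n_dist <- length(unique(x))  # NA counts as one distinct value here
  if (n_dist == length(x)) {
    "All distinct values"
  } else {
    sprintf("%.1f%% distinct", 100 * n_dist / length(x))
  }
}

# Made-up data: id is unique on every row, group is not.
df <- data.frame(
  id    = 1:5,
  group = c("a", "a", "b", "b", "b")
)

labels <- vapply(df, distinct_label, character(1))
labels
```

With this data, `id` gets the flag and `group` gets a percentage (2 distinct values out of 5 rows).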

image

dcomtois commented 1 year ago

This is a good idea. I'd go for "All distinct values". However, a new term ("All") will need to be added to the translations dataset, which will require some work. Help is always welcome.