carloscinelli / benford.analysis

Tools that make it easier to use Benford’s law for data validation and forensic analytics.
61 stars 15 forks source link

Write vignette #4

Open carloscinelli opened 9 years ago

carloscinelli commented 9 years ago

Write package vignette.

Henry-E commented 8 years ago

It would be really helpful if you could outline how to interpret the values returned from the print and plot functions

Henry-E commented 8 years ago

Hey, I made up an rmarkdown document comparing a couple of different distributions that typically follow benfords law (lognormal with high standard deviation, exponential) and some that don't (lognormal with small standard deviation, normal, uniform).

In terms of using the output of the print and plot functions to determine which groups of numbers follow benfords law, I'm having trouble understanding how to interpret them succesfully. Aside from the classic digits distribution chart, it's hard to see what indicators could be used in identifying rogue sets of numbers. I know you say to not focus on the p-values but for many of the distrubtions that I've looked at in practice that appear to follow Benford's law, still have tiny p-values (< 1e5 in many cases).

Do you have any reccomedations or tips for how you use the package. Like yourself, I'm trying to apply this in a central bank.

---
title: "Benford's law, digit analysis"
output: 
  html_document:
    toc: true
    toc_float: true
# params:
#   numberOfDigits: 1
#   columnToGroupBy: NA
#   data: NA
---

```{r setup, echo=FALSE,  warning=FALSE, message=FALSE}
knitr::opts_chunk$set(echo = FALSE)
library(dplyr)
library(stringr)
library(benford.analysis)
library(pander)
options(stringsAsFactors = FALSE)

numObs <- 1e4
data <- tbl_df(bind_rows(data.frame(value = runif(numObs), distribution = "uniform"), 
                         data.frame(value = abs(rnorm(numObs)), distribution = "normal"),
                         data.frame(value = rexp(numObs, rate = 1), distribution = "exponential"),
                         data.frame(value = rlnorm(numObs, meanlog = 1, sdlog = 0.6), distribution = "lognormal - narrow"),
                         data.frame(value = rlnorm(numObs, meanlog = 10, sdlog = 100), distribution = "lognormal - wide")))
columnToGroupBy <- c("distribution")

params <- list()
params$numberOfDigits <- 1
params$columnToGroupBy <- columnToGroupBy
params$data <- data

# we tried to use NULL here before but that leads to an issue
# https://github.com/rstudio/rmarkdown/issues/729
if(params$columnToGroupBy == "NA" || params$data == "NA")
  stop("missing column name(s) or data")

if(ncol(params$data) < 2)
  stop("not enough columns in the data")

areColumnsPresent <- c(params$columnToGroupBy, "value") %in% colnames(params$data)
if(!all(areColumnsPresent))
  stop(str_c(c("columns requested are not present:", params$columnToGroupBy), collapse = "\n"))

dataTypeOfValueColumn <- sapply(params$data, class)["value"]
if(dataTypeOfValueColumn != "integer" && dataTypeOfValueColumn != "numeric")
  stop(str_c("Data type of the value column should be integer or numeric, instead it is: ", dataTypeOfValueColumn))

if(params$numberOfDigits > 2)
  stop(str_c("checking that number of digits is probably not a great idea ", params$numberOfDigits))
​​​​```

```{r benfordAnalysis, results = "asis"}
# the do() function applies the benford function to each group
# and create a list of S3 objects
# https://stat545-ubc.github.io/block023_dplyr-do.html#meet-do
allTheBenfords <- params$data %>%
  group_by_(.dots = params$columnToGroupBy) %>%
  do(benford =  benford(.$value, number.of.digits = params$numberOfDigits))

# order them by the Mean Absolute Deviation, it's not a totally accurate way to do
# it but it's better than nothing. I couldn't figure out how to do it as part
# the dplyr chain, the S3 object subsetting got the better of me. I was able to
# use do and the .(dot) operator to get a column of the values but it deleted all
# the other columns for some reason and I couldn't use the same syntax with mutate
allTheBenfords <- allTheBenfords[order(sapply(allTheBenfords$benford,
                                              function(x) x[["MAD"]]), decreasing = TRUE), ]
# We're testing out ranking by the mantissa arc 
# allTheBenfords <- allTheBenfords[order(sapply(allTheBenfords$benford, 
#                                               function(x) x$stats$mantissa.arc.test$p.value), decreasing = TRUE), ]

for(observation in seq(nrow(allTheBenfords))){

  thisBenfordAnalysis <- allTheBenfords$benford[[observation]]
  cat("\n\n## ", allTheBenfords[observation, params$columnToGroupBy][[1]], "\n\n")

  plot(thisBenfordAnalysis)
  cat("\n")

  # cribbed from the print function for benford.analysis class, which does not
  # print well in rmarkdown
  # https://github.com/carloscinelli/benford.analysis/blob/master/R/functions-new.R
  overallDescription <- data.frame(description =
                                  c("Number of observations used", 
                                     "Number of obs. for second order", 
                                     "First digits analysed"),
                                  value =
                                   c(thisBenfordAnalysis[["info"]]$n, 
                                     thisBenfordAnalysis[["info"]]$n.second.order,
                                     thisBenfordAnalysis[["info"]]$number.of.digits))
  cat("\n#### Overall description of analysis\n")
  cat(pandoc.table(overallDescription))

  cat("\n\n#### Mantissa arc test \n")
  mantissaTable <- thisBenfordAnalysis$mantissa
  mantissaTable$statistic <-  gsub("Mantissa|\\s", " ", mantissaTable$statistic)
  mantissaTable$`expected value` <- c(0.5, 0.08333, -1.2, 0)
  cat(pandoc.table(mantissaTable))
  cat("\n\n#### Numerical deviations")
  how.many <- 5
  cat("\nThe", how.many, "largest deviations: \n\n")
  cat(pandoc.table(round(head(thisBenfordAnalysis[["bfd"]][order(absolute.diff, decreasing=TRUE)][,list(digits, absolute.diff)], how.many), 2)), sep="\n")
  deviationStats <- data.frame(description =
                                 c("Mean Absolute Deviation",
                                   "Distortion Factor"),
                               value =
                                 c(thisBenfordAnalysis[["MAD"]],
                                   thisBenfordAnalysis[["distortion.factor"]]))
  cat(pandoc.table(deviationStats))

  cat("\n\n#### Statistics\n")
  cat("\n\n##### Pearson's Chi-squared test\n")
  chisqStats <- data.frame(description =
                            c("X-squared",
                              "df",
                              "p-value <"),
                           value =
                             c(thisBenfordAnalysis[["stats"]]$chisq$statistic,
                               thisBenfordAnalysis[["stats"]]$chisq$parameter,
                               thisBenfordAnalysis[["stats"]]$chisq$p.value))
  cat(pandoc.table(chisqStats))
  cat("\n\n##### Mantissa Arc Test\n")
  mantissaStats <- data.frame(description = 
                                c("L2",
                                  "df",
                                  "p-value <"),
                              value = 
                                c(thisBenfordAnalysis[["stats"]]$mantissa.arc.test$statistic,
                                  thisBenfordAnalysis[["stats"]]$mantissa.arc.test$parameter,
                                  thisBenfordAnalysis[["stats"]]$mantissa.arc.test$p.value))
  cat(pandoc.table(mantissaStats))
  cat("\n\nRemember: Real data will never conform perfectly to Benford's Law. You should not focus on p-values!\n\n")
}
​​​​```
carloscinelli commented 8 years ago

Thanks for the comment, Henry! I will take a look at your example and try to explain it better. But just to give a quick answer, the focus here is to have one more indicator (which you will combine with other indicators) to find suspicious data that will need further investigation. You shouldn't use it to get a clear cut "yes/no" answer about the quality of a data set or even a single data point.

vonjd commented 4 years ago

When will the vignette be ready?