how to get the result of two-sided Fisher’s Exact Test

weizhiting commented 7 years ago

Hi, i am reading your "Unexpected mutations after CRISPR-Cas9 editing in vivo” are most likely pre- existing sequence variants and not nuclease-induced mutations" paper. In your paper,"in comparing the distribution of known dbSNP variants to “novel” sequence variants (including the ones attributed to CRISPR-Cas9 by Schaefer et al.; see Supplementary Note 1), we could not reject (p = 0.304; two-sided Fisher’s Exact Test) the null hypothesis that the rates of shared pairwise genotypes with these “novel” sequence variants (Fig. 1d) differed from the distribution of dbSNP variants", could you please show me the command line? I can not find in the github. Thanks!

caleblareau commented 7 years ago

Hey @weizhiting,

See the R code here: https://github.com/aryeelab/crispr_mutation_reanalysis/blob/master/SNV_reanalysis/code_plots/mainText/Figure1.R#L58

The numbers can realized from the following:

Calebs-MacBook-Pro-2:main_text lareauc$ pwd
/Volumes/dat/Research/AryeeResearch/crispr_mutation_reanalysis/SNV_reanalysis/new_processed_calls/main_text
Calebs-MacBook-Pro-2:main_text lareauc$ wc -l *
    9149 dbSNP.F03.bed
    8975 dbSNP.F05.bed
   12954 dbSNP.FVB.bed
   12003 novel.F03.bed
   10881 novel.F05.bed
   16097 novel.FVB.bed
   70059 total

Thanks for taking a closer look at the data... Don't hesitate to ask any other questions! Myself, @martinaryee, @lucapinello, @kclem, and the other co-authors would be happy to pass along any other information that would be helpful!

weizhiting commented 7 years ago

Thank you for your reply,but when i use matrix(c(12954, 9149, 8975, 16097, 12003, 10881), 2, 2) do the fisher exat test, the p-value is smaller than 2.2e-16; i am a bit confused!

caleblareau commented 7 years ago

Did you see Line 58 that I linked?

> fisher.test(matrix(c(12954, 31079-12954, 16097, 38981- 16097),nrow=2))

    Fisher's Exact Test for Count Data

data:  matrix(c(12954, 31079 - 12954, 16097, 38981 - 16097), nrow = 2)
p-value = 0.3047
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 0.9856526 1.0473656
sample estimates:
odds ratio 
  1.016049

weizhiting commented 7 years ago

Yes,I see.But i just can not understand why do not use matrix(c(12954, 9149, 8975, 16097, 12003, 10881), 2, 2) do the fisher exat test.Thanks!

caleblareau commented 7 years ago

First note that the code chunk that you posted probably doesn't give you what you want:

>  matrix(c(12954, 9149, 8975, 16097, 12003, 10881), 2, 2)
      [,1]  [,2]
[1,] 12954  8975
[2,]  9149 16097

For a Fisher's exact test, you want to structure a contingency table that partitions your observations into four total categories. (https://en.wikipedia.org/wiki/Fisher%27s_exact_test) For our test, we did the following:

                            in dbSNP         in Novel

Unique to FVB                 a                 b

Not unique to FVB             c                 d

Total number of novel variants: 38981 (12003+10881+16097) Total number of dbSNP variants: 31079

These two numbers should reflect our column sums. Then to populate a, b, c, and d, we note that there are 12954 variants unique to FVB in dbSNP and 16097 variants unique to FVB that are "novel". These two numbers make up a and b in the contingency table. c and d then would be the colsums minus a and b, which is what we've shown in the text snippet above / in our code.

@weizhiting is this reasonably clear? You can't just put all those numbers in to run the Fisher test. You could do something similar to what you did (and indeed, it's what we did in our chi-squared test) but requires two degrees of freedom. We did this to compare our observed ratios to the isogenic model.

caleblareau commented 7 years ago

@weizhiting another way of thinking about it is that for the Fisher's test, we want to answer "do the observed variant proportions observed in FVB at novel loci differ significantly from the proportions of dbSNP loci?"

Under the null, we assume this to be true, and our level of statistical evidence suggests, even with a large sample size, that we cannot reject this. Thus, we interpret that "novel" loci are no more likely to have an alternate allele than dbSNP loci.

weizhiting commented 7 years ago

got it. Thank you very much!

aryeelab / crispr_mutation_reanalysis

how to get the result of two-sided Fisher’s Exact Test #1