Calculate nucleotide diversity during CoVizu pipeline

ebrintn commented 1 year ago

In performing assessment of the HUNePIE pipeline, we discovered that predictions are better if nucleotide diversity (pi) is included in the summary statistics. This measure is taken directly from sequences, however, @ArtPoon and I believe that all the information needed to calculate pi is already produced prior to tree construction in the CoVizu pipeline. Unfortunately, I don't know where to even begin to look in the CoVizu pipeline for this information.

ebrintn commented 1 year ago

The formula for pi:

Taken from: https://en.wikipedia.org/wiki/Nucleotide_diversity

GopiGugan commented 1 year ago

https://github.com/PoonLab/covizu/blob/bae56cc9b60f7c90b1e2d9e3229ee941d6dd8040/covizu/clustering.py#L85-L99

ArtPoon commented 1 year ago

Generated a test file:

(venv) art@orolo:~/git/covizu$ python3 covizu/utils/gisaid_utils.py --infile data/provision.2023-02-11T00\:05\:43.json.xz --debug 100000 --recode iss452.json
🏄 [0:00:00.000010] Processing GISAID feed data
🏄 [0:00:00.846075] aligned 0 records
🏄 [0:00:29.917626] aligned 10000 records
🏄 [0:00:57.618343] aligned 20000 records
🏄 [0:01:25.780476] aligned 30000 records
🏄 [0:01:52.683926] aligned 40000 records
🏄 [0:02:14.646707] filtered 52847 problematic features
🏄 [0:02:14.646764]          35751 genomes with excess missing sites
🏄 [0:02:14.646771]          6346 genomes with excess divergence

Examined resulting JSON in Python, BC.1 seems like a good target for testing (115 variants).

ArtPoon commented 1 year ago

new branch iss452, commit 9e03634b0801f48ee0b22f03b1731f92d44d6353

ArtPoon commented 1 year ago

https://github.com/PoonLab/covizu/blob/fb19dc51d8a59a495a45e2fb4ac9fb286b714502/covizu/clustering.py#L230-L247

ArtPoon commented 1 year ago

@ebrintn - can you check whether this function works? Also still need to write code to call get_diversity() as part of the batch workflow (batch.py)

ebrintn commented 1 year ago

Yep - I will verify that the output is the same as what I am getting with the get.diversity function in R that I was already using

ebrintn commented 1 year ago

I ran 50 replicates calculating the diversity using both methods. I found that across the board the method @ArtPoon implemented overestimated the value by around 3%.

    python        R
1  12.469792 12.021212
2  11.401003 11.193939
3  19.924366 18.918384
4  11.881092 11.635556
5  12.212287 11.448889
6  33.533111 32.192323
7  25.637650 24.947071
8  17.847898 17.376768
9  23.498803 22.882222
10 18.299360 17.743636
11 14.643609 14.104242
12 21.484640 20.470707
13 19.819939 19.070707
14 14.835817 14.578182
15 14.051908 13.730303
16  9.851117  9.316162
17 18.324701 17.280404
18 14.010638 13.575152
19 16.328878 15.791919
20 29.043776 27.629495
21 18.268783 17.815354
22 12.448117 11.611313
23 18.666054 17.716162
24 11.934224 11.523232
25 12.434631 11.768485
26 18.869863 17.927273
27 26.850125 25.554545
28 18.492231 18.290707
29 29.098357 27.925051
30 18.584740 18.067071
31 17.086383 16.535152
32 24.147870 22.965051
33  9.858696  9.527879
34 11.482484 11.137778
35 20.706488 20.130505
36 23.054525 22.792727
37 15.359176 14.893939
38 27.916235 26.805859
39 11.027364 10.505253
40  9.881735  9.857374
41 17.152107 16.728485
42 14.568644 14.195758
43 17.723420 17.050909
44 16.343637 15.844646
45 23.281816 22.726667
46 20.493957 19.114343
47 17.946422 17.420606
48 10.177431  9.278384
49  7.974735  6.836768
50 10.820440 10.098788

ArtPoon commented 1 year ago

Important to note that the method I implemented is counting insertions and deletions in addition to nucleotide substitutions

ArtPoon commented 1 year ago

ebrintn commented 1 year ago

Important to note that the method I implemented is counting insertions and deletions in addition to nucleotide substitutions

That would be why, the diversity.stats does not include indels. Is there a way to not include the indels? Or do we even need to? I feel like it's not that different.

ebrintn commented 1 year ago

Actually, it might not be this because I just checked a file that had no indels

ebrintn commented 1 year ago

The calculation of pi in the diversity.stats package


if(pi){

 rownames(matrix_hap) <- rownames(matrix_pol)

for(xx in 1:npops){

   freqq    <- sfreqh[xx, ]/length(populations[[xx]]) 
   freqq    <- freqq[which(freqq!=0)]

   if(nh[xx]>1){vergl <- combn(nh[xx],2)}else{PIW_nei[xx]<-0;next;}   

   res      <- apply(vergl,2,function(x){
   hap1     <- matrix_hap[names(freqq[x[1]]),]
   hap2     <- matrix_hap[names(freqq[x[2]]),]
   div      <- hap1!=hap2 # ohne substituitionsmodell
   div      <- sum(div, na.rm=TRUE)
   return(2*freqq[x[1]]*freqq[x[2]]*div)               
   })
n            <- length(populations[[xx]])
PIW_nei[xx]  <- (n/(n-1))*sum(res,na.rm=TRUE)

}

I should mention that I noticed some of this code was in German so it's not fully understandable

ArtPoon commented 1 year ago

it's difficult to work that out unless you provide the code where the variables sfreqh, populations, and matrix_hap are being assigned

ArtPoon commented 1 year ago

calculate statistics for each bootstrap tree instead of the consensus tree

PoonLab / covizu

Calculate nucleotide diversity during CoVizu pipeline #452