ANGSD / angsd

Program for analysing NGS data.

setMaxDepth not working? #144

Closed mlucenaperez closed 5 years ago

mlucenaperez commented 6 years ago

Dear developers,

I am working with ANGSD (version: 0.917-123-gf8b7a92 (htslib: 1.4.1-16-g499245e)), doing some tests on the effect of missing data on the inferences. I compute the per-site coverage of the samples I am interested in by doing:

samtools depth $sample > $sample.covpersite
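The per-site coverage file can then be summarized to get the mean depth used for the thresholds below. A minimal sketch (`sample1.covpersite` is a hypothetical placeholder for `$sample.covpersite`):

```shell
# summarize a per-site coverage file (samtools depth columns: chrom, pos, depth)
# "sample1.covpersite" is a hypothetical stand-in for $sample.covpersite
awk '{ sum += $3; n++ } END { if (n) printf "sites=%d meanDepth=%.2f\n", n, sum/n }' sample1.covpersite
```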

And then join the results with my data of diversity per site coming out of ANGSD for a particular region.

For the diversity calculation, I used these depth filters:

maxDepth = mean_Depth + (0.95 * mean_Depth) = 127.192126110922
minDepth = mean_Depth - (0.95 * mean_Depth) = 3.26133656694672
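These two thresholds are consistent with a mean depth of roughly 65.23x. A quick check of the arithmetic (the mean itself is back-calculated here, not a value stated above):

```shell
# back-calculate the mean from maxDepth = 1.95 * mean, then verify both thresholds
awk 'BEGIN {
  mean = 127.192126110922 / 1.95
  printf "mean=%.4f max=%.4f min=%.4f\n", mean, mean * 1.95, mean * 0.05
}'
```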

I have 12 individuals in my population, and I set -minInd 6.

My code is:

SAF likelihood:

/opt/angsd/angsd/angsd -P 1 -b POP.bamlist -ref my_ref.fa -anc my_anc.fa -out output_name -uniqueOnly 1 -remove_bads 1 -only_proper_pairs 1 -baq 1 -C 50 -minMapQ 30 -minQ 20 -doCounts 1 -GL 2 -doSaf 1 -rf my_rf_file.rf -minInd 6 -setMaxDepth 127.192126110922 -setMinDepth 3.26133656694672

Then I calculate SFS, and:

SAF posterior probabilities:

/opt/angsd/angsd/angsd -P 1 -b POP.bamlist -ref my_ref.fa -anc my_anc.fa -out output_name.postprob -uniqueOnly 1 -remove_bads 1 -only_proper_pairs 1 -baq 1 -C 50 -minMapQ 30 -minQ 20 -doCounts 1 -GL 2 -doSaf 1 -rf my_rf_file.rf -minInd 6 -setMaxDepth 127.192126110922 -doThetas 1 -pest my_sfs_file.sfs

I did: $NGSTOOLS/thetaStat print output_name.postprob.thetas.idx > printed.stats

I made a left join in R of that printed.stats file with my $sample.covpersite.
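For reference, the same left join can be sketched in the shell, keyed on chromosome:position. This assumes (hypothetically) that both files start with chrom and pos columns; file names are stand-ins for the ones described above:

```shell
# key both files on chrom:pos, then left-join (-a 1 keeps unmatched stats rows)
# "printed.stats" and "sample1.covpersite" stand in for the files described above
awk '{ print $1":"$2, $0 }' printed.stats      | sort -k1,1 > stats.keyed
awk '{ print $1":"$2, $3 }' sample1.covpersite | sort -k1,1 > cov.keyed
join -a 1 stats.keyed cov.keyed > stats_with_cov.txt
```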

I discovered that many positions had a total coverage above my setMaxDepth. The maximum is a total of ~16000 reads for a given position in the population, with a minimum per individual of ~800 and a maximum of ~3000.
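The total population depth that -setMaxDepth is compared against can also be computed in one pass, by giving samtools depth the whole BAM list and summing the per-sample columns. A sketch (note this counts raw depth, before any of the ANGSD read filters):

```shell
# samtools depth -f takes a file listing BAM paths; columns 3..NF are per-sample depths
samtools depth -f POP.bamlist \
  | awk '{ tot = 0; for (i = 3; i <= NF; i++) tot += $i; print $1, $2, tot }' \
  > pop.totdepth
```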

Both -minInd and -setMinDepth seem to filter properly. I ran the exact same test on a couple of other populations with the same results.

Any idea? The only way I can see this happening is if ANGSD first removes the reads discarded by -uniqueOnly 1 -remove_bads 1 -only_proper_pairs 1 -baq 1 -C 50, then counts how many reads are left and applies the depth filter only to those. In my opinion that would be confusing, and it might leave poor-quality positions in the calculation.

ANGSD commented 5 years ago

Dear @mlucenaperez, sorry for the delay; this issue got lost in the long list of other issues. Regarding your problem, I don't observe the same issue; see these commands.

# all sites
./angsd -b list -r 1 -anc ../hs37d5.fa -dosaf 1 -docounts 1 -gl 1 -dumpcounts 1 -out full
# limit to sites with total depth <= 200
./angsd -b list -r 1 -anc ../hs37d5.fa -dosaf 1 -docounts 1 -gl 1 -dumpcounts 1 -out 200 -setmaxdepth 200

# in R
depth <- read.table("full.pos.gz", he=T)
range(depth$totDepth)
# [1]   1 298
saf.pos <- as.numeric(system("misc/realSFS print 200.saf.idx|cut -f2", intern=T))
range(depth$totDepth[match(saf.pos, depth$pos)])
# [1]   1 200

I'm closing this issue; feel free to reopen if needed.