Open Dahn-YoungDong opened 3 years ago
I am using a small vcf dataset with 60mb, you can clone the repo and should be able to run 1.prep, and then come to my question 2.filter line 20-28
I was able to reproduce the error.
> #B. From dataset with all individuals
> snps_fil_hwe_high <- Filter(snps_unind, filterOptions(minQ=30, max.missing = 0.7, maf=0.05, min.meanDP=20, max.meanDP=200, hwe=0.0001))
VCFtools - 0.1.16
(C) Adam Auton and Anthony Marcketta 2009
Parameters as interpreted:
--vcf /tmp/Rtmp6OKUzR/vcfLink_filede13332c973
--maf 0.05
--max-alleles 2
--max-meanDP 200
--hwe 0.0001
--min-meanDP 20
--minQ 30
--max-missing 0.7
--out /tmp/Rtmp6OKUzR/vcfLink_filede1363dbb43
--recode
After filtering, kept 16 out of 16 Individuals
Outputting VCF file...
After filtering, kept 0 out of a possible 75609 Sites
No data left for analysis!
Run Time = 0.00 seconds
But the issue looks to be upstream of that, step A returns 0 snps.
> #A. From dataset with low missing by individuals
> snps_fil_hwe_low <- Filter(snps_ind_pc, filterOptions(minQ=30, max.missing = 0.7, maf=0.05, min.meanDP=20, max.meanDP=200, hwe=0.0001))
VCFtools - 0.1.16
(C) Adam Auton and Anthony Marcketta 2009
Parameters as interpreted:
--vcf /tmp/Rtmp6OKUzR/vcfLink_filede13332c973
--maf 0.05
--max-alleles 2
--max-meanDP 200
--hwe 0.0001
--min-meanDP 20
--minQ 30
--max-missing 0.7
--out /tmp/Rtmp6OKUzR/vcfLink_filede12bf2fc04
--recode
After filtering, kept 16 out of 16 Individuals
Outputting VCF file...
After filtering, kept 0 out of a possible 75609 Sites
No data left for analysis!
Run Time = 0.00 seconds
> VCFsummary(snps_fil_hwe_low) #16 individuals and 0 SNPs.
16 individuals and 0 SNPs.
I played with the filter settings a bit. Retaining only minQ
is sufficient to reject the reads. Lowering it to 0 retains all the reads. So a setting of 30 was rejecting all the reads. You should figure out what quality score distribution you have and pick a sensible value based on that.
cc @lemellenthin
> snps_fil_hwe_low <- Filter(snps_ind_pc, filterOptions(minQ=30, max.missing = 0.7, maf=0.05, min.meanDP=20, max.meanDP=200, hwe=0.0001))
VCFtools - 0.1.16
(C) Adam Auton and Anthony Marcketta 2009
Parameters as interpreted:
--vcf /tmp/Rtmp6OKUzR/vcfLink_filede13332c973
--maf 0.05
--max-alleles 2
--max-meanDP 200
--hwe 0.0001
--min-meanDP 20
--minQ 30
--max-missing 0.7
--out /tmp/Rtmp6OKUzR/vcfLink_filede16538a43a
--recode
After filtering, kept 16 out of 16 Individuals
Outputting VCF file...
After filtering, kept 0 out of a possible 75609 Sites
No data left for analysis!
Run Time = 0.00 seconds
> snps_fil_hwe_low <- Filter(snps_ind_pc, filterOptions(minQ=30))
VCFtools - 0.1.16
(C) Adam Auton and Anthony Marcketta 2009
Parameters as interpreted:
--vcf /tmp/Rtmp6OKUzR/vcfLink_filede13332c973
--minQ 30
--out /tmp/Rtmp6OKUzR/vcfLink_filede16fdc684a
--recode
After filtering, kept 16 out of 16 Individuals
Outputting VCF file...
After filtering, kept 0 out of a possible 75609 Sites
No data left for analysis!
Run Time = 1.00 seconds
> snps_fil_hwe_low <- Filter(snps_ind_pc, filterOptions(minQ=20))
VCFtools - 0.1.16
(C) Adam Auton and Anthony Marcketta 2009
Parameters as interpreted:
--vcf /tmp/Rtmp6OKUzR/vcfLink_filede13332c973
--minQ 20
--out /tmp/Rtmp6OKUzR/vcfLink_filede167a37ff2
--recode
After filtering, kept 16 out of 16 Individuals
Outputting VCF file...
After filtering, kept 0 out of a possible 75609 Sites
No data left for analysis!
Run Time = 0.00 seconds
> snps_fil_hwe_low <- Filter(snps_ind_pc, filterOptions(minQ=0))
VCFtools - 0.1.16
(C) Adam Auton and Anthony Marcketta 2009
Parameters as interpreted:
--vcf /tmp/Rtmp6OKUzR/vcfLink_filede13332c973
--minQ 0
--out /tmp/Rtmp6OKUzR/vcfLink_filede13e09842f
--recode
After filtering, kept 16 out of 16 Individuals
Outputting VCF file...
After filtering, kept 75609 out of a possible 75609 Sites
Run Time = 0.00 seconds
All the quality scores are set to 13. So it appears that there is essentially no quality information and that a default value has been entered at some point:
> q = Query(snps_ind_pc, c("site-quality"))
> min(q$QUAL)
[1] 13
> max(q$QUAL)
[1] 13
I suggest removing quality filtering, eg:
snps_fil_hwe_low <- Filter(snps_ind_pc, filterOptions(max.missing = 0.7, maf=0.05, min.meanDP=20, max.meanDP=200, hwe=0.0001))
Thank you, I will check it out. Happy Thanksgiving
On Thu, Nov 26, 2020 at 8:44 AM Casey Dunn notifications@github.com wrote:
All the quality scores are set to 13. So it appears that there is essentially no quality information and that a default value has been entered at some point:
q = Query(snps_ind_pc, c("site-quality")) min(q$QUAL) [1] 13 max(q$QUAL) [1] 13
I suggest removing quality filtering, eg:
snps_fil_hwe_low <- Filter(snps_ind_pc, filterOptions(max.missing = 0.7, maf=0.05, min.meanDP=20, max.meanDP=200, hwe=0.0001))
— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/dennisdong123/finalproject/issues/4#issuecomment-734336624, or unsubscribe https://github.com/notifications/unsubscribe-auth/ANUSUW5UPIHMELOQDGWOTALSRZSWLANCNFSM4UBWC27Q .
-- Dahn-Young (Young) Dong [image: A button with "Hear my name" text for name playback in email signature] https://www.name-coach.com/danyang-dong
https://github.com/dennisdong123/finalproject/blob/73247ca5ffe147d26a595a970145d460035a29e6/2.filter.Rmd#L12-L28
I am closely following https://github.com/jdalapicolla/LanGen_pipeline_version2 But my data after the filter show no data left. Why? how do it make the script work?