Dahn-YoungDong / finalproject

0 stars 1 forks source link

filtering neutral loci issue #4

Open Dahn-YoungDong opened 3 years ago

Dahn-YoungDong commented 3 years ago

https://github.com/dennisdong123/finalproject/blob/73247ca5ffe147d26a595a970145d460035a29e6/2.filter.Rmd#L12-L28

I am closely following https://github.com/jdalapicolla/LanGen_pipeline_version2 But my data after the filter show no data left. Why? how do it make the script work?

Dahn-YoungDong commented 3 years ago

I am using a small vcf dataset with 60mb, you can clone the repo and should be able to run 1.prep, and then come to my question 2.filter line 20-28

caseywdunn commented 3 years ago

I was able to reproduce the error.

> #B. From dataset with all individuals
> snps_fil_hwe_high <- Filter(snps_unind, filterOptions(minQ=30, max.missing = 0.7, maf=0.05, min.meanDP=20, max.meanDP=200, hwe=0.0001)) 

VCFtools - 0.1.16
(C) Adam Auton and Anthony Marcketta 2009

Parameters as interpreted:
    --vcf /tmp/Rtmp6OKUzR/vcfLink_filede13332c973
    --maf 0.05
    --max-alleles 2
    --max-meanDP 200
    --hwe 0.0001
    --min-meanDP 20
    --minQ 30
    --max-missing 0.7
    --out /tmp/Rtmp6OKUzR/vcfLink_filede1363dbb43
    --recode

After filtering, kept 16 out of 16 Individuals
Outputting VCF file...
After filtering, kept 0 out of a possible 75609 Sites
No data left for analysis!
Run Time = 0.00 seconds
caseywdunn commented 3 years ago

But the issue looks to be upstream of that, step A returns 0 snps.

> #A. From dataset with low missing by individuals
> snps_fil_hwe_low <- Filter(snps_ind_pc, filterOptions(minQ=30, max.missing = 0.7, maf=0.05, min.meanDP=20, max.meanDP=200, hwe=0.0001)) 

VCFtools - 0.1.16
(C) Adam Auton and Anthony Marcketta 2009

Parameters as interpreted:
    --vcf /tmp/Rtmp6OKUzR/vcfLink_filede13332c973
    --maf 0.05
    --max-alleles 2
    --max-meanDP 200
    --hwe 0.0001
    --min-meanDP 20
    --minQ 30
    --max-missing 0.7
    --out /tmp/Rtmp6OKUzR/vcfLink_filede12bf2fc04
    --recode

After filtering, kept 16 out of 16 Individuals
Outputting VCF file...
After filtering, kept 0 out of a possible 75609 Sites
No data left for analysis!
Run Time = 0.00 seconds
> VCFsummary(snps_fil_hwe_low) #16 individuals and 0 SNPs.
16 individuals and 0 SNPs.
caseywdunn commented 3 years ago

I played with the filter settings a bit. Retaining only minQ is sufficient to reject the reads. Lowering it to 0 retains all the reads. So a setting of 30 was rejecting all the reads. You should figure out what quality score distribution you have and pick a sensible value based on that.

cc @lemellenthin

> snps_fil_hwe_low <- Filter(snps_ind_pc, filterOptions(minQ=30, max.missing = 0.7, maf=0.05, min.meanDP=20, max.meanDP=200, hwe=0.0001))

VCFtools - 0.1.16
(C) Adam Auton and Anthony Marcketta 2009

Parameters as interpreted:
    --vcf /tmp/Rtmp6OKUzR/vcfLink_filede13332c973
    --maf 0.05
    --max-alleles 2
    --max-meanDP 200
    --hwe 0.0001
    --min-meanDP 20
    --minQ 30
    --max-missing 0.7
    --out /tmp/Rtmp6OKUzR/vcfLink_filede16538a43a
    --recode

After filtering, kept 16 out of 16 Individuals
Outputting VCF file...
After filtering, kept 0 out of a possible 75609 Sites
No data left for analysis!
Run Time = 0.00 seconds
> snps_fil_hwe_low <- Filter(snps_ind_pc, filterOptions(minQ=30))

VCFtools - 0.1.16
(C) Adam Auton and Anthony Marcketta 2009

Parameters as interpreted:
    --vcf /tmp/Rtmp6OKUzR/vcfLink_filede13332c973
    --minQ 30
    --out /tmp/Rtmp6OKUzR/vcfLink_filede16fdc684a
    --recode

After filtering, kept 16 out of 16 Individuals
Outputting VCF file...
After filtering, kept 0 out of a possible 75609 Sites
No data left for analysis!
Run Time = 1.00 seconds
> snps_fil_hwe_low <- Filter(snps_ind_pc, filterOptions(minQ=20))

VCFtools - 0.1.16
(C) Adam Auton and Anthony Marcketta 2009

Parameters as interpreted:
    --vcf /tmp/Rtmp6OKUzR/vcfLink_filede13332c973
    --minQ 20
    --out /tmp/Rtmp6OKUzR/vcfLink_filede167a37ff2
    --recode

After filtering, kept 16 out of 16 Individuals
Outputting VCF file...
After filtering, kept 0 out of a possible 75609 Sites
No data left for analysis!
Run Time = 0.00 seconds
> snps_fil_hwe_low <- Filter(snps_ind_pc, filterOptions(minQ=0))

VCFtools - 0.1.16
(C) Adam Auton and Anthony Marcketta 2009

Parameters as interpreted:
    --vcf /tmp/Rtmp6OKUzR/vcfLink_filede13332c973
    --minQ 0
    --out /tmp/Rtmp6OKUzR/vcfLink_filede13e09842f
    --recode

After filtering, kept 16 out of 16 Individuals
Outputting VCF file...
After filtering, kept 75609 out of a possible 75609 Sites
Run Time = 0.00 seconds
caseywdunn commented 3 years ago

All the quality scores are set to 13. So it appears that there is essentially no quality information and that a default value has been entered at some point:

> q = Query(snps_ind_pc, c("site-quality"))
> min(q$QUAL)
[1] 13
> max(q$QUAL)
[1] 13

I suggest removing quality filtering, eg:

snps_fil_hwe_low <- Filter(snps_ind_pc, filterOptions(max.missing = 0.7, maf=0.05, min.meanDP=20, max.meanDP=200, hwe=0.0001)) 
Dahn-YoungDong commented 3 years ago

Thank you, I will check it out. Happy Thanksgiving

On Thu, Nov 26, 2020 at 8:44 AM Casey Dunn notifications@github.com wrote:

All the quality scores are set to 13. So it appears that there is essentially no quality information and that a default value has been entered at some point:

q = Query(snps_ind_pc, c("site-quality")) min(q$QUAL) [1] 13 max(q$QUAL) [1] 13

I suggest removing quality filtering, eg:

snps_fil_hwe_low <- Filter(snps_ind_pc, filterOptions(max.missing = 0.7, maf=0.05, min.meanDP=20, max.meanDP=200, hwe=0.0001))

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/dennisdong123/finalproject/issues/4#issuecomment-734336624, or unsubscribe https://github.com/notifications/unsubscribe-auth/ANUSUW5UPIHMELOQDGWOTALSRZSWLANCNFSM4UBWC27Q .

-- Dahn-Young (Young) Dong [image: A button with "Hear my name" text for name playback in email signature] https://www.name-coach.com/danyang-dong