brentp / somalier

fast sample-swap and relatedness checks on BAMs/CRAMs/VCFs/GVCFs... "like damn that is one smart wine guy"
MIT License
254 stars 35 forks source link

Error with find-sites #96

Open ashotmarg opened 2 years ago

ashotmarg commented 2 years ago

Hi Brent,

Thanks for Somalier, it's quite fast and straightforward to use! At the moment I am also trying to extract sites more tailored to my dataset with "find-sites". While it worked fine for one of my vcf files, it failed for another, which was produced from a different pipeline with slightly different INFO and FORMAT fields.

For the failed one I am getting the error below: I am running somalier from it's singularity image and this is just a chr21 extract of my main vcf.


$ singularity exec ~/bin/somalier.sif somalier find-sites --min-AN=2 --min-AF=0.2 --snp-dist=10 test.vcf INFO: Converting SIF file to temporary sandbox... somalier version: 0.2.15 on chrom:chr21 290 candidate variants tables.nim(262) [] Error: unhandled exception: key not found: chr21 [KeyError] INFO: Cleaning up image...


Could you elaborate for which fields in the vcf is "find-sites" looking for? My vcf seems to have the relevant ones such as AC, AF, AN... E.g., INFO and FORMAT fields: AC=2;AF=0.333;AN=6;DP=16;ExcessHet=1.0474;FS=0;MLEAC=1;MLEAF=0.167;QD=22;SOR=2.303 GT:AD:DP:GQ:PL

Thanks Ashot

brentp commented 2 years ago

Hi, thanks for reporting. Will you give it a try with this binary? It will give us more information about exactly where the error is occuring

somalier_dbg.gz

brentp commented 2 years ago

My apologies. I didn't overwrite a previous somalier_dbg I had. Please try this one. somalier_dbg.gz

ashotmarg commented 2 years ago

Thanks for the prompt feedback! Here is what I get with the whole vcf file.

$ ./somalier_dbg find-sites batch1.Sentieon.biSNPs.vcf.gz somalier version: 0.2.15 on chrom:chr1 on chrom:chr2 on chrom:chr3 on chrom:chr4 on chrom:chr5 on chrom:chr6 on chrom:chr7 on chrom:chr8 on chrom:chr9 on chrom:chr10 on chrom:chr11 on chrom:chr12 on chrom:chr13 on chrom:chr14 on chrom:chr15 on chrom:chr16 on chrom:chr17 on chrom:chr18 on chrom:chr19 on chrom:chr20 on chrom:chr21 on chrom:chr22 on chrom:chrX on chrom:chrY on chrom:chrM 160251 candidate variants /home/brentp/src/somalier/src/somalier.nim(266) somalier /home/brentp/src/somalier/src/somalier.nim(253) main /home/brentp/src/somalier/src/somalierpkg/findsites.nim(249) findsites_main /nim/lib/pure/collections/tables.nim(858) [] /nim/lib/pure/collections/tables.nim(262) [] Error: unhandled exception: key not found: chrX [KeyError]


Thanks in advance!

brentp commented 2 years ago

I see the problem. It is fixed in this binary somalier_dbg2.gz

I was assuming that the table would have entries for each chromosome which is obviously not going to happen in every case, including yours.

ashotmarg commented 2 years ago

So now the error message is gone but not sure it's working properly, most likely these things were also present in my original run, I just didn't check thoroughly. I seem to have SNPs which are not far (10000 bp) apart from each other after giving "--snp-dist=10000", there are many cases like this below. In addition I don't have any SNPs from e.g., chr8, and I know that there are many SNPs matching the parameters --min-AN=1 --min-AF=0.4 --snp-dist=10000 Also, in the sites.vcf.gz the program is still giving output from chrX/Y as well but they are not filtered...

chr7 4620726 . A C 100 . AC=3;AF=0.5 chr7 4620753 . T G 100 . AC=3;AF=0.5 chr7 4625241 . A G 100 . AC=3;AF=0.5 chrX 2781514 . C A 100 . AC=3;AF=0.5 chrX 2781635 . G A 100 . AC=1;AF=0.167 chrX 2781927 . A G 100 . AC=3;AF=0.5 chrX 2781986 . T C 100 . AC=3;AF=0.5 chrX 2782161 . A G 100 . AC=1;AF=0.167 chrX 2782261 . T A 100 . AC=1;AF=0.167 chrX 2782567 . C T 100 . AC=2;AF=0.333 chrX 2782572 . A G 100 . AC=3;AF=0.5

Here is the run command/output:


$ ./somalier_dbg2 find-sites --min-AN=1 --min-AF=0.4 --snp-dist=10000 batch1.Sentieon.biSNPs.vcf.gz somalier version: 0.2.15 on chrom:chr1 on chrom:chr2 on chrom:chr3 on chrom:chr4 on chrom:chr5 on chrom:chr6 on chrom:chr7 on chrom:chr8 on chrom:chr9 on chrom:chr10 on chrom:chr11 on chrom:chr12 on chrom:chr13 on chrom:chr14 on chrom:chr15 on chrom:chr16 on chrom:chr17 on chrom:chr18 on chrom:chr19 on chrom:chr20 on chrom:chr21 on chrom:chr22 on chrom:chrX on chrom:chrY on chrom:chrM 319120 candidate variants sorted and filtered to 152805 variants. now dropping INFOs and writing [somalier] wrote 65535 variants to:sites.vcf.gz


As a more general approach, as an alternative, I can also e.g., filter my vcf file for AC values I choose and then use "plink --indep" to remove linked sites, right?

ashotmarg commented 2 years ago

I will also try to just use a vcf with only autosomes, not sure if the sex chromosomes are interfering with the overall calculations.

brentp commented 2 years ago

Hi, sorry for your trouble. Don't use plink to remove sites, somalier should do that for you. Also, only the autosomes are used in the relatedness calculations, but the X and Y are used to validate sex. I see that I had some uncommitted changes in addition to the fix. Those previous changes are causing the problem you see. I think I see the easy fix, but want to try to verify so you're not running my debug cycle for me. If you give me a few hours, I think this will be resolved.

ashotmarg commented 2 years ago

Thanks Brent, no worries and sounds good ;)

brentp commented 2 years ago

OK. I think this will work correctly for you. I would lower the snp-dist and the min-af if you need to get more sites. somalier_dbg3.gz

ashotmarg commented 2 years ago

Thanks, we are getting close but probably not fully there yet! Please see the output below, in short I think we still have all the unfiltered sex chromosomes. I.e., the output should only have ca. 26000 sites, but there are ca. 176000 sites in the sites.vcf.gz. In total we have ca. 150000 sex chromosomes and they are unfiltered.


chrX 2782161 . A G 100 . AC=1;AF=0.167 chrX 2782261 . T A 100 . AC=1;AF=0.167



$ ./somalier_dbg3 find-sites --min-AN=1 --min-AF=0.4 --snp-dist=50000 batch1.Sentieon.biSNPs.vcf.gz somalier version: 0.2.16 on chrom:chr1 on chrom:chr2 on chrom:chr3 on chrom:chr4 on chrom:chr5 on chrom:chr6 on chrom:chr7 on chrom:chr8 on chrom:chr9 on chrom:chr10 on chrom:chr11 on chrom:chr12 on chrom:chr13 on chrom:chr14 on chrom:chr15 on chrom:chr16 on chrom:chr17 on chrom:chr18 on chrom:chr19 on chrom:chr20 on chrom:chr21 on chrom:chr22 on chrom:chrX on chrom:chrY on chrom:chrM 319120 candidate variants sorted and filtered to 27062 variants. now dropping INFOs and writing [somalier] wrote 27062 variants to:sites.vcf.gz [ashmar@i-07-l0003 results.batch1]$ less sites.vcf.gz |grep -v "^#"|wc 176682 1413456 7003099 [ashmar@i-07-l0003 results.batch1]$ less sites.vcf.gz |grep "^chrX"|wc 134008 1072064 5350439 [ashmar@i-07-l0003 results.batch1]$ less sites.vcf.gz |grep "^chrY"|wc 15612 124896 606859 [ashmar@i-07-l0003 results.batch1]$ less sites.vcf.gz |grep "^chr1"|wc 12048 96384 469129


brentp commented 2 years ago

Hi, it's intentionally getting more values on X to evaluate sex. This is too many, in your case, and I'll make a change for that, but it will not affect your relatedness calculations.

I'll update this message:

sorted and filtered to 27062 variants. now dropping INFOs and writing

to indicate autosomal.

Would these changes address your concerns?

brentp commented 2 years ago

The other thing to clarify is that the sex chromosome have different filtering because it's only for quality and linkage doesn't matter.

brentp commented 2 years ago

And here's another binary with those changes in case you want to try it. It only updates the message and caps the number of X and y variants it will use. somalier_dbg4.gz

ashotmarg commented 2 years ago

Super, thanks a lot Brent for the prompt response, I think it looks great! Yes, the X/Y chromosomes are now also filtered for quality.

ashotmarg commented 1 year ago

Hi Brent, I've been using the binary of the Somalier (somalier_dbg4.gz) that you shared with me (in this thread above), and it's the Somalier's version: 0.2.16. However, I realised that the latest version in the releases and the docker file is actually the v0.2.15. I know that there are only minor changes in the versions, but I was wondering whether you will also push that latest v0.2.16 to docker and github? Thanks, Ashot