Open SHuang-Broad opened 1 year ago
Hi, I see what you mean. Can you run somalier relate
with --min-depth 5
and let me know if the result matches more what you expect? With a cutoff of 7 for a sample with mean depth of 10, I suspect many lower depths are getting skipped and this shifts the mean depth of (what I call) genotyped sites up. Indeed, this is a problem, but should be easily fixed.
I actually used --min-depth 4
first, and here's the table
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- family | 1557026 | paternal | maternal | 1 | -9 | male | 14.7 | 6.7 | 14.6 | 6.8 | 0.53 | 0.45 | 4948 | 3177 | 6668 | 2591 | 0.013 | 10.54 | 178 | 83 | 0 | 95 | 11.57 | 7
You can see the two fields are similar to the table I shared above.
Indeed, that rules out my guess. Is there any chance that you can share the extracted .somalier file?
I think it is odd that there are so many n_hom_alt
,
--mapq 1
to re check depth? (default is to count mapping quality 0 reads)Somalier is tuned for higher coverage samples, but I'm not seeing anything else in your output or in the code that would stop it from working on your samples.
I will also check that the cigar stuff is correct. It could be that it's close enough for illumina reads, but for long reads, whatever bug is triggered enough to be a problem.
Thanks for this list of detailed suggestions. Here are my answers:
somalier
on minimap2 generated BAMs.v0.2.15
, and the docker version is the same.mosdepth
command: mosdepth -x -n -Q1 ~{prefix} ~{bam}
We also ran DeepVariant on this sample.
Running the following to count PASSing SNPs, we've found 1.76 million HOM_ALTs and 3.03 million HETs overall, which is not the same het/hom_alt ratio as in the somalier
tables.
zcat < 1557026.deepvariant_pepper.vcf.gz | grep -v "^#" | grep -F 'PASS' | awk 'BEGIN {OFS="\t"} {a=length($4); b=length($5); if (a==1 && b==1) print}' | grep -c "1/1"
Regarding the .somalier file, I'm afraid I cannot share this as the samples are protected. If you can pin down which part of the file you'd be interested in, I can probably share that part.
Thanks for jumping on this so quickly! Steve
Thanks for the clear report. It seems likely that there's something with somalier. I expect it's easily fixed, but I'll have to find the problem. I'll check for a few long-read samples to run it on and see if I can replicate.
If you have a multi-sample DeepVariant VCF with this sample, I'd be interested to see if those results are better. If it is single-sample, there are other problems with somalier not having enough info about hom-ref.
Unfortunately, it's not a joint VCF yet.
Can you elaborate on this, I'm not quite getting the point:
there are other problems with somalier not having enough info about hom-ref
Can you elaborate on this, I'm not quite getting the point:
there are other problems with somalier not having enough info about hom-ref
Actually, that's just for relatedness stuff, not for single-sample information, so you could try running on the single-sample VCF file.
Hi Brent,
Sorry for dropping the ball on this. I followed your advice and ran with the single sample DeepVariant VCF, and here's the table, which give the same inference
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- family | 1557026 | paternal | maternal | 1 | -9 | male | 11.2 | 3.0 | 5.7 | 5.8 | 0.33 | 0.52 | 2136 | 3849 | 1444 | 9955 | 0.002 | 8.16 | 37 | 16 | 0 | 21 | 9.00 | 2
Changing the depth requirement to 4 didn't change things materially. Both, unsurprisingly, generate a HET to HOM_ALT ratio closer to the WGS ratio (duh).
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- family | 1557026 | paternal | maternal | 1 | -9 | male | 10.5 | 3.4 | 5.7 | 5.8 | 0.44 | 0.37 | 2483 | 4263 | 1642 | 8996 | 0.002 | 8.16 | 37 | 16 | 0 | 21 | 9.00 | 2
Thanks, Steve
I'm really confused about what's going on. This is showing a much lower depth from DeepVariant calls. The ratio of homalts is dramatically reduced from DV indicating there could be a problem with the somalier genotyping on long reads. But, the ratio of autosomal to chrX, chrY depth is relatively unchanged. So, I think two things are true:
I need to re-run mosdepth
on that sample to get the plots, but here's the summary from the previous run
chrom length bases mean min max
chr1 248956422 2760787872 11.09 0 7801
chr2 242193529 2618785137 10.81 0 361
chr3 198295559 2156666022 10.88 0 1947
chr4 190214555 2199934962 11.57 0 5891
chr5 181538259 1954884959 10.77 0 1477
chr6 170805979 1868390025 10.94 0 259
chr7 159345973 1725551069 10.83 0 891
chr8 145138636 1584893778 10.92 0 642
chr9 138394717 1306272197 9.44 0 199
chr10 133797422 1514116404 11.32 0 1229
chr11 135086622 1467878253 10.87 0 90
chr12 133275309 1422786382 10.68 0 627
chr13 114364328 1138077672 9.95 0 953
chr14 107043718 957978895 8.95 0 91
chr15 101991189 894579432 8.77 0 112
chr16 90338345 944218678 10.45 0 4696
chr17 83257441 870271120 10.45 0 1345
chr18 80373285 875065734 10.89 0 121
chr19 58617616 538146550 9.18 0 94
chr20 64444167 718687371 11.15 0 369
chr21 46709983 433343387 9.28 0 374
chr22 50818468 402498377 7.92 0 1216
chrX 156040895 832630526 5.34 0 155
chrY 57227415 225360913 3.94 0 2369
It doesn't look like it's trisomy.
So, mosdepth shows about the expected coverage ratios for a male, though the X is a bit high.
DeepVariant+somalier does not so it's not a problem to specific to how somalier is counting. So, perhaps there is something about the sites in the autosome chosen by somalier that are biased to have lower coverage, or the sites on chrX or chrY are biased to have higher coverage in CCS. The autosome is entirely coding sites, but the sites on chrX and chrY are any common sites, not necessarily coding.
Somalier does use sites near start of chromosomes X and Y, so maybe there is high coverage near the telomeres?
@zeeev, are there cases where CCS has bias to lower coverage in high-GC (in genes)?
So, perhaps there is something about the sites in the autosome chosen by somalier that are biased to have lower coverage
I actually checked the coverage (with samtools depth
) on these SNP sites (the released VCF), the mean coverage comes back to be 10 as well.
Somalier does use sites near start of chromosomes X and Y, so maybe there is high coverage near the telomeres?
Are PARs excluded in these sites?
Based on the table posted in my previous ticket #105 (though the table posted there is malformed, corrected below)
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- chs | UnnamedSample | 0 | 0 | 2 | 0 | female | 26.6 | 10.9 | 26.5 | 11.0 | 0.51 | 0.43 | 4964 | 3826 | 6122 | 2472 | 0.010 | 24.57 | 324 | 115 | 74 | 135 | 0.00 | 0 |
The coverage estimation doesn't look off for that sample (~30X), yet it's the same scenario where HOM.ALT calls are way more than HET calls.
Note this is HG00514, a female sample. Is this related to the sex of the sample?
So I checked the results on two more protected samples, one female and one male, both around 10X. Interestingly, the two male samples are seeing the inflated coverage (ratio 1.5), but not for the female sample.
This is a male sample, the bam is released by GIAB/NIST https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/HG002_NA24385_son/PacBio_CCS_15kb_20kb_chemistry2/GRCh38/
and mosdepth gives the following coverage summary
chr1 248956422 13969946903 56.11 0 28904
chr2 242193529 13700025724 56.57 0 2582
chr3 198295559 11329262682 57.13 0 16236
chr4 190214555 11662996790 61.31 0 40877
chr5 181538259 10261322700 56.52 0 9964
chr6 170805979 9861766821 57.74 0 7307
chr7 159345973 8824929365 55.38 0 1078
chr8 145138636 8252258159 56.86 0 2810
chr9 138394717 6700195676 48.41 0 719
chr10 133797422 7677494877 57.38 0 6042
chr11 135086622 7504549211 55.55 0 984
chr12 133275309 7441148334 55.83 0 668
chr13 114364328 5793614946 50.66 0 1978
chr14 107043718 5054143320 47.22 0 1580
chr15 101991189 4655047530 45.64 0 661
chr16 90338345 4784597624 52.96 0 27083
chr17 83257441 4376055735 52.56 0 1802
chr18 80373285 4457796186 55.46 0 647
chr19 58617616 2746694349 46.86 0 734
chr20 64444167 3634695784 56.40 0 3900
chr21 46709983 2339692268 50.09 0 3092
chr22 50818468 2052775521 40.39 0 26160
chrX 156040895 4369507111 28.00 0 305
chrY 57227415 1323232774 23.12 0 18494
...
total 3099844859 164920005749 53.20 0 73360
What I find is somewhat following this "sex-bias" trend (if we can call it that way). The scaled depth on X and Y are—correctly—both around 1. But look at the table below
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- family | HG002 | paternal | maternal | 1 | -9 | male | 75.7 | 25.3 | 75.7 | 25.3 | 0.51 | 0.40 | 4553 | 5112 | 5627 | 2092 | 0.004 | 36.66 | 353 | 170 | 0 | 183 | 37.73 | 15
Here's the somalier file for HG002
This is really puzzling now.
:thinking:
I think there is some bias with higher coverage near start of the chromosomes. I will look into this next week. Thanks very much for following up. Should have a fix soon. Can you verify which sites file you used for somalier?
It's this on the release page
https://github.com/brentp/somalier/files/3412456/sites.hg38.vcf.gz
I should also mention that we are using the no_alt version of GRCh38
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz
And thank you for following through on this! Steve
you could also try: https://github.com/brentp/somalier/files/4566475/sites.hg38.rna.vcf.gz but I will have a look at this problem.
I had a look and didnt find anything obvious, but I'll look again next week. I rewrote the pileup and got the same answer (and verified it was correct), so I don't think that's the issue.
Thanks, Brent!
I'm a little confused by this
I rewrote the pileup and got the same answer (and verified it was correct)
Could you also try your new code on the GIAB-HG002 and see? https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/HG002_NA24385_son/PacBio_CCS_15kb_20kb_chemistry2/GRCh38/
I'm observing inflated coverage estimation on that BAM by somalier
(~1.5X inflation, see above).
Thanks! Steve
I mean that it's counting alleles correctly, but I can see the issue you mention. So I'm still confused.
Hi,
As we continue to test out
somalier
for QC on our long reads data (currently for sex check), we are seeing something a bit strange.The sample we ran this on has an estimated coverage of 10 by
mosdepth
(single CCS SMRT cell sample). However, as you can see from the table below, thegt_depth_mean
anddepth_mean
are estimated to be approximately 1.5 of that,family_id | sample_id | paternal_id | maternal_id | sex | phenotype | original_pedigree_sex | gt_depth_mean | gt_depth_sd | depth_mean | depth_sd | ab_mean | ab_std | n_hom_ref | n_het | n_hom_alt | n_unknown | p_middling_ab | X_depth_mean | X_n | X_hom_ref | X_het | X_hom_alt | Y_depth_mean | Y_n
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- family | 1557026 | paternal | maternal | 1 | -9 | male | 15.3 | 6.4 | 14.6 | 6.8 | 0.43 | 0.56 | 4601 | 3045 | 6052 | 3686 | 0.013 | 10.54 | 178 | 83 | 0 | 95 | 11.57 | 7
whereas the
X_depth_mean
andY_depth_mean
is roughly in line with themosdepth
estimate.So, as you can foretell, this makes the sex check look confusing
I also run
samtools depth
on the sites VCF locations, and the mean from that is also 10.So I'm wondering how
gt_depth_mean
anddepth_mean
are estimated insomalier
?Thanks, Steve