brentp / somalier

fast sample-swap and relatedness checks on BAMs/CRAMs/VCFs/GVCFs... "like damn that is one smart wine guy"
MIT License
254 stars 35 forks source link

How're `gt_depth_mean` and `depth_mean` calculated? #106

Open SHuang-Broad opened 1 year ago

SHuang-Broad commented 1 year ago

Hi,

As we continue to test out somalier for QC on our long reads data (currently for sex check), we are seeing something a bit strange.

The sample we ran this on has an estimated coverage of 10 by mosdepth (single CCS SMRT cell sample). However, as you can see from the table below, the gt_depth_mean and depth_mean are estimated to be approximately 1.5 of that,

family_id | sample_id | paternal_id | maternal_id | sex | phenotype | original_pedigree_sex | gt_depth_mean | gt_depth_sd | depth_mean | depth_sd | ab_mean | ab_std | n_hom_ref | n_het | n_hom_alt | n_unknown | p_middling_ab | X_depth_mean | X_n | X_hom_ref | X_het | X_hom_alt | Y_depth_mean | Y_n

-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- family | 1557026 | paternal | maternal | 1 | -9 | male | 15.3 | 6.4 | 14.6 | 6.8 | 0.43 | 0.56 | 4601 | 3045 | 6052 | 3686 | 0.013 | 10.54 | 178 | 83 | 0 | 95 | 11.57 | 7

whereas the X_depth_mean and Y_depth_mean is roughly in line with the mosdepth estimate.

So, as you can foretell, this makes the sex check look confusing

Screen Shot 2022-09-13 at 11 12 06 AM

I also run samtools depth on the sites VCF locations, and the mean from that is also 10.

So I'm wondering how gt_depth_mean and depth_mean are estimated in somalier?

Thanks, Steve

brentp commented 1 year ago

Hi, I see what you mean. Can you run somalier relate with --min-depth 5 and let me know if the result matches more what you expect? With a cutoff of 7 for a sample with mean depth of 10, I suspect many lower depths are getting skipped and this shifts the mean depth of (what I call) genotyped sites up. Indeed, this is a problem, but should be easily fixed.

SHuang-Broad commented 1 year ago

I actually used --min-depth 4 first, and here's the table

family_id | sample_id | paternal_id | maternal_id | sex | phenotype | original_pedigree_sex | gt_depth_mean | gt_depth_sd | depth_mean | depth_sd | ab_mean | ab_std | n_hom_ref | n_het | n_hom_alt | n_unknown | p_middling_ab | X_depth_mean | X_n | X_hom_ref | X_het | X_hom_alt | Y_depth_mean | Y_n

-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- family | 1557026 | paternal | maternal | 1 | -9 | male | 14.7 | 6.7 | 14.6 | 6.8 | 0.53 | 0.45 | 4948 | 3177 | 6668 | 2591 | 0.013 | 10.54 | 178 | 83 | 0 | 95 | 11.57 | 7

You can see the two fields are similar to the table I shared above.

brentp commented 1 year ago

Indeed, that rules out my guess. Is there any chance that you can share the extracted .somalier file?

I think it is odd that there are so many n_hom_alt,

Somalier is tuned for higher coverage samples, but I'm not seeing anything else in your output or in the code that would stop it from working on your samples.

I will also check that the cigar stuff is correct. It could be that it's close enough for illumina reads, but for long reads, whatever bug is triggered enough to be a problem.

SHuang-Broad commented 1 year ago

Thanks for this list of detailed suggestions. Here are my answers:

We also ran DeepVariant on this sample. Running the following to count PASSing SNPs, we've found 1.76 million HOM_ALTs and 3.03 million HETs overall, which is not the same het/hom_alt ratio as in the somalier tables.

zcat < 1557026.deepvariant_pepper.vcf.gz  | grep -v "^#" | grep -F 'PASS' | awk 'BEGIN {OFS="\t"} {a=length($4); b=length($5); if (a==1 && b==1) print}' | grep -c "1/1"

Regarding the .somalier file, I'm afraid I cannot share this as the samples are protected. If you can pin down which part of the file you'd be interested in, I can probably share that part.

Thanks for jumping on this so quickly! Steve

brentp commented 1 year ago

Thanks for the clear report. It seems likely that there's something with somalier. I expect it's easily fixed, but I'll have to find the problem. I'll check for a few long-read samples to run it on and see if I can replicate.

If you have a multi-sample DeepVariant VCF with this sample, I'd be interested to see if those results are better. If it is single-sample, there are other problems with somalier not having enough info about hom-ref.

SHuang-Broad commented 1 year ago

Unfortunately, it's not a joint VCF yet.

Can you elaborate on this, I'm not quite getting the point:

there are other problems with somalier not having enough info about hom-ref

brentp commented 1 year ago

Can you elaborate on this, I'm not quite getting the point:

there are other problems with somalier not having enough info about hom-ref

Actually, that's just for relatedness stuff, not for single-sample information, so you could try running on the single-sample VCF file.

SHuang-Broad commented 1 year ago

Hi Brent,

Sorry for dropping the ball on this. I followed your advice and ran with the single sample DeepVariant VCF, and here's the table, which give the same inference

family_id | sample_id | paternal_id | maternal_id | sex | phenotype | original_pedigree_sex | gt_depth_mean | gt_depth_sd | depth_mean | depth_sd | ab_mean | ab_std | n_hom_ref | n_het | n_hom_alt | n_unknown | p_middling_ab | X_depth_mean | X_n | X_hom_ref | X_het | X_hom_alt | Y_depth_mean | Y_n

-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- family | 1557026 | paternal | maternal | 1 | -9 | male | 11.2 | 3.0 | 5.7 | 5.8 | 0.33 | 0.52 | 2136 | 3849 | 1444 | 9955 | 0.002 | 8.16 | 37 | 16 | 0 | 21 | 9.00 | 2

Changing the depth requirement to 4 didn't change things materially. Both, unsurprisingly, generate a HET to HOM_ALT ratio closer to the WGS ratio (duh).

family_id | sample_id | paternal_id | maternal_id | sex | phenotype | original_pedigree_sex | gt_depth_mean | gt_depth_sd | depth_mean | depth_sd | ab_mean | ab_std | n_hom_ref | n_het | n_hom_alt | n_unknown | p_middling_ab | X_depth_mean | X_n | X_hom_ref | X_het | X_hom_alt | Y_depth_mean | Y_n

-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- family | 1557026 | paternal | maternal | 1 | -9 | male | 10.5 | 3.4 | 5.7 | 5.8 | 0.44 | 0.37 | 2483 | 4263 | 1642 | 8996 | 0.002 | 8.16 | 37 | 16 | 0 | 21 | 9.00 | 2

Thanks, Steve

Screen Shot 2022-09-15 at 2 52 17 PM
brentp commented 1 year ago

I'm really confused about what's going on. This is showing a much lower depth from DeepVariant calls. The ratio of homalts is dramatically reduced from DV indicating there could be a problem with the somalier genotyping on long reads. But, the ratio of autosomal to chrX, chrY depth is relatively unchanged. So, I think two things are true:

  1. something is odd about the coverage on your sample. I would look at the mosdepth summary output and see if you have some trisomies. you can use covdist or your own plotting to look at the output in more ... depth.
  2. there is a bug in the genotyping in somalier for long reads (this doesn't affect your results related to sex vs autosome depth, but I will look into it).
SHuang-Broad commented 1 year ago

I need to re-run mosdepth on that sample to get the plots, but here's the summary from the previous run

chrom   length  bases   mean    min max
chr1    248956422   2760787872  11.09   0   7801
chr2    242193529   2618785137  10.81   0   361
chr3    198295559   2156666022  10.88   0   1947
chr4    190214555   2199934962  11.57   0   5891
chr5    181538259   1954884959  10.77   0   1477
chr6    170805979   1868390025  10.94   0   259
chr7    159345973   1725551069  10.83   0   891
chr8    145138636   1584893778  10.92   0   642
chr9    138394717   1306272197  9.44    0   199
chr10   133797422   1514116404  11.32   0   1229
chr11   135086622   1467878253  10.87   0   90
chr12   133275309   1422786382  10.68   0   627
chr13   114364328   1138077672  9.95    0   953
chr14   107043718   957978895   8.95    0   91
chr15   101991189   894579432   8.77    0   112
chr16   90338345    944218678   10.45   0   4696
chr17   83257441    870271120   10.45   0   1345
chr18   80373285    875065734   10.89   0   121
chr19   58617616    538146550   9.18    0   94
chr20   64444167    718687371   11.15   0   369
chr21   46709983    433343387   9.28    0   374
chr22   50818468    402498377   7.92    0   1216
chrX    156040895   832630526   5.34    0   155
chrY    57227415    225360913   3.94    0   2369

It doesn't look like it's trisomy.

brentp commented 1 year ago

So, mosdepth shows about the expected coverage ratios for a male, though the X is a bit high.

DeepVariant+somalier does not so it's not a problem to specific to how somalier is counting. So, perhaps there is something about the sites in the autosome chosen by somalier that are biased to have lower coverage, or the sites on chrX or chrY are biased to have higher coverage in CCS. The autosome is entirely coding sites, but the sites on chrX and chrY are any common sites, not necessarily coding.

Somalier does use sites near start of chromosomes X and Y, so maybe there is high coverage near the telomeres?

@zeeev, are there cases where CCS has bias to lower coverage in high-GC (in genes)?

SHuang-Broad commented 1 year ago

quick responses

So, perhaps there is something about the sites in the autosome chosen by somalier that are biased to have lower coverage

I actually checked the coverage (with samtools depth) on these SNP sites (the released VCF), the mean coverage comes back to be 10 as well.

Somalier does use sites near start of chromosomes X and Y, so maybe there is high coverage near the telomeres?

Are PARs excluded in these sites?


another curious discovery, regarding sex "bias"?

HG00514

Based on the table posted in my previous ticket #105 (though the table posted there is malformed, corrected below)

family_id | sample_id | paternal_id | maternal_id | sex | phenotype | original_pedigree_sex | gt_depth_mean | gt_depth_sd | depth_mean | depth_sd | ab_mean | ab_std | n_hom_ref | n_het | n_hom_alt | n_unknown | p_middling_ab | X_depth_mean | X_n | X_hom_ref | X_het | X_hom_alt | Y_depth_mean | Y_n

-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- chs | UnnamedSample | 0 | 0 | 2 | 0 | female | 26.6 | 10.9 | 26.5 | 11.0 | 0.51 | 0.43 | 4964 | 3826 | 6122 | 2472 | 0.010 | 24.57 | 324 | 115 | 74 | 135 | 0.00 | 0 |  

The coverage estimation doesn't look off for that sample (~30X), yet it's the same scenario where HOM.ALT calls are way more than HET calls.

Note this is HG00514, a female sample. Is this related to the sex of the sample?

Check on more protected samples

So I checked the results on two more protected samples, one female and one male, both around 10X. Interestingly, the two male samples are seeing the inflated coverage (ratio 1.5), but not for the female sample.

Check on HG002

This is a male sample, the bam is released by GIAB/NIST https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/HG002_NA24385_son/PacBio_CCS_15kb_20kb_chemistry2/GRCh38/

and mosdepth gives the following coverage summary

chr1    248956422   13969946903 56.11   0   28904
chr2    242193529   13700025724 56.57   0   2582
chr3    198295559   11329262682 57.13   0   16236
chr4    190214555   11662996790 61.31   0   40877
chr5    181538259   10261322700 56.52   0   9964
chr6    170805979   9861766821  57.74   0   7307
chr7    159345973   8824929365  55.38   0   1078
chr8    145138636   8252258159  56.86   0   2810
chr9    138394717   6700195676  48.41   0   719
chr10   133797422   7677494877  57.38   0   6042
chr11   135086622   7504549211  55.55   0   984
chr12   133275309   7441148334  55.83   0   668
chr13   114364328   5793614946  50.66   0   1978
chr14   107043718   5054143320  47.22   0   1580
chr15   101991189   4655047530  45.64   0   661
chr16   90338345    4784597624  52.96   0   27083
chr17   83257441    4376055735  52.56   0   1802
chr18   80373285    4457796186  55.46   0   647
chr19   58617616    2746694349  46.86   0   734
chr20   64444167    3634695784  56.40   0   3900
chr21   46709983    2339692268  50.09   0   3092
chr22   50818468    2052775521  40.39   0   26160
chrX    156040895   4369507111  28.00   0   305
chrY    57227415    1323232774  23.12   0   18494
...
total   3099844859  164920005749    53.20   0   73360

What I find is somewhat following this "sex-bias" trend (if we can call it that way). The scaled depth on X and Y are—correctly—both around 1. But look at the table below

family_id | sample_id | paternal_id | maternal_id | sex | phenotype | original_pedigree_sex | gt_depth_mean | gt_depth_sd | depth_mean | depth_sd | ab_mean | ab_std | n_hom_ref | n_het | n_hom_alt | n_unknown | p_middling_ab | X_depth_mean | X_n | X_hom_ref | X_het | X_hom_alt | Y_depth_mean | Y_n

-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- family | HG002 | paternal | maternal | 1 | -9 | male | 75.7 | 25.3 | 75.7 | 25.3 | 0.51 | 0.40 | 4553 | 5112 | 5627 | 2092 | 0.004 | 36.66 | 353 | 170 | 0 | 183 | 37.73 | 15

Here's the somalier file for HG002

HG002.somalier.zip

This is really puzzling now.

:thinking:

brentp commented 1 year ago

I think there is some bias with higher coverage near start of the chromosomes. I will look into this next week. Thanks very much for following up. Should have a fix soon. Can you verify which sites file you used for somalier?

SHuang-Broad commented 1 year ago

It's this on the release page

https://github.com/brentp/somalier/files/3412456/sites.hg38.vcf.gz

I should also mention that we are using the no_alt version of GRCh38

ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz

And thank you for following through on this! Steve

brentp commented 1 year ago

you could also try: https://github.com/brentp/somalier/files/4566475/sites.hg38.rna.vcf.gz but I will have a look at this problem.

brentp commented 1 year ago

I had a look and didnt find anything obvious, but I'll look again next week. I rewrote the pileup and got the same answer (and verified it was correct), so I don't think that's the issue.

SHuang-Broad commented 1 year ago

Thanks, Brent!

I'm a little confused by this

I rewrote the pileup and got the same answer (and verified it was correct)

Could you also try your new code on the GIAB-HG002 and see? https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/HG002_NA24385_son/PacBio_CCS_15kb_20kb_chemistry2/GRCh38/

I'm observing inflated coverage estimation on that BAM by somalier (~1.5X inflation, see above).

Thanks! Steve

brentp commented 1 year ago

I mean that it's counting alleles correctly, but I can see the issue you mention. So I'm still confused.