Missing variants in process_cohort .tsv output (process_cohort)

McClatcheyM commented 1 year ago

I am working on a small rare diseases clinical diagnostics research project with human WGS (Cardiff, UK) We've already run the snakemake pipeline (minus Hifasm assembly) for one trio family with two more families currently being processed.

After reviewing the output data from process_cohort for the cohort/{cohort}.{assembly}.deepvariant.phased.slivar.tsv (and comp. het file) of the first family, we have noticed that a positive control variant (from prior SRS-WGS on 2 platforms) is absent in the .tsv. This variant is a maternally-inherited X-linked 4-bp insertion (indel), predicted to cause a frameshift in a gene BCOR.

The variant is present in the filtered phased.slivar.vcf.gz and is visible in IGV in both the proband and mother from the .bam files. However, it appears to have been excluded from the parsed .tsv output. I'm concerned about why this variant is being filtered from the final output. It appears to meet the required filtering criteria already. We're obviously aware of this positive control but it doesn't exclude the possibility that other variants that might have been filtered out. I've attached the VCF header + variant in question for review (all de-identified) [GRCh38]. Sample order in FORMAT = Father - Mother - Proband

I'm wondering whether this is a simple parsing filter issue but can't determine the cause from the cohort_slivar.smk rule. I note that the annotation includes multiple transcript outputs (9 in total) and the last of these is a 5'UTR csq but am not sure whether this would be a cause of the issue. Obviously the canonical/MANE transcript and majority are high impact, so I would expect the variant to have remained in the .tsv file.

Any advice would be much appreciated as the rest of the pipeline seems to have worked and we are keen to have accurate output for the purposes of research diagnostics. Please let me know if you require any further information

Best Wishes, Martin BCOR_var.txt

williamrowell commented 1 year ago

Thanks for the detail. It really helps us look into things like this.

I can spot why this didn't make it into the TSV, but it will take some explanation. The first call to slivar tsv only acts on variants that have been flagged as dominant, x_dominant, recessive, or x_recessive. The rule for x_dominant, as taken from line 42 of your VCF header, is:

(variant.CHROM == chrX) \
&& fam.every(segregating_dominant_x) \
&& INFO.gnomad_ac <= 4 \
&& INFO.hprc_ac <= 4

segregating_dominant_x is:

function segregating_dominant_x(s) {
  // this is an internal function only called after checking sample quality and on X
...
  // skipping some lines pertaining to males
...
  // this block enforces inherited dominant, but not find de novos
  if(("mom" in s) || ("dad" in s)) {
    if(!((("mom" in s) && s.mom.affected && s.mom.het) || (s.dad && s.dad.affected))) { return false;}
    if(("dad" in s) && !hq1(s.dad)){ return false; }  // <-- dad must be "high quality"
    if(("mom" in s) && !hq1(s.mom)){ return false; }
  }
...

I think it might be failing the hq1(s.dad) expression on line 87.

The function that checks for sample quality has a minimum GQ of 20. The dad.GQ for this variant is 18, so it's below the quality check.

GT:DP:AD:GQ:PL:RNC:PS   0/0:16:16,0:18:0,18,419:..:.

I think the GQ is only low here because dad is hemizygous and only has ~50% coverage depth. I'd be interested to get an opinion from @brentp on this. Does it make sense to accept lower dad.GQ for segregating_dominant_x? Are there other features we could use in segregating_dominant_x to avoid filtering hemizygous variants?

Also for @brentp (hopefully I'm not annoying you by summoning you), looks like I shouldn't be calling segregating_dominant_x directly. Seems like it should be redundant with segregating_dominant. I don't think that is related to losing this variant, but seems like something I should take care of.

McClatcheyM commented 1 year ago

This is really helpful feedback also and mostly makes sense to me. Thanks again. I hadn't checked the underlying functions from the slivar_function.js I also suspect that the hemizygous DP of the paternal sample is likely to pose a problem for the GQ.

FYI, I think these issues are part of a wider/global problem with filtering of these segregating_dominant_x variants in clinical analyses. Indeed, we'd picked up this variant in SR-WGS from the 100,000 Genomes Project originally, as it was reported to us by Genomics England. The patient then underwent a second round of SR-WGS (also Illumina) but through our local clinical diagnostics lab who routinely mask these segregating X-linked dominant variants to minimise analysis for the clinical scientists.

From the research clinical diagnostics perspective we'd quite like to keep these variants included in an end-output human-readable .tsv . Many of my colleagues don't have the informatics skills to dive in to the VCF but can easily filter through a text file in a GUI spreadsheet package or something similar.

@williamrowell I should've raised it previously but slivar_function.js wasn't included in the resources sub-directory that were specified in config.yaml. The resources contents were provided from our regional contact to our sequencing lab. I had to import the slivar jscript independently as a result. I suspect this was an old version of the 'resources' sub-directory however, so perhaps it isn't an issue for others now but I thought I'd make you aware

brentp commented 1 year ago

thanks for pointing this out. I think you're right that you don't need to call segregating_dominant_x directly. It's called by segregating_dominant.

I'm looking into a straight-forward way to have a lower GQ for males on chrX.

williamrowell commented 1 year ago

@brentp

I'm looking into a straight-forward way to have a lower GQ for males on chrX.

We'd need that for chrY as well, right?

brentp commented 1 year ago

you might be on your own for chrY. Or PR accepted. :)

brentp commented 1 year ago

I pushed a fix for this that defaults the male GQ cutoff on X to 10 and fixes some other bugs with segregating_dominant_x.

williamrowell commented 1 year ago

@McClatcheyM

From the research clinical diagnostics perspective we'd quite like to keep these variants included in an end-output human-readable .tsv . Many of my colleagues don't have the informatics skills to dive in to the VCF but can easily filter through a text file in a GUI spreadsheet package or something similar.

I completely understand. Tertiary analysis is tricky in a workflow like this, because it isn't one-size-fits-all. Every team, and even sometimes individual projects within a same team, has different requirements. For instance, we wrote our implementation to work with singletons, duos, trios, or arbitrary "cohorts". This means that we aren't using the more specific slivar expr --trio expressions that might result in better filtering for trios. Generalization also complicates our compound het analysis a little bit.

It sounds like it would beneficial for your group to customize the rules to fit your use case. We're currently using the command below (although I'm thinking of altering it slightly as a result of the conversations in this thread). I've interspersed some comments to explain what you might want to change.

# `pslivar --process {threads} --fasta {input.ref}` is interchangeable with `slivar expr`
# if you don't mind running single-threaded
pslivar --processes {threads} \
    --fasta {input.ref}\
    --pass-only \
# slivar-functions.js has some minimum values on line 1 that you might want to change
    --js slivar-functions.js \
# if you think our frequency cutoffs are to high or low
# you can adjust them on the next few lines
    --info 'variant.FILTER=="PASS" && INFO.gnomad_af <= 0.01 && INFO.hprc_af <= 0.01 && INFO.gnomad_nhomalt <= 4 && INFO.hprc_nhomalt <= 4' \
    --family-expr 'recessive:fam.every(segregating_recessive)' \
    --family-expr 'x_recessive:(variant.CHROM == "chrX") && fam.every(segregating_recessive_x)' \
    --family-expr 'dominant:fam.every(segregating_dominant) && INFO.gnomad_ac <= 4 && INFO.hprc_ac <= 4' \
    --family-expr 'x_dominant:(variant.CHROM == "chrX") && fam.every(segregating_dominant_x)  && INFO.gnomad_ac <= 4 && INFO.hprc_ac <= 4' \
    --sample-expr 'comphet_side:sample.het && sample.GQ > 5' \
    --gnotate {input.gnomad_af} \
    --gnotate {input.hprc_af} \
    --vcf {input.bcf} \
    --ped {input.ped} \
    | bcftools csq -l -s - --ncsq 40 \
        -g {input.gff} -f {input.ref} - -o {output}

Only variants that satisfy all of the --info filters and at least one of the expressions (recessive, x_recessive, dominant, x_dominant) will make it through slivar and into the annotated output VCF.

If you want to change the GQ/AB/DP thresholds, you can edit slivar-functions.js in your resources/slivar directory.
If you want to change the maximum AF or AC, you can edit config.yaml lines 56-62.
If you want to change the filter expressions, you'll need to edit them within rules/cohort_slivar.smk. If you remove --family-expr expressions or add new ones, you'll want to modify the slivar tsv command as well.

Since I stated writing this, @brentp has already introduced some fixes to address segregating_dominant_x. I'll integrate these in the next couple weeks, but I still think it might be useful for your group to consider customizing tertiary analysis to fit your needs.

@williamrowell I should've raised it previously but slivar_function.js wasn't included in the resources sub-directory that were specified in config.yaml.

Sorry about that. We've typically included slivar_function.js in the resources directory on the ftp server we've been pointing users at. Tarballing references and resources and putting them on Zenodo has been on the whiteboard for a while, but I think it might get pushed back until we finish porting this workflow to WDL.

williamrowell commented 1 year ago

@brentp

Just want to make sure I understand this correctly. If I comment out the lines 86-91 in segregating_dominant_x(), will this function also return de novos?

 // this block enforces inherited dominant, but not find de novos
   if(("mom" in s) || ("dad" in s)) {
     // false if affected mother isn't het
     // false if father is affected
     if(!((("mom" in s) && s.mom.affected && s.mom.het) || (s.dad && s.dad.affected))) { return false;}
     // false if either parent isn't hq
     if(("dad" in s) && !hq1(s.dad, true)){ return false; }
     if(("mom" in s) && !hq1(s.mom, true)){ return false; }
   }

brentp commented 1 year ago

I don't think commenting that out will make it return de novos.

williamrowell commented 1 year ago

Sorry, I worded that poorly. Will it prevent the function from excluding X-linked de novos?

brentp commented 1 year ago

I think that block is just finding x-linked dominant. But it won't find de novos.

PacificBiosciences / pb-human-wgs-workflow-snakemake

Missing variants in process_cohort .tsv output (process_cohort) #127