PacificBiosciences / pb-human-wgs-workflow-snakemake

Workflow for the comprehensive detection and prioritization of variants in human genomes with PacBio HiFi reads
BSD 3-Clause Clear License

process_cohort output #151

Closed lauragails closed 1 year ago

lauragails commented 1 year ago

Hi Billy and team, I am going through my output from process_cohort and see some weird SNP names (along with many good ones):

    grep "chr7:141584066:C:C" ./COHORT/svpack/COHORT.GRCh38.pbsv.svpack.tsv | cut -f1,4
    hetalt chr7:141584066:C:C[chr4:91045814[
    hetalt chr7:141584066:C:C[chr4:91045814[
    hetalt chr7:141584066:C:C[chr4:91045814[

Likewise, the genotype columns (parsed, then fed through sort | uniq -c to generate counts) for this file give some results that I don't know how to interpret (i.e., what does -1 mean in the mom column?):

count | genotype: sample | genotype: dad | genotype: mom
-- | -- | -- | --
1 | 1 | -1 | 2
1 | 1 | 2 | -1
1 | 2 | -1 | 1
1 | 2 | -1 | -1
1 | 2 | 1 | -1
1 | 2 | -1 | 2
2 | 2 | 0 | 2
3 | 2 | 0 | 1
3 | 2 | 2 | 0
5 | 1 | 1 | -1
6 | 1 | -1 | 1
6 | 1 | 2 | 2
23 | 1 | 0 | 0
37 | 2 | 2 | 1
38 | 1 | 0 | 2
39 | 1 | 2 | 0
43 | 1 | 1 | 2
43 | 1 | 2 | 1
43 | 2 | 1 | 2
63 | 2 | 1 | 1
79 | 2 | 2 | 2
385 | 1 | 1 | 0
404 | 1 | 0 | 1
435 | 2 | . | .
595 | 1 | 1 | 1
3145 | 1 | . | .

Thank you!

williamrowell commented 1 year ago

>     grep "chr7:141584066:C:C" ./COHORT/svpack/COHORT.GRCh38.pbsv.svpack.tsv | cut -f1,4
>     hetalt chr7:141584066:C:C[chr4:91045814[
>     hetalt chr7:141584066:C:C[chr4:91045814[
>     hetalt chr7:141584066:C:C[chr4:91045814[

These are BND (breakend) entries, as described by the VCF specification.

So, in this example, the REF is C, and the ALT is C followed by bases starting at chr4:91045814 and continuing on the + strand.
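For anyone who wants to pick these IDs apart programmatically, here is a minimal sketch. It assumes the ID column has the form `chrom:pos:ref:alt` (as in the grep output above) and that the ALT follows one of the four breakend shapes from the VCF specification (`t[p[`, `t]p]`, `]p]t`, `[p[t`). The function names `parse_variant_id` and `parse_bnd_alt` are hypothetical helpers, not part of svpack, slivar, or this workflow.

```python
import re

# The four breakend ALT shapes in the VCF specification:
#   t[p[   t]p]   ]p]t   [p[t
# where t is the local base string and p is the mate position (chrom:pos).
BND_ALT = re.compile(
    r"^(?P<lead>[ACGTNacgtn.]*)"
    r"(?P<bracket>[\[\]])(?P<chrom>[^\[\]]+):(?P<pos>\d+)(?P=bracket)"
    r"(?P<trail>[ACGTNacgtn.]*)$"
)


def parse_variant_id(variant_id):
    """Split 'chrom:pos:ref:alt'; the ALT may itself contain a colon."""
    chrom, pos, ref, alt = variant_id.split(":", 3)
    return chrom, int(pos), ref, alt


def parse_bnd_alt(alt):
    """Return a dict describing a breakend ALT, or None if it is not one."""
    m = BND_ALT.match(alt)
    if m is None:
        return None
    lead, trail = m.group("lead"), m.group("trail")
    if bool(lead) == bool(trail):
        return None  # the local bases must sit on exactly one side
    return {
        "local_bases": lead or trail,
        "mate_chrom": m.group("chrom"),
        "mate_pos": int(m.group("pos")),
        # True when the local bases come before the joined mate sequence
        "local_bases_first": bool(lead),
        # "[" : the mate piece extends to the right of mate_pos
        # "]" : the mate piece extends to the left of mate_pos
        "mate_extends": "right" if m.group("bracket") == "[" else "left",
    }


chrom, pos, ref, alt = parse_variant_id("chr7:141584066:C:C[chr4:91045814[")
print(chrom, pos, ref, parse_bnd_alt(alt))
```

For the ID in question this reports local base C first, joined to sequence extending to the right of chr4:91045814, which matches the reading above.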

> Likewise, the genotype columns (parsed, then fed through sort | uniq -c to generate counts) for this file give some results that I don't know how to interpret (i.e., what does -1 mean in the mom column?):
>
> count | genotype: sample | genotype: dad | genotype: mom
> -- | -- | -- | --
> 1 | 1 | -1 | 2
> 1 | 1 | 2 | -1

slivar tsv expresses genotypes in an odd way.

slivar tsv genotype | genotype
-- | --
-1 | ./.
0 | 0/0
1 | 0/1
2 | 1/1

A "." genotype seems to be linked to some kind of error in interpreting the genotypes of symbolic variants like DUPs and INVs. An example I found listed a symbolic variant (a DUP with genotypes HOMALT, ungenotyped, HOMALT) twice: once with the correct genotypes (2,-1,2) and once with this weird "." genotype for both parents (2,.,.).
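If it helps when re-running the tally, the code mapping above can be applied before counting. This is a rough sketch only: `decode_genotype` and `tally_trios` are hypothetical helper names, the input is assumed to be already-parsed (sample, dad, mom) code triples, and "." is deliberately reported as unrecognized because its meaning is not established here.

```python
from collections import Counter

# slivar tsv genotype code -> VCF-style genotype, per the table above
SLIVAR_TSV_GENOTYPES = {
    "-1": "./.",
    "0": "0/0",
    "1": "0/1",
    "2": "1/1",
}


def decode_genotype(code):
    """Translate one slivar tsv genotype code; flag anything unexpected."""
    return SLIVAR_TSV_GENOTYPES.get(code.strip(), "unrecognized")


def tally_trios(rows):
    """Count decoded (sample, dad, mom) triples, like sort | uniq -c."""
    return Counter(tuple(decode_genotype(c) for c in row) for row in rows)


# The duplicated DUP described above: one correct row, one with "." parents
rows = [("2", "-1", "2"), ("2", ".", ".")]
for trio, count in tally_trios(rows).items():
    print(count, trio)
```

Keeping "." as a separate bucket makes the affected rows easy to pull out and inspect rather than silently folding them into ./. calls.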