PacificBiosciences / pb-human-wgs-workflow-snakemake

Workflow for the comprehensive detection and prioritization of variants in human genomes with PacBio HiFi reads
BSD 3-Clause Clear License
38 stars 20 forks source link

calculate_phrank #150

Closed lauragails closed 1 year ago

lauragails commented 1 year ago

Hello,

  1. Question to check my understanding RE phrank score: When reading the documentation, I assumed there would be 1 phrank score per phenotype studied, per gene, per individual. And this phrank score could be used to jumpstart clinical diagnoses.

ie title here is: "Calculate the "Phrank" phenotype match score for a list of phenotypes for every gene."

However there is 1 phrank score per gene in COHORT_phrank.tsv. What phenotype(s) are used to calculate the phrank score? Should this score be used to prioritize a gene in general? What is a good/high score?

This is my distribution of phrank scores across genes:

>=  <   4349
<0  0.6 2848.000000
0.6 1.2 0.000000
1.2 1.8 1052.000000
1.8 2.4 0.000000
2.4 3   0.000000
3   3.6 274.000000
3.6 4.2 0.000000
4.2 4.8 175.000000
4.8 5.4 0.000000
5.4 >6  0.000000

Corollary: When I look into svpack/COHORT.GRCh38.pbsv.svpack.tsv I see a table with the following columns #mode family_id sample_id chr:pos:ref:alt genotype(sample,dad,mom) SVTYPE SVLEN SVANN CIPOS MATEID END gene highest_impact depths(sample,dad,mom) allele_balance(sample,dad,mom) gene_impact_transcript lof clinvar phrank

The phrank scores seem to match the gene-level scores in the previous file.

If I ran every individual proband in its own cohort, would I get different phrank scores for each gene (and preferably for tested phenotypes)?

  1. Troubleshooting: This is the logifle for calculate_phrank, and I wanted to make sure nothing was impacted in the analysis
    
    `logs/calculate_phrank/COHORT.log`

/REDACTED/workflow/scripts/calculate_phrank.py:212: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details. cohort_list = yaml.load("".join(yamlfile))

williamrowell commented 1 year ago

Question to check my understanding RE phrank score: When reading the documentation, I assumed there would be 1 phrank score per phenotype studied, per gene, per individual. And this phrank score could be used to jumpstart clinical diagnoses.

The Phrank method generates a score for each gene to describe how well variants in that gene could explain a set of phenotypes. Phrank scores are not normalized, and the maximum phrank score increases with the number of phenotypes provided. Think of it as a way to rank sort candidates rather than a threshold.

For our implementation:

  1. Given a set of HPO terms associated with a case, generate a phrank score per gene and store in a lookup table.
  2. Lookup every rare variant with functional consequence in svpack.vcf or deepvariant.vcf and associate with the phrank score in the affected gene. This goes into the tsv files.

If I ran every individual proband in its own cohort, would I get different phrank scores for each gene (and preferably for tested phenotypes)?

Every unrelated proband should be in its own cohort with its own HPO terms. Think of cohorts as families.

Troubleshooting: This is the logifle for calculate_phrank, and I wanted to make sure nothing was impacted in the analysis

YAMLLoadWarning is a warning that a function call we used is deprecated. No error has occurred.

lauragails commented 1 year ago

Got it. So in cohort.yaml I put "HP:0000717" for the whole cohort, because the whole cohort does have autism (but almost certainly with differing underlying genetic contributions). I will treat each cohort as a family going forward.

RE algorithm: Approximately can I assume: The set of HPO terms would be approximated by looking at the terms connected N degrees from this phenotype in: resources/hpo/hpoDag.txt

Genes mapped to each HPO term are in: resources/hpo/ensembl.hpoPhenotype.tsv

Thank you again!

williamrowell commented 1 year ago

I know our use of cohort is confusing, and we'll try to make it clearer going forward in future workflows. Maybe "case" or "family" would have been a better choice of words?