akotlar closed 1 year ago
Done:
1. Drop phyloP and phastCons. Old and superseded by CADD.
2. Add LOEUF - https://www.nature.com/articles/s41586-020-2308-7
3. Add MPC
4. dbSNP 156 - Use the VCF version, and match on exact allele. Will require either a new utility to translate the dbSNP 156 VCF into a format that splits populations into separate fields, or a dbSNP VCF-specific build module (rather than re-using the `vcf` build type)
5. ClinVar - Use the VCF version and match on exact allele, but continue to allow overlap of large alleles (CNVs) in refSeq using the old strategy (reading from the ClinVar TSV)
6. CADD 1.6 - exact match
7. CADD 1.6 InDel - exact match
8. gnomAD 2.1.1 exomes
9. gnomAD 3.1 genomes
10. pLI at gene level
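For the "match on exact allele" items above (dbSNP, ClinVar, CADD), here is a minimal sketch of what exact matching could look like, as opposed to position-only overlap. This is not the existing build code: `split_alleles`, `build_index`, and `lookup` are hypothetical names, the index is a plain in-memory dict, and the raw INFO string is stored as-is (per-allele INFO values are not split out). It also assumes both sides use the same allele representation (no left-alignment or trimming is done here).

```python
from typing import Dict, Iterable, Iterator, Optional, Tuple

# (chrom, pos, ref, alt): all four must agree for an "exact" match.
AlleleKey = Tuple[str, int, str, str]


def split_alleles(vcf_line: str) -> Iterator[Tuple[AlleleKey, str]]:
    """Yield one (key, info) pair per ALT allele of a single VCF data line."""
    fields = vcf_line.rstrip("\n").split("\t")
    chrom, pos, _id, ref, alts = fields[:5]
    info = fields[7] if len(fields) > 7 else ""
    for alt in alts.split(","):
        if alt in (".", "*"):
            # Skip missing ALT and the spanning-deletion placeholder.
            continue
        yield (chrom, int(pos), ref, alt), info


def build_index(lines: Iterable[str]) -> Dict[AlleleKey, str]:
    """Index every allele in the VCF by its exact (chrom, pos, ref, alt) key."""
    index: Dict[AlleleKey, str] = {}
    for line in lines:
        if line.startswith("#"):
            continue  # header lines
        for key, info in split_alleles(line):
            index[key] = info
    return index


def lookup(index: Dict[AlleleKey, str], chrom: str, pos: int,
           ref: str, alt: str) -> Optional[str]:
    """Return the INFO string only if position AND both alleles match."""
    return index.get((chrom, pos, ref, alt))
```

With this, a query for `chr1:100 A>C` gets nothing back from a record that only lists `A>G,T` at that position, which is the behavior the list above asks for.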
Also, I would start with gnomAD 3 PASS ONLY. There is a lot
of weird, probably artifactual stuff in the non-PASS tranches that might
not be helpful. At most, if you include non-PASS, FLAG the fuck out of
it, and make sure it is totally clear that there is some sort of evidence
to suggest the calls / data at this site are suspect.
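A rough sketch of the PASS-only default with a loud flag on anything else, assuming records have already been parsed into dicts with a `FILTER` key holding the raw VCF FILTER column. The function names, the `include_non_pass` switch, and the `suspect` / `failed_filters` output fields are all made up for illustration, not part of any existing module. Note that a FILTER value of `.` (filters not applied) is treated as suspect here, which is a conservative assumption.

```python
from typing import Dict, Iterable, Iterator


def annotate_filter(record: Dict[str, str]) -> Dict[str, object]:
    """Copy the record and attach an explicit 'suspect' flag plus failed filters."""
    filt = record.get("FILTER", ".")
    out: Dict[str, object] = dict(record)
    # Only a literal PASS counts as clean; everything else gets flagged.
    out["suspect"] = filt != "PASS"
    # FILTER is a semicolon-separated list of the filters the site failed.
    out["failed_filters"] = [] if filt == "PASS" else filt.split(";")
    return out


def select_records(records: Iterable[Dict[str, str]],
                   include_non_pass: bool = False) -> Iterator[Dict[str, object]]:
    """Yield PASS records by default; optionally yield flagged non-PASS ones too."""
    for record in records:
        annotated = annotate_filter(record)
        if annotated["suspect"] and not include_non_pass:
            continue
        yield annotated
```

Keeping the failed filter names alongside the boolean means a downstream display can say why a site is suspect instead of only that it is.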
Also, I would include all the populations, and all the phenotype
sub-categories at the highest level, but not at the population level.
What do I mean?
I would include overall allele frequency and count as estimated by gnomAD 3.
I would include overall allele frequency and count in all the "phenotype
subgroups":
Non-neuro
Controls
XY-humans --- apparently males isn't a word
XX-humans
Non-heart disease
Left-handed red heads.
etc.
Whatever. There are like 5,000 of these subgroup phenotypes. I would
include all of them at the "top-level" of humans.
Below that I would only include the total population numbers: counts
and frequency.
NFE
FIN
AFR
SAS
EAS
....
I don't need Amish broken down between those born XX versus XY,
etc.
This might be tricky to implement in code, but I think it is worth it
for both storage and presentation purposes. I'm kinda serious about
this. I really don't want to need to see allele frequencies broken down
by sex in the Ashkenazi controls.
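One way to read the rule above, sketched in code: keep every subgroup breakdown at the all-humans level, and keep only the totals once a population is involved. `FreqGroup`, `keep`, and `select_groups` are hypothetical; the sketch assumes the gnomAD frequency fields have already been parsed into (subset, population, sex) labels, since the actual INFO key naming differs between gnomAD releases and would need to be checked against the release in use. Whether combined top-level breakdowns (e.g. a subset split by sex) should be kept is also an assumption here; the sketch keeps them.

```python
from typing import Dict, NamedTuple, Optional, Tuple


class FreqGroup(NamedTuple):
    subset: Optional[str]      # e.g. "non_neuro", "controls"; None = full dataset
    population: Optional[str]  # e.g. "nfe", "afr"; None = all populations
    sex: Optional[str]         # "XX" or "XY"; None = both


def keep(group: FreqGroup) -> bool:
    """Top level: keep everything. Population level: keep totals only."""
    if group.population is None:
        return True
    return group.subset is None and group.sex is None


def select_groups(
    freqs: Dict[FreqGroup, Tuple[int, float]]
) -> Dict[FreqGroup, Tuple[int, float]]:
    """Filter a {group: (allele_count, allele_frequency)} mapping."""
    return {group: value for group, value in freqs.items() if keep(group)}
```

Under this rule, overall NFE or ASJ counts survive, while "ASJ controls" or "ASJ XX" do not, which should cut both storage and display clutter.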
Lastly, Dave doesn't care about HGVS notation. I believe we'll want it for clinicians, however.
Just verifying here that the feedback's been collected.