bystrogenomics / bystro

Natural Language Search and Analysis of High Dimensional Genomic Data
Mozilla Public License 2.0

Documentation: Gather feedback on annotations to add for V1.0.0 release (do we add all gnomad annotations, etc) #256

Closed akotlar closed 1 year ago

akotlar commented 1 year ago

Done:

Consensus w/ Dave:

1. Drop phyloP and phastCons. Old and superseded by CADD.
2. Add LOEUF - https://www.nature.com/articles/s41586-020-2308-7
3. Add MPC
4. dbSNP 156 - Use the VCF version, and match on exact allele. This will require either a new utility that translates the dbSNP 156 VCF into a format with each population in its own field, or a dbSNP VCF-specific build module (rather than re-using the `vcf` build type)
5. Clinvar - Use the VCF version, match on exact allele, but continue to allow overlap of large alleles (CNVs) in refSeq using the old strategy (reading from Clinvar tsv)
6. CADD 1.6 - exact match
7. CADD 1.6 InDel - exact match
8. gnomad 2.1.1 exomes
9. gnomad 3.1 genomes
10. pLI at gene level
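
For item 4, a minimal sketch of what the "separate out populations into separate fields" utility might do. It assumes the population frequencies arrive as a single INFO value laid out as `Pop1:ref,alt1,...|Pop2:ref,alt1,...` with `.` for missing values; that layout, the function names, and the `<population>.alt_freq` field naming are assumptions for illustration, not the actual Bystro implementation.

```python
from typing import Dict, List, Optional


def split_freq(freq: str, n_alt: int) -> Dict[str, List[Optional[float]]]:
    """Split a FREQ-style value into per-population ALT frequencies.

    Assumed layout: "Pop1:ref,alt1,alt2|Pop2:ref,alt1,alt2", "." = missing.
    """
    by_pop: Dict[str, List[Optional[float]]] = {}
    for entry in freq.split("|"):
        pop, _, values = entry.partition(":")
        alt_values = values.split(",")[1:]  # first value is the REF frequency
        if len(alt_values) != n_alt:
            raise ValueError(f"{pop}: expected {n_alt} ALT values, got {len(alt_values)}")
        by_pop[pop] = [None if v == "." else float(v) for v in alt_values]
    return by_pop


def per_allele_fields(freq: str, alts: List[str]) -> List[dict]:
    """Build one flat record per ALT allele, one field per population.

    Keeping one record per allele is what makes exact-allele matching possible.
    """
    by_pop = split_freq(freq, len(alts))
    return [
        {"alt": alt, **{f"{pop}.alt_freq": vals[i] for pop, vals in by_pop.items()}}
        for i, alt in enumerate(alts)
    ]
```

Each multiallelic site then yields one record per ALT allele, so a lookup can match on the exact allele rather than on position alone.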

Gnomad details:

Also, I would start with gnomad 3 PASS ONLY.   There is a lot 
of weird, probably artifactual stuff in the non-PASS tranches that might
not be helpful.  At most, if you include non-PASS, FLAG the fuck out of
it, and make sure it is totally clear that there is some sort of evidence 
to suggest the calls / data at this site are suspect.
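
A rough sketch of the PASS-only rule, assuming tab-delimited VCF data lines where FILTER is the seventh column. The function name, the `keep_non_pass` switch, and the `suspect` field are hypothetical.

```python
from typing import Optional


def classify_site(vcf_line: str, keep_non_pass: bool = False) -> Optional[dict]:
    """Return a small record for a VCF data line, or None if the site is dropped.

    PASS sites are always kept. Non-PASS sites are dropped by default; if
    kept, they are explicitly marked as suspect along with the filters hit.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    filters = fields[6]  # FILTER is the 7th fixed column in VCF
    if filters == "PASS":
        return {"filter": filters, "suspect": False}
    if keep_non_pass:
        return {"filter": filters, "suspect": True}
    return None
```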

Also, I would include all the populations, and all the phenotype
sub-categories at the highest level, but not at the population level.

What do I mean?

I would include overall allele frequency and count as estimated by gnomad3

I would include overall allele frequency and count in all the "phenotype
sub groups"

Non-neuro
Controls
XY-humans --- apparently males isn't a word
XX-humans
Non-heart disease
Left-handed red heads.
etc.

Whatever.  There are like 5,000 of these subgroup phenotypes.   I would 
include all of them at the "top-level" of humans.

Below that I would only include the total population numbers: counts
and frequency.

NFE
FIN
AFR
SAS
EAS
....

I don't need Amish broken down between those born XX versus XY,
etc.

This might be tricky to implement in code, but I think it is worth it 
for both storage and presentation purposes.  I'm kinda serious about
this. I really don't want to need to see allele frequencies broken down 
by sex in the Ashkenazi controls.
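
One way to read the selection rule above, as a sketch. It assumes each gnomad frequency group has already been parsed into a (subset, population, sex) description; the `FreqGroup` type and `keep_group` function are hypothetical, and whether combined top-level groups (e.g. a phenotype subset split by sex) should be kept is an open interpretation that this sketch answers with "yes".

```python
from typing import NamedTuple, Optional


class FreqGroup(NamedTuple):
    subset: Optional[str]      # e.g. "non_neuro", "controls"; None = all samples
    population: Optional[str]  # e.g. "nfe", "fin", "afr"; None = all populations
    sex: Optional[str]         # "XX", "XY"; None = both


def keep_group(group: FreqGroup) -> bool:
    """Keep every top-level group, but only the totals for each population."""
    if group.population is None:
        # Top level: overall numbers plus all subgroup breakdowns.
        return True
    # Population level: totals only, no subset or sex breakdown.
    return group.subset is None and group.sex is None
```

Under this reading, `FreqGroup(None, "nfe", None)` is kept, while `FreqGroup("controls", "asj", "XY")` (sex within Ashkenazi controls) is dropped.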

Lastly, Dave doesn't care about HGVS notation. I believe we'll want this for clinicians, however.

poneill commented 1 year ago

Just verifying here that the feedback's been collected.