clingen-data-model / clingen-hail-reports

Performs filtering against gnomAD and ClinVar datasets. Uses Hail to report records with a population FAF above certain thresholds by gene.
1 stars 0 forks source link

Nondeterministic popmax populations in gnomad freq reports #7

Closed theferrit32 closed 2 years ago

theferrit32 commented 3 years ago

Issue: When all FAF95 values are zero, FAF report always reports the popmax as the first gnomad superpopulation in the gnomad faf_index_dict

Result: This could result in nondeterministic popmax populations being reported, but in practice has resulted in afr being reported as the popmax for every variant with all 0 superpopulation FAF95s.

gnomAD reports these cases as null popmax, which is more meaningful, and should be done here.

Example: https://gnomad.broadinstitute.org/variant/16-2185899-C-T?dataset=gnomad_r2_1

KelseaChang5 commented 2 years ago

@larrybabb @KelseaChang5 to rename - more succinct?

theferrit32 commented 2 years ago

Example hail record from gnomad with added annotations using current popmax code in this repo. The faf records with "gnomad pop" comments are the relevant ones. Here they are all 0.0. When the faf_index_dict on the gnomad hail table is collected into a python list it happens to put them in alphabetic order based on label, yielding afr.

Struct(
    locus=Locus(contig=1, position=10114, reference_genome=GRCh37), alleles=['T', 'C'], 
    faf=[
        Struct(meta=frozendict({'group': 'adj'}), faf95=0.0, faf99=0.0), 
        Struct(meta=frozendict({'group': 'adj', 'pop': 'nfe'}), faf95=0.0, faf99=0.0), # gnomad pop
        Struct(meta=frozendict({'group': 'adj', 'pop': 'afr'}), faf95=0.0, faf99=0.0), # gnomad pop
        Struct(meta=frozendict({'group': 'adj', 'pop': 'eas'}), faf95=0.0, faf99=0.0), # gnomad pop
        Struct(meta=frozendict({'group': 'adj', 'pop': 'amr'}), faf95=0.0, faf99=0.0), # gnomad pop
        Struct(meta=frozendict({'group': 'adj'}), faf95=0.0, faf99=0.0), 
        Struct(meta=frozendict({'group': 'adj', 'pop': 'nfe'}), faf95=0.0, faf99=0.0), 
        Struct(meta=frozendict({'group': 'adj', 'pop': 'afr'}), faf95=0.0, faf99=0.0), 
        Struct(meta=frozendict({'group': 'adj', 'pop': 'eas'}), faf95=0.0, faf99=0.0), 
        Struct(meta=frozendict({'group': 'adj', 'pop': 'amr'}), faf95=0.0, faf99=0.0), 
        Struct(meta=frozendict({'group': 'adj'}), faf95=0.0, faf99=0.0), 
        Struct(meta=frozendict({'group': 'adj', 'pop': 'nfe'}), faf95=0.0, faf99=0.0), 
        Struct(meta=frozendict({'group': 'adj', 'pop': 'afr'}), faf95=0.0, faf99=0.0), 
        Struct(meta=frozendict({'group': 'adj', 'pop': 'eas'}), faf95=0.0, faf99=0.0), 
        Struct(meta=frozendict({'group': 'adj', 'pop': 'amr'}), faf95=0.0, faf99=0.0), 
        Struct(meta=frozendict({'group': 'adj'}), faf95=0.0, faf99=0.0), 
        Struct(meta=frozendict({'group': 'adj', 'pop': 'nfe'}), faf95=0.0, faf99=0.0), 
        Struct(meta=frozendict({'group': 'adj', 'pop': 'afr'}), faf95=0.0, faf99=0.0), 
        Struct(meta=frozendict({'group': 'adj', 'pop': 'eas'}), faf95=0.0, faf99=0.0), 
        Struct(meta=frozendict({'group': 'adj', 'pop': 'amr'}), faf95=0.0, faf99=0.0)], 
    popmax_faf=Struct(meta=frozendict({'group': 'adj', 'pop': 'afr'}), faf95=0.0, faf99=0.0), 
    popmax_index_dict_key='gnomad_afr', 
    popmax_faf_pop_freq=Struct(AC=0, AF=0.0, AN=2168, homozygote_count=0))
larrybabb commented 2 years ago

@KelseaChang5 can you follow up with Steve Harrison on this to verify that he understands the concern/bug being raised by @theferrit32? If he wants @theferrit32 to make changes to resolve this going forward then please capture that change here and assign to @theferrit32 otherwise, document that Steven understands and accepts the risk related to the potential misrepresentation of data that may be occurring.

KelseaChang5 commented 2 years ago

@KelseaChang5 can you follow up with Steve Harrison on this to verify that he understands the concern/bug being raised by @theferrit32? If he wants @theferrit32 to make changes to resolve this going forward then please capture that change here and assign to @theferrit32 otherwise, document that Steven understands and accepts the risk related to the potential misrepresentation of data that may be occurring.

Will do!

theferrit32 commented 2 years ago

@larrybabb @KelseaChang5 I think I'll just go ahead and fix this now while it's at the front of my mind and the issue might cause confusion in the output data.

larrybabb commented 2 years ago

Sounds good @theferrit32. I still think it's wise to make sure Steven H is aware of the issue and how it may impact the output as compared to past reports.

theferrit32 commented 2 years ago

This has been resolved as of bbd7af4418750a4a34fbbe83f906c61efa6b0b56. This also makes it so that output will include all subpopulations which had the maximum filtered allele frequency. There are some cases in gnomad where two populations have the same 95% confidence FAF and their value is the max.

The main changes for this were added in the add_popmax_af function of hgvs-annotate-gnomad-clinvar.ipynb

At the end before export, the notebook just concatenates with commas. In most cases there is only one popmax.