broadinstitute / genetic-prevalence-estimator

https://genie.broadinstitute.org/
BSD 3-Clause "New" or "Revised" License

Remove any subpopulations with <2,000 alleles #118

Closed sambaxter closed 1 year ago

sambaxter commented 2 years ago

We should only include subpopulations if they have more than 2,000 alleles. In v2 the only subpopulation this affects is Japanese, but in v4 it will likely affect more populations (and Japanese will likely have more than 2,000).

nawatts commented 2 years ago

I thought this was done in #48. But the way it's implemented, I think it's only filtering populations in v3 (I don't think v2 has the freq_sample_count global). Also, it's using 1000 as the threshold instead of 2000.

https://github.com/broadinstitute/aggregate-frequency-calculator/blob/e574c4eff177c99835ba1455956eb1c7c3a8b3ad/data-pipelines/prepare_gnomad_variants.py#L113-L116

sambaxter commented 2 years ago

The recommendation is 1,000 individuals or 2,000 alleles - is the sample count individual or allele?

I can see Japanese in the latest list

We can do a hard filter on Japanese for v2 if that's easiest. For v4 they should have the freq_sample_count if it's in v3.

nawatts commented 2 years ago

Oh, I forgot about individuals vs alleles.

Also, how should it work with exomes/genomes? 2000 alleles in exomes + genomes combined? Or filter each of exomes/genomes individually?

sambaxter commented 2 years ago

That is a very good question, and I hadn't thought about it. I think we should start with 2,000 alleles in exomes + genomes combined for now. And this will be at the population level, right (i.e. a population has to have more than 2,000 alleles in total in that particular version to be listed in the estimates)? I was working on the FAQ, and it made me wonder whether we need a flag for when a variant has an AN of less than 2,000 alleles in any population, whether due to coverage or to being genomes only. I don't want it to get cluttered with too many flags, but I do see a utility for this one.
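
To make the population-level rule concrete, here is a minimal sketch in plain Python (not the pipeline's actual Hail code). It assumes per-population sample counts are available as dictionaries for exomes and genomes, converts individuals to alleles as 2 per individual (autosomal, diploid), and keeps a population when the combined count reaches the threshold. The function name, the population labels, and the counts are all hypothetical, and treating exactly 2,000 alleles as passing follows the issue title ("<2,000" is removed) rather than anything decided here.

```python
MIN_ALLELES = 2000  # threshold discussed in this thread; 1,000 diploid individuals


def populations_to_include(exome_samples, genome_samples, min_alleles=MIN_ALLELES):
    """Return populations whose combined exome + genome allele count meets the threshold.

    Both arguments map a population label to a number of individuals.
    """
    included = []
    for pop in sorted(set(exome_samples) | set(genome_samples)):
        n_individuals = exome_samples.get(pop, 0) + genome_samples.get(pop, 0)
        n_alleles = 2 * n_individuals  # assumption: autosomal, two alleles per individual
        if n_alleles >= min_alleles:
            included.append(pop)
    return included


# Hypothetical counts, for illustration only (not real gnomAD numbers).
exomes = {"pop_a": 800, "pop_b": 40, "pop_c": 600}
genomes = {"pop_a": 300, "pop_b": 10, "pop_c": 400}
print(populations_to_include(exomes, genomes))  # ['pop_a', 'pop_c']
```

In this made-up example, pop_c only passes because exomes and genomes are combined; filtering each data type on its own would drop it.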

nawatts commented 1 year ago

I think I may have been going about this the wrong way. I removed populations with less than 1000 individuals in gnomAD. But do we want to do this filter at the variant level? Exclude variants from calculations for a population if the variant has an allele number less than 2000 in that population?

sambaxter commented 1 year ago

You were doing it right. We should remove populations with less than 1000 individuals in gnomAD. I am wondering if we should also have a flag at the variant level if a variant has an allele number less than 2000 in a particular population. Typically these are genomes only, which we already have a flag for, but sometimes it can be just a variant that is in a low coverage region. But if that would be too messy I don't think it's critical.
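
For reference, the variant-level flag described above could look something like the following sketch. It is plain Python under stated assumptions, not code from the repository: each variant is assumed to carry a per-population allele number (AN) for exomes and for genomes, the two are summed, and a population is flagged when the combined AN is below 2,000. The function name and the example numbers are hypothetical.

```python
def low_an_populations(exome_an, genome_an, min_alleles=2000):
    """Return populations in which this variant's combined AN is below the threshold.

    Both arguments map a population label to the variant's allele number (AN)
    in that data type; a missing population is treated as AN = 0.
    """
    flagged = []
    for pop in sorted(set(exome_an) | set(genome_an)):
        combined_an = exome_an.get(pop, 0) + genome_an.get(pop, 0)
        if combined_an < min_alleles:
            flagged.append(pop)
    return flagged


# Hypothetical variant: not covered in exomes for pop_a, well covered for pop_b.
print(low_an_populations({"pop_b": 15000}, {"pop_a": 1200, "pop_b": 3000}))  # ['pop_a']
```

A flag like this leaves the variant in the calculation and only annotates it, which matches the suggestion of a flag rather than a hard filter.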

--

Samantha Baxter, MS, CGC

Associate Director, Genetic and Genomic Data Sharing

Licensed Genetic Counselor

nawatts commented 1 year ago

Ok, in that case it sounds like I need to add in the sample count information for gnomAD v2 (the v2 Hail Table doesn't have the freq_sample_count global like v3 does).

https://github.com/broadinstitute/aggregate-frequency-calculator/blob/6df59ce066806a37b4be6c55dc2da1890cd94204/data-pipelines/prepare_gnomad_variants.py#L103-L122

And update how the populations are collected when there's both exomes and genomes. Currently, they are filtered independently.

https://github.com/broadinstitute/aggregate-frequency-calculator/blob/6df59ce066806a37b4be6c55dc2da1890cd94204/data-pipelines/prepare_gnomad_variants.py#L172-L174
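
The first point above (supplying sample counts where the freq_sample_count global is missing) could be sketched roughly as follows. This is plain Python over dictionaries rather than Hail, and it rests on assumptions: that the dataset globals, when present, hold a list of counts under "freq_sample_count" parallel to a list of metadata entries under "freq_meta", and that the fallback counts for v2 come from some externally maintained mapping. The function name, the layout of the entries, and every number below are illustrative only.

```python
def population_sample_counts(dataset_globals, fallback_counts):
    """Map population -> sample count, preferring counts stored with the dataset."""
    counts = dataset_globals.get("freq_sample_count")
    meta = dataset_globals.get("freq_meta")
    if counts is None or meta is None:
        # Dataset version without the global: use externally supplied counts.
        return dict(fallback_counts)
    result = {}
    for entry, n in zip(meta, counts):
        # Keep only plain per-population entries (no sex or subset keys).
        if set(entry) == {"group", "pop"} and entry["group"] == "adj":
            result[entry["pop"]] = n
    return result


# Hypothetical globals in the assumed v3-like layout.
v3_like = {
    "freq_meta": [
        {"group": "adj"},
        {"group": "adj", "pop": "pop_a"},
        {"group": "adj", "pop": "pop_a", "sex": "XX"},
        {"group": "adj", "pop": "pop_b"},
    ],
    "freq_sample_count": [5000, 1500, 700, 300],
}
print(population_sample_counts(v3_like, {}))        # {'pop_a': 1500, 'pop_b': 300}
print(population_sample_counts({}, {"pop_a": 900}))  # {'pop_a': 900}
```

Either way the result has the same shape, so the downstream population filter does not need to know which gnomAD version the counts came from.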

sambaxter commented 1 year ago

If only v3 has it we could do this just for v3 and v4 (and skip v2). Would that be easier?

nawatts commented 1 year ago

> And update how the populations are collected when there's both exomes and genomes. Currently, they are filtered independently.

I think this part will have to be done anyway for v4 (it has both exomes and genomes, right?)

And adding in sample counts for v2 shouldn't be difficult.

sambaxter commented 1 year ago

Ok great. Thank you!
