broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk

GDB feature request: sample subsetting for queries #5570

Open ldgauthier opened 5 years ago

ldgauthier commented 5 years ago

@nalinigans I'm not aware of any existing functionality to subset samples in a GenomicsDB query, but it would be very helpful. For the use case mentioned in #5569, some of the samples would ideally be excluded because they fail QC. For sites that are called in very few samples, subsetting during the query gives the most accurate INFO field data. For example, if two samples have a variant and we then remove one of them using SelectVariants, the RAW_MQ, DP, and other annotations can't be corrected and will still reflect a combination of the two samples. It would be especially helpful if the subsetting could also be applied to the sites-only query mode (which I think will likely come for free if it works in the standard query mode).
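To make the concern concrete, here is a toy sketch of the arithmetic only. It is not GATK or GenomicsDB code: the sample names, depth values, and the `site_dp` helper are all made up for illustration, and DP is simplified to a plain sum of per-sample depths.

```python
# Toy per-sample depths at a single site (hypothetical names and values).
per_sample_dp = {"S1": 30, "S2": 12, "S3": 0}


def site_dp(depths, keep=None):
    """Sum per-sample depth over the kept samples (all samples if keep is None)."""
    names = depths.keys() if keep is None else keep
    return sum(depths[name] for name in names)


keep = ["S1", "S3"]  # assume S2 failed QC and should be excluded

# Aggregate computed once over every sample: a later sample removal
# cannot undo S2's contribution if only this single number is stored.
dp_all = site_dp(per_sample_dp)

# Aggregate computed only over the kept samples, as a query-time subset would.
dp_subset = site_dp(per_sample_dp, keep)

print(dp_all, dp_subset)
```

The gap between the two values is the contribution of the excluded sample, which is what stays baked into a site-level annotation when subsetting happens after aggregation.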

kgururaj commented 5 years ago

I'm assuming you will have the subset of samples before creating a GenomicsDBFeatureReader object (and before creating the corresponding Protobuf export configuration object).

More precisely, you are NOT requesting a line-by-line filter similar to:

- At pos 100, compute INFO fields etc. including only the samples whose QUAL > 5
- At pos 102, compute INFO fields etc. including only the samples whose QUAL > 5
- ...

ldgauthier commented 5 years ago

Right. I want to subset by sample name, effectively taking a slice of the position-by-sample genotype matrix and computing INFO annotations based only on the kept samples.


kgururaj commented 5 years ago

If the sample name does not exist in the vid map (JSON file), should we throw an error or print a warning and continue?
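The two options in this question can be sketched as follows. This is a hypothetical illustration, not the GenomicsDB implementation: `resolve_samples`, its `strict` flag, and the sample names are assumptions, and the thread does not say which behavior was ultimately chosen.

```python
import warnings


def resolve_samples(requested, known, strict=True):
    """Return the requested sample names that exist in `known`.

    With strict=True an unknown name raises KeyError; with strict=False
    a warning is emitted and the unknown names are skipped.
    """
    known_set = set(known)
    missing = [name for name in requested if name not in known_set]
    if missing:
        message = "sample(s) not found: " + ", ".join(missing)
        if strict:
            raise KeyError(message)
        warnings.warn(message)
    # Preserve the caller's order for the names that were found.
    return [name for name in requested if name in known_set]
```

Failing fast catches typos in a sample list before a long query runs, while the lenient mode suits pipelines that reuse one sample list across several workspaces.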

nalinigans commented 5 years ago

This feature has been implemented in GenomicsDB by @kgururaj and is part of the 1.1.0.1 GenomicsDB release. PR #5970 will bring in 1.1.0.1 for this feature.