Open ldgauthier opened 5 years ago
I'm assuming you will have the subset of samples before creating a GenomicsDBFeatureReader object (and before creating the corresponding Protobuf export configuration object).
More precisely, you are NOT requesting a line-by-line filter similar to:
At pos 100, compute INFO fields etc. including only the samples whose QUAL > 5
At pos 102, compute INFO fields etc. including only the samples whose QUAL > 5
....
Right. I want to subset by sample name, effectively taking a slice of the position-by-sample genotype matrix and computing INFO annotations based only on the kept samples.
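To make the request concrete, here is a minimal sketch of that slice, assuming a toy in-memory matrix rather than anything GenomicsDB actually exposes. The `matrix` layout, the sample names, and the `subset_and_annotate` helper are all hypothetical; AC and AN stand in for the INFO annotations, counted for a single ALT allele only.

```python
from typing import Dict, Iterable, Optional, Tuple

# Hypothetical genotype matrix: position -> sample -> diploid genotype.
# Alleles are indices (0 = REF, 1 = ALT); None marks a no-call.
# This is an illustration only, not a GenomicsDB data structure.
Genotype = Tuple[Optional[int], Optional[int]]
matrix: Dict[int, Dict[str, Genotype]] = {
    100: {"S1": (0, 1), "S2": (1, 1), "S3": (0, 0)},
    102: {"S1": (0, 0), "S2": (0, 1), "S3": (None, None)},
}


def subset_and_annotate(
    matrix: Dict[int, Dict[str, Genotype]], keep: Iterable[str]
) -> Dict[int, dict]:
    """Slice the matrix down to `keep`, then compute AC/AN per position."""
    keep_set = set(keep)
    result = {}
    for pos, by_sample in matrix.items():
        # The sample set is fixed up front and applied at every position.
        kept = {s: gt for s, gt in by_sample.items() if s in keep_set}
        called = [a for gt in kept.values() for a in gt if a is not None]
        result[pos] = {
            "samples": sorted(kept),
            "AN": len(called),                         # called alleles
            "AC": sum(1 for a in called if a == 1),    # ALT alleles
        }
    return result


info = subset_and_annotate(matrix, ["S1", "S3"])
```

The same sample list is used at every position, which is what separates this from the per-line QUAL filter described above.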
On Mon, Mar 4, 2019, 8:52 PM Karthik Gururaj notifications@github.com wrote:
I'm assuming you will have the subset of samples before creating a GenomicsDBFeatureReader object (and before creating the corresponding Protobuf export configuration object).
More precisely, you are NOT requesting a line-by-line filter similar to:
At pos 100, compute INFO fields etc. including only the samples whose QUAL > 5
At pos 102, compute INFO fields etc. including only the samples whose QUAL > 5
....
If the sample name does not exist in the vid map (JSON file), should we throw an error or print a warning and continue?
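The two options might look roughly like this. It is only a sketch of the policy choice, not of GenomicsDB's behavior: `resolve_samples`, the `strict` flag, and the `known` set (standing in for whatever sample names the JSON file lists) are hypothetical.

```python
import warnings
from typing import Iterable, List, Set


def resolve_samples(
    requested: Iterable[str], known: Set[str], strict: bool = True
) -> List[str]:
    """Return the requested samples that exist in `known`.

    strict=True  -> raise on any unknown name (fail fast).
    strict=False -> warn about unknown names and continue without them.
    """
    requested = list(requested)
    missing = [s for s in requested if s not in known]
    if missing and strict:
        raise ValueError(f"Unknown sample(s): {', '.join(missing)}")
    if missing:
        warnings.warn(f"Ignoring unknown sample(s): {', '.join(missing)}")
    return [s for s in requested if s in known]
```

Failing fast catches typos in a sample list before a long query runs, while the lenient mode suits a shared sample list used against several workspaces.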
This feature has been implemented in GenomicsDB by @kgururaj and is part of the 1.1.0.1 GenomicsDB release. PR #5970 will bring in 1.1.0.1 for this feature.
@nalinigans I'm not aware of any existing functionality to subset samples in a GenomicsDB query, but that would be very helpful. For the use case mentioned in #5569, some of the samples would ideally be excluded because they fail QC. For sites that are called in very few samples, doing the subsetting during the query will give the most accurate INFO field data. For example, if two samples have a variant, but then we remove one of them using SelectVariants, then RAW_MQ, DP, and other annotations cannot be corrected and will still reflect a combination of the two samples. It would be especially helpful if it could also be applied to the sites-only query mode (which I think will likely come for free if it works in the standard query mode).
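A small numeric sketch of the two-sample case, under the simplifying assumption that the site-level DP is just the sum of per-sample depths. The depths and the `info_dp` helper are made up for illustration.

```python
# Hypothetical per-sample read depths at one site where both samples
# carry the variant. Sample "B" is the one that fails QC.
sample_dp = {"A": 30, "B": 20}


def info_dp(depths: dict, samples: list) -> int:
    """Site-level DP, simplified here to the sum of per-sample depths."""
    return sum(depths[s] for s in samples)


# Subsetting after the query: INFO was computed over both samples, and
# dropping "B" from the genotype columns leaves that value unchanged.
stale_dp = info_dp(sample_dp, ["A", "B"])

# Subsetting during the query: INFO is computed from kept samples only.
query_time_dp = info_dp(sample_dp, ["A"])

print(stale_dp, query_time_dp)
```

With these made-up depths, the value computed over both samples is 50, while the value over the kept sample alone is 30.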