but overlap the start of the interval in the bed file.
I used tabix in the past, but it was pretty inefficient: I was making one call per interval, and when given a BED file it did not handle index ranges. That was a version of tabix from 7 years ago, though. Hopefully it is better now.
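For reference, a rough sketch of doing the extraction with a single tabix call instead of one call per interval. This assumes a tabix build that supports the `-R` (regions file) and `-h` (print header) options, which current htslib releases document. The file names and the helper `build_tabix_command` are made up for illustration, and whether this behaves well on our BED files still needs to be checked.

```python
import shlex


def build_tabix_command(vcf_gz, bed_path):
    """Build one tabix call that covers every interval in the BED file.

    -h keeps the VCF header in the output.
    -R restricts output to the regions listed in the file; tabix reads it
       as BED (0-based, half-open) when the name ends in .bed.
    Hypothetical helper: it only assembles the argument list, it does not run tabix.
    """
    return ["tabix", "-h", "-R", bed_path, vcf_gz]


# Placeholder file names, not the real pipeline paths.
cmd = build_tabix_command("master.vcf.gz", "targets_padded.bed")
print(shlex.join(cmd))
```

The argument list can be handed to `subprocess.run(cmd, check=True, stdout=...)` once the paths are real. If padded intervals overlap each other, it is probably worth merging them first so a record is not emitted twice.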
Proposed workflow, written out so it's clear in my head before talking to Molly.
1. Per-sample gVCF files are created using a static, super-padded BED file. The pad is 250 base pairs. I think the padded file is actually generated within the pipeline, but the inputs and the pad are static.
2. Joint calling is performed on these gVCFs, using the subject samples combined with the control samples. VQSR is run on the resulting file. This file is deleted after the analysis run has completed, as it is just an intermediate output file.
3. tabix (hopefully) is used to extract the variants included in the target BED file plus the GUI-defined padding, which is a subset of the temporary master VCF file above. This would also contain deletions that start outside the target-plus-pad interval but overlap it. This is the final full VCF file. It does not contain variants that are outside the targets plus pad unless they are overlapping deletions. Note: VQSR cannot be done on this file because it can be a zoom, which is too small to run VQSR on. In addition, this further removes the layer of a master VCF containing all variants.
4. A subject group VCF (the group can be a singleton or some family unit) is created containing only those locations that are variant within the group of subjects (this would be for Emedgene, e.g.). It is extracted from the file in #3, which only contains variants within the target plus pad, plus overlapping deletions.
5. Variants are extracted at the per-sample level, keeping only those loci that are variant for the sample. This is a subset of #4 and would not contain variants outside the target BED plus padding, plus overlapping deletions. This would be the PhenoDB VCF file.
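To check my own understanding of the last three subsetting steps (target plus pad, then group, then sample), here is a small sketch of the logic in plain Python. It is not the pipeline code: the record layout, sample names, interval and pad value are all invented for illustration, and it assumes a record spans `len(REF)` reference bases starting at POS, which is how I expect an overlapping deletion to be picked up.

```python
def pad_interval(start, end, pad):
    """Pad a 0-based, half-open BED interval, clamping the start at 0."""
    return max(0, start - pad), end + pad


def overlaps(pos, ref, interval):
    """True if a VCF record touches the interval.

    pos is the 1-based VCF POS. The record covers len(ref) reference bases,
    so a deletion that starts upstream still matches if it reaches the interval.
    """
    rec_start = pos - 1              # convert to 0-based
    rec_end = rec_start + len(ref)   # half-open end
    start, end = interval
    return rec_start < end and rec_end > start


def is_variant(gt):
    """True if a GT string such as '0/1' or '1|1' has a non-reference allele."""
    alleles = gt.replace("|", "/").split("/")
    return any(a not in ("0", ".") for a in alleles)


def variant_in_any(records, samples):
    """Keep records where at least one of the given samples is variant."""
    return [r for r in records if any(is_variant(r["gt"][s]) for s in samples)]


# Made-up target and GUI pad.
target = pad_interval(1000, 1100, pad=10)   # -> covers 990..1110 (half-open)

# Made-up records from the temporary master VCF.
master = [
    {"id": "del_upstream", "pos": 985, "ref": "ACGTACGTAC",
     "gt": {"kid": "0/0", "mom": "0/1"}},
    {"id": "snv_outside", "pos": 985, "ref": "A",
     "gt": {"kid": "0/1", "mom": "0/0"}},
    {"id": "ctrl_only", "pos": 1020, "ref": "C",
     "gt": {"kid": "0/0", "mom": "0/0"}},
    {"id": "kid_snv", "pos": 1050, "ref": "G",
     "gt": {"kid": "0/1", "mom": "0/0"}},
    {"id": "snv_past_pad", "pos": 1111, "ref": "T",
     "gt": {"kid": "1/1", "mom": "0/1"}},
]

# Target plus pad, including deletions that start outside but overlap.
final_vcf = [r for r in master if overlaps(r["pos"], r["ref"], target)]
# Only loci variant somewhere in the subject group.
group_vcf = variant_in_any(final_vcf, ["kid", "mom"])
# Only loci variant in the one sample.
kid_vcf = variant_in_any(group_vcf, ["kid"])

print([r["id"] for r in final_vcf])
print([r["id"] for r in group_vcf])
print([r["id"] for r in kid_vcf])
```

Each file is a strict subset of the one before it, so nothing outside target plus pad (other than an overlapping deletion) can reach the group or per-sample files.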