NBISweden / GenErode

GitHub repository for GenErode, a Snakemake pipeline for the analysis of whole-genome sequencing data from historical and modern samples to study patterns of genome erosion.
GNU General Public License v3.0

compressing vcf file (rule remove_CpG_vcf - 8.1_vcf_CpG_filtering.smk) #19

Closed by ndussex 2 years ago

ndussex commented 2 years ago

Hi,

In the '8.1_vcf_CpG_filtering.smk' file, the rule 'remove_CpG_vcf' uses bedtools intersect to generate a VCF file without CpG sites, as follows:

bedtools intersect -a {input.vcf} -b {input.bed} -header -sorted -g {input.genomefile} > {output.filtered} 2> {log}

However, the {output.filtered} file is not compressed and is therefore very large, so a project directory can fill up very quickly.

Would it be possible to compress this VCF file to save space, for example with something like the command below, which would generate a *.vcf.gz file?

bedtools intersect -a {input.vcf} -b {input.bed} -header -sorted -g {input.genomefile} | bgzip -c > {output.filtered} 2> {log}

Much appreciated, Nic

verku commented 2 years ago

Hi! The bedtools container used by the pipeline does not include bgzip, so we'll create a new Docker container with both bedtools and tabix/bgzip to make it possible to compress the file.
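
One detail worth noting about the piped command proposed above: in a shell pipeline, a trailing `2> {log}` applies only to the last command (here bgzip), so any bedtools messages would not reach the log. The sketch below illustrates this with stand-ins, not with the pipeline's actual rule: `producer` is a hypothetical function that plays the role of `bedtools intersect`, and plain `gzip` stands in for `bgzip` so the snippet runs without bedtools or htslib installed.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical stand-in for `bedtools intersect ...`: writes VCF-like
# records to stdout and a diagnostic message to stderr.
producer() {
    printf '##fileformat=VCFv4.2\n#CHROM\tPOS\n1\t100\n'
    echo "producer warning" >&2
}

workdir=$(mktemp -d)

# Variant A: redirect at the end of the pipeline.
# Only the compressor's stderr goes to a.log; the producer's message
# goes to the script's own stderr instead.
producer | gzip -c > "$workdir/a.vcf.gz" 2> "$workdir/a.log"

# Variant B: redirect before the pipe.
# The producer's stderr is captured in b.log; stdout is still compressed.
producer 2> "$workdir/b.log" | gzip -c > "$workdir/b.vcf.gz"
```

Applied to the command from this thread, and assuming bgzip is available in the container, variant B would read `bedtools intersect -a {input.vcf} -b {input.bed} -header -sorted -g {input.genomefile} 2> {log} | bgzip -c > {output.filtered}`.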