Ensembl / ensembl-vep

The Ensembl Variant Effect Predictor predicts the functional effects of genomic variants
https://www.ensembl.org/vep
Apache License 2.0

likely cause of `gzip: stdout: Broken pipe` #1100

Closed keiranmraine closed 2 years ago

keiranmraine commented 2 years ago

Describe the issue

Same as #720 (gzip: stdout: Broken pipe), however I am reporting the likely cause:

https://github.com/Ensembl/ensembl-vep/blob/5aa4f5c8bdcdeb4dc27ba114497601dacdc66e73/modules/Bio/EnsEMBL/VEP/Utils.pm#L432

It is likely that the file handle is destroyed before the gzip stream is exhausted or closed, which would produce this type of error. However, it is difficult to create a minimal reproducer, as the trigger is probably tied to the complexity of object destruction.

It's possible that making IO::Uncompress::Gunzip the preferred method would remove this issue; however, I would anticipate a performance reduction.
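To illustrate the in-process alternative, here is a minimal sketch in Python rather than VEP's Perl. It uses Python's standard `gzip` module as a stand-in for IO::Uncompress::Gunzip; the helper names and the sample file are hypothetical. Because decompression happens inside the reading process, there is no child `gzip` and no pipe that could break when the handle is closed early. It says nothing about the relative speed of the two approaches.

```python
import gzip
import os
import tempfile


def make_sample(n_lines=1000):
    """Write a small gzip-compressed, VCF-like file (hypothetical test data)."""
    fd, path = tempfile.mkstemp(suffix=".vcf.gz")
    os.close(fd)
    with gzip.open(path, "wt") as out:
        for i in range(n_lines):
            out.write(f"chr1\t{i + 1}\t.\tA\tT\n")
    return path


def first_line_in_process(path):
    """Read one line and close the handle long before the stream is exhausted.

    The gzip module decompresses in this process, so closing early simply
    discards the decompressor state; no external gzip is left writing.
    """
    with gzip.open(path, "rb") as handle:
        return handle.readline()


if __name__ == "__main__":
    sample = make_sample()
    try:
        print(first_line_in_process(sample))
    finally:
        os.remove(sample)
```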

(@ThomasSClarke for info)

Additional information


System

species homo_sapiens
assembly        GRCh38
sift    b
polyphen        b
source_polyphen 2.2.2
source_sift     sift5.2.2
source_genebuild        2014-07
source_gencode  GENCODE 39
source_assembly GRCh38.p13
variation_cols  chr,variation_name,failed,somatic,start,end,allele_string,strand,minor_allele,minor_allele_freq,clin_sig,phenotype_or_disease,clin_sig_allele,pubmed,var_synonyms,AFR,AMR,EAS,EUR,SAS,AA,EA,gnomAD,gnomAD_AFR,gnomAD_AMR,gnomAD_ASJ,gnomAD_EAS,gnomAD_FIN,gnomAD_NFE,gnomAD_OTH,gnomAD_SAS
source_COSMIC   92
source_HGMD-PUBLIC      20204
source_ClinVar  202106
source_dbSNP    154
source_1000genomes      phase3
source_ESP      V2-SSA137
source_gnomAD   r2.1.1
var_type        tabix
regulatory      1
cell_types      A549,A673,B,B_(PB),CD14+_monocyte_(PB),CD14+_monocyte_1,CD4+_CD25+_ab_Treg_(PB),CD4+_ab_T,CD4+_ab_T_(PB)_1,CD4+_ab_T_(PB)_2,CD4+_ab_T_(Th),CD4+_ab_T_(VB),CD8+_ab_T_(CB),CD8+_ab_T_(PB),CMP_CD4+_1,CMP_CD4+_2,CMP_CD4+_3,CM_CD4+_ab_T_(VB),DND-41,EB_(CB),EM_CD4+_ab_T_(PB),EM_CD8+_ab_T_(VB),EPC_(VB),GM12878,H1-hESC_2,H1-hESC_3,H9_1,HCT116,HSMM,HUES48,HUES6,HUES64,HUVEC,HUVEC-prol_(CB),HeLa-S3,HepG2,K562,M0_(CB),M0_(VB),M1_(CB),M1_(VB),M2_(CB),M2_(VB),MCF-7,MM.1S,MSC,MSC_(VB),NHLF,NK_(PB),NPC_1,NPC_2,NPC_3,PC-3,PC-9,SK-N.,T_(PB),Th17,UCSF-4,adrenal_gland,aorta,astrocyte,bipolar_neuron,brain_1,cardiac_muscle,dermal_fibroblast,endodermal,eosinophil_(VB),esophagus,foreskin_fibroblast_2,foreskin_keratinocyte_1,foreskin_keratinocyte_2,foreskin_melanocyte_1,foreskin_melanocyte_2,germinal_matrix,heart,hepatocyte,iPS-15b,iPS-20b,iPS_DF_19.11,iPS_DF_6.9,keratinocyte,kidney,large_intestine,left_ventricle,leg_muscle,lung_1,lung_2,mammary_epithelial_1,mammary_epithelial_2,mammary_myoepithelial,monocyte_(CB),monocyte_(VB),mononuclear_(PB),myotube,naive_B_(VB),neuron,neurosphere_(C),neurosphere_(GE),neutro_myelocyte,neutrophil_(CB),neutrophil_(VB),osteoblast,ovary,pancreas,placenta,psoas_muscle,right_atrium,right_ventricle,sigmoid_colon,small_intestine_1,small_intestine_2,spleen,stomach_1,stomach_2,thymus_1,thymus_2,trophoblast,trunk_muscle
source_regbuild 1.0

Full VEP command line

vep --vcf --cache --dir vep_cache \
--custom cosmic.vep/CosmicCoding_Noncoding.normal.counts.vcf.gz,Cosmic,vcf,exact,0 \
--custom clinvar_20220103.chr.canonical.vcf.gz,ClinVar,vcf,exact,0,CLNSIG,CLNREVSTAT,CLNDN \
--custom dbsnp.vep/dbSNP155.GRCh38.GCF_000001405.39.mod.vcf.gz,dbSNP,vcf,exact,0,RS,dbSNPBuildID,SSR,PSEUDOGENEINFO,VC,FREQ \
--custom gnomad.genomes.v3.1.short.vcf.gz,gnomAD,vcf,exact,0,FLAG,AF \
DATA.vcf.gz \
-o DATA.vep.vcf.gz

Full error message

Warning only:

gzip: stdout: Broken pipe

nakib103 commented 2 years ago

Hello @keiranmraine ,

Thank you for looking into this and proposing a solution!

I have successfully reproduced what you described. We are looking into this further to see whether we can integrate this solution into VEP.

Best regards, Nakib

nakib103 commented 2 years ago

Hello @keiranmraine ,

We have an update regarding this issue.

Firstly, the core reason we are seeing this issue is MySQL. It is MySQL's default behavior to mask SIGPIPE:

To avoid aborting the program when a connection terminates, MySQL blocks SIGPIPE on the first call to mysql_library_init(), mysql_init(), or mysql_connect().

gzip does not cope well with that. It does not seem to be a problem as long as Perl handles it properly (some languages, such as Python, may not tolerate it at all). I have tested with several supported Perl versions, and the zipped and unzipped inputs give the same result, so this warning is a false alarm.

We could make gzip the least preferred method, but that would be counter-productive because of the performance cost. The warning can easily be silenced by unmasking SIGPIPE, but the caveat is that in VEP a database connection may also be initiated later on (e.g. in some plugins). For now, we are keeping gzip as the second preferred method.
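The mechanism can be reproduced outside VEP. The sketch below is in Python, not VEP's Perl, and involves no MySQL: it assumes a standard CPython interpreter (which ignores SIGPIPE in its own process) and a `gzip` binary on the PATH. The helper names and sample data are hypothetical. With `restore_signals=False`, the child `gzip` inherits the ignored SIGPIPE, much as a child of a process that has initialised the MySQL client library would. With the default, SIGPIPE is reset in the child. In both cases the reader takes one line and closes the pipe before the stream is exhausted.

```python
import gzip
import os
import subprocess
import tempfile


def make_sample(n_lines=200_000):
    """Write a gzip file whose decompressed size far exceeds a pipe buffer."""
    fd, path = tempfile.mkstemp(suffix=".vcf.gz")
    os.close(fd)
    with gzip.open(path, "wt") as out:
        for i in range(n_lines):
            out.write(f"chr1\t{i + 1}\t.\tA\tT\n")
    return path


def read_first_line(path, inherit_ignored_sigpipe):
    """Read one line from `gzip -dc path`, then close the pipe early.

    Returns (first_line, gzip_stderr, gzip_returncode).
    """
    proc = subprocess.Popen(
        ["gzip", "-dc", path],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        # True (the default) resets SIGPIPE to its default action in the
        # child; False lets the child inherit the parent's ignored SIGPIPE.
        restore_signals=not inherit_ignored_sigpipe,
    )
    line = proc.stdout.readline()
    proc.stdout.close()  # the reader goes away while gzip is still writing
    err = proc.stderr.read()
    proc.stderr.close()
    return line, err, proc.wait()


if __name__ == "__main__":
    sample = make_sample()
    try:
        for ignored in (True, False):
            line, err, code = read_first_line(sample, ignored)
            print(f"SIGPIPE ignored in child: {ignored}")
            print(f"  first line: {line!r}")
            print(f"  gzip stderr: {err!r}, exit status: {code}")
    finally:
        os.remove(sample)
```

When SIGPIPE is ignored, gzip's write fails with EPIPE and gzip reports the failure on stderr. With the default disposition, the signal terminates gzip without a message. The line that was read is identical either way, which is consistent with the warning being harmless for the data actually consumed.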

Thanks again for pointing out the issue.

Best regards, Nakib