Ensembl / ensembl-vep

The Ensembl Variant Effect Predictor predicts the functional effects of genomic variants
https://www.ensembl.org/vep
Apache License 2.0

INSTALL.pl -f GRCh37 fasta not bgzipped #1639

Open davmlaw opened 3 months ago

davmlaw commented 3 months ago

The GRCh37 fasta file downloaded by the install process is gzipped, not bgzipped:

perl INSTALL.pl --AUTO f --ASSEMBLY GRCh37 --SPECIES homo_sapiens --CACHEDIR ${VEP_CACHE}

Running VEP without passing --fasta gives:

/data/annotation/VEP/ensembl-vep/vep -i /home/dlawrence/localwork/variantgrid/data/annotation_dump/fake.vcf -o /home/dlawrence/localwork/variantgrid/data/annotation_dump/fake.vep_annotated_GRCh37.vcf.gz --cache --dir /data/annotation/VEP/vep_cache --assembly GRCh37 --offline --use_given_ref --vcf --compress_output gzip --force_overwrite --no_stats --hgvs
Smartmatch is experimental at /data/annotation/VEP/ensembl-vep/modules/Bio/EnsEMBL/VEP/AnnotationSource/File.pm line 472.
[E::fai_build3_core] Cannot index files compressed with gzip, please use bgzip
Segmentation fault (core dumped)

Can you please change the GRCh37 file to be bgzipped?

Given that bgzip output is backwards compatible with gzip and only slightly larger, perhaps all Ensembl genome fasta files on the website/FTP could simply be bgzipped?

jamie-m-a commented 3 months ago

Hi @davmlaw

I agree. It's a slightly unfortunate legacy: some software libraries did not properly support bgzip files, so we still use gzip for the files we release. I am investigating the option of at least providing bgzipped alternatives for some of our more heavily used fasta files, though implementing this will take a little time.

In the meantime you can convert the fasta like so:

gunzip -c fasta.gz | bgzip > fasta.bgz
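
To check whether a given file actually needs this conversion, one option is to look at the first bytes of the file. This is a rough sketch and not part of VEP or INSTALL.pl. The function names is_bgzf and ensure_bgzf are made up for illustration. It assumes the BGZF 'BC' extra subfield comes first in the gzip header, which is how bgzip normally writes it, and that bgzip (from htslib) is on the PATH for the conversion step.

```shell
# Sketch only: is_bgzf and ensure_bgzf are hypothetical helper names,
# not part of VEP, INSTALL.pl, or htslib.

# Return 0 if the file starts with a BGZF header, 1 otherwise.
# A BGZF block begins with 1f 8b 08 04 (gzip magic, deflate, FEXTRA set)
# and, as written by bgzip, has the subfield id 'B' 'C' (42 43) at bytes 12-13.
is_bgzf() {
    # First 16 bytes of the file as a single 32-character hex string.
    bgzf_hdr=$(od -An -tx1 -N 16 "$1" | tr -d ' \n')
    case "$bgzf_hdr" in
        1f8b0804????????????????4243*) return 0 ;;
        *) return 1 ;;
    esac
}

# Write a BGZF copy of a gzip file to a new path, leaving the input untouched.
# Does nothing if the input already looks like BGZF.
ensure_bgzf() {
    if is_bgzf "$1"; then
        echo "$1 already looks like BGZF" >&2
        return 0
    fi
    gunzip -c "$1" | bgzip > "$2"
}

# Example (file names are placeholders, as in the command above):
#   ensure_bgzf fasta.gz fasta.bgz
```

Writing to a separate output path keeps the original download intact in case the conversion is interrupted.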

Sorry for the inconvenience!

davmlaw commented 3 months ago

Perhaps you could also release a .bgz to go with each .gz you release?

Then VEP could download the .bgz?

jamie-m-a commented 3 months ago

Hi @davmlaw

Yes, that would be my preference, but it will take a little time to implement.