Ensembl / ensembl-vep

The Ensembl Variant Effect Predictor predicts the functional effects of genomic variants
https://www.ensembl.org/vep
Apache License 2.0

1000 genomes project custom annotation Issue #1683

Closed CASTANYMIQUEL closed 1 month ago

CASTANYMIQUEL commented 1 month ago

Dear all,

I'm trying to run VEP with custom annotations from public databases such as ClinVar, COSMIC and dbSNP, and had no problem with those. However, when I add a custom annotation for 1000 Genomes Project data, I get the following error: Input file is not bgzipped. I tried bgzipping and indexing the files before running the script, but the same message appears. Without the 1000genomes custom parameter I am able to run the script with either unzipped or bgzipped VCF files.

For some of the other databases I had to manually change the #CHROM column (i.e. 1, 2, ... X to chr1, chr2, ... chrX), and I had to do the same for 1000genomes to match how chromosomes are named in my data.

I downloaded the data from their website (http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project/release/20190312_biallelic_SNV_and_INDEL/). Since it is split into one file per chromosome, I merged and sorted everything into a single file, then bgzipped it and indexed it with tabix. I also had to manually modify the header to specify the chromosome IDs and lengths. I also tried running the script with a single chromosome, but the result is the same.

Additional information

System

Full VEP command line

singularity run /apps/VEP/111/ensembl-vep_latest.sif vep \
  --input_file $bs.vcf.gz \
  --output_file mutect_vep/$bs.vcf \
  --format vcf --vcf --symbol \
  --species homo_sapiens \
  --fasta $genome \
  --offline --cache --dir_cache $cache \
  --assembly GRCh38 \
  --custom file=$clinvar,short_name=ClinVar,format=vcf,type=exact,coords=0,fields=CLNSIG%CLNREVSTAT%CLNDN \
  --custom file=$cosmic,short_name=COSMIC,format=vcf,type=exact,coords=0,fields=LEGACY_ID%AA%HGVSC%HGVSP%HGVSG%SAMPLE_COUNT%IS_CANONICAL%TIER%SO_TERM \
  --custom file=$dbsnp,short_name=dbSNP,format=vcf,type=exact,coords=0,fields=RV \
  --custom file=$1000g,short_name=1000genomes,format=vcf,type=exact,coords=0,fields=AF%EAS_AF%EUR_AF%AFR_AF%AMR_AF%SAS_AF

Full error message

ERROR: Input file is not bgzipped, cannot use tabix at /opt/vep/src/ensembl-vep/Bio/EnsEMBL/IO/TabixParser.pm line 52.
Bio::EnsEMBL::IO::TabixParser::open("Bio::EnsEMBL::IO::Parser::VCF4Tabix", "000g") called at /opt/vep/src/ensembl-vep/Bio/EnsEMBL/IO/Parser/VCF4Tabix.pm line 53
Bio::EnsEMBL::IO::Parser::VCF4Tabix::open("Bio::EnsEMBL::IO::Parser::VCF4Tabix", "000g") called at /opt/vep/src/ensembl-vep/modules/Bio/EnsEMBL/VEP/AnnotationSource/File/VCF.pm line 153
Bio::EnsEMBL::VEP::AnnotationSource::File::VCF::parser(Bio::EnsEMBL::VEP::AnnotationSource::File::VCF=HASH(0x56529de85540)) called at /opt/vep/src/ensembl-vep/modules/Bio/EnsEMBL/VEP/AnnotationSource/File.pm line 362
Bio::EnsEMBL::VEP::AnnotationSource::File::valid_chromosomes(Bio::EnsEMBL::VEP::AnnotationSource::File::VCF=HASH(0x56529de85540)) called at /opt/vep/src/ensembl-vep/modules/Bio/EnsEMBL/VEP/BaseRunner.pm line 444
Bio::EnsEMBL::VEP::BaseRunner::valid_chromosomes(Bio::EnsEMBL::VEP::Runner=HASH(0x565299ce8f10)) called at /opt/vep/src/ensembl-vep/modules/Bio/EnsEMBL/VEP/Runner.pm line 802
Bio::EnsEMBL::VEP::Runner::get_Parser(Bio::EnsEMBL::VEP::Runner=HASH(0x565299ce8f10)) called at /opt/vep/src/ensembl-vep/modules/Bio/EnsEMBL/VEP/Runner.pm line 829
Bio::EnsEMBL::VEP::Runner::get_InputBuffer(Bio::EnsEMBL::VEP::Runner=HASH(0x565299ce8f10)) called at /opt/vep/src/ensembl-vep/modules/Bio/EnsEMBL/VEP/Runner.pm line 136
Bio::EnsEMBL::VEP::Runner::init(Bio::EnsEMBL::VEP::Runner=HASH(0x565299ce8f10)) called at /opt/vep/src/ensembl-vep/modules/Bio/EnsEMBL/VEP/Runner.pm line 200
Bio::EnsEMBL::VEP::Runner::run(Bio::EnsEMBL::VEP::Runner=HASH(0x565299ce8f10)) called at /opt/vep/src/ensembl-vep/vep line 46

I would highly appreciate it if anyone could help me find a solution.

Thank you very much

olaaustine commented 1 month ago

Hi @CASTANYMIQUEL,

I hope this finds you well and that you are having a good day. I have not been able to reproduce this error using my own input file and the 1000 Genomes chr1 VCF file from the link shared above. To debug this issue, could you run VEP without any of the custom options and, if possible, share your VCF input file?

Thank you very much,
Ola
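As a complementary debugging step, each path handed to --custom can be checked before VEP is started. The sketch below is a hypothetical helper (check_custom_vcf is not part of VEP). It assumes a bgzipped file begins with the BGZF header bytes 1f 8b 08 04 and that the tabix index sits next to the file as .tbi or .csi. It reports a missing file separately from a file that is present but not bgzipped, which is useful here because the stack trace shows VEP being given the path "000g".

```shell
# Hypothetical pre-flight check for a --custom VCF; not part of VEP.
# Return codes: 0 = looks usable, 1 = path empty or missing,
#               2 = not BGZF-compressed, 3 = no tabix index found.
check_custom_vcf() {
  if [ -z "$1" ] || [ ! -f "$1" ]; then
    echo "missing file: '$1'" >&2
    return 1
  fi
  # Read the first four bytes as hex; BGZF blocks start with 1f 8b 08 04.
  magic=$(od -An -tx1 -N4 "$1" | tr -d ' \n')
  if [ "$magic" != "1f8b0804" ]; then
    echo "not bgzipped: $1" >&2
    return 2
  fi
  # tabix writes the index next to the file as .tbi (or .csi).
  if [ ! -f "$1.tbi" ] && [ ! -f "$1.csi" ]; then
    echo "no tabix index: $1" >&2
    return 3
  fi
  return 0
}
```

Calling it once per annotation source, for example check_custom_vcf "$clinvar", before launching the container would surface an empty or mistyped variable as "missing file" instead of a bgzip error.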

CASTANYMIQUEL commented 1 month ago

Dear @olaaustine,

Thank you for your interest in our problem. We have just found the error and the script now works well. There was an issue in the variable declaration for the custom file. We are not sure why our bash interpreter did not flag it, but we eventually spotted it.

Thank you very much.

Miquel
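For readers who hit the same message, one likely reading of the variable-declaration issue follows from the "000g" path in the stack trace: shell variable names cannot begin with a digit, so $1000g is parsed as $1 (the first positional parameter) followed by the literal text 000g. The sketch below reproduces that expansion. The name kg_vcf and its path are hypothetical, and the thread does not confirm which fix was actually applied.

```shell
# "$1000g" is read as "$1" followed by "000g". Inside this function
# $1 is the function's first argument, which is empty when none is passed.
expand_bad() {
  printf '%s\n' "$1000g"
}
bad=$(expand_bad)

# A line such as  1000g=/some/path  is not an assignment either: because
# the name starts with a digit, the shell tries to run it as a command.

# A valid name uses letters, digits and underscores and does not start
# with a digit. The path is a placeholder.
kg_vcf="/path/to/1000genomes.vcf.gz"
good="file=${kg_vcf},short_name=1000genomes"

printf 'bad:  %s\n' "$bad"    # prints "bad:  000g"
printf 'good: %s\n' "$good"
```

Renaming the variable and writing it as ${kg_vcf} in the --custom argument keeps the expansion unambiguous.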