Ensembl / ensembl-vep

The Ensembl Variant Effect Predictor predicts the functional effects of genomic variants
https://www.ensembl.org/vep
Apache License 2.0
456 stars 151 forks source link

VEP throwing index error even though the input GTF file is indexed using tabix #695

Closed ranijames closed 4 years ago

ranijames commented 4 years ago

Dear All, I am trying to use VEP docker image for predicting protein interactions of my variants(.vcf file). Here is what I tried, within my docker container,

./vep --cache --offline --format vcf --force_overwrite --gtf /opt/vep/.vep/Homo_sapiens.GRCh38.97.gtf.gz --species homo_sapiens --dir_cache /opt/vep/.vep/ --dir_plugins /opt/vep/.vep/Plugins/ --input_file /opt/vep/.vep/ukb_SPB_50k_exome_seq_filtered_for_VEP.vcf.gz --output_file /opt/vep/.vep/my_output.vcf

However, the above command line is throwing the following error,

[E::hts_open_format] Failed to open file /opt/vep/.vep/Homo_sapiens.GRCh38.97.gtf.gz
Couldn't find index for file /opt/vep/.vep/Homo_sapiens.GRCh38.97.gtf.gz at /usr/local/lib/x86_64-linux-gnu/perl/5.26.1/Bio/DB/HTS/Tabix.pm line 53.

The GTF file is sorted and gzipped. And, later indexed using TABIX using the following line,

tabix -p gff Homo_sapiens.GRCh38.97.gtf.gz

I checked whether the compression is correct using,

htsfile Homo_sapiens.GRCh38.97.gtf.gz
Homo_sapiens.GRCh38.97.gtf.gz:  SAM version 1 BGZF-compressed sequence data

I would be really thankful if someone please help me find where I am going wrong.

Thank you!

at7 commented 4 years ago

Dear Alva,

I can reproduce the problem partly. I'm running docker release_97.4. If my index file is missing I get the following message: Couldn't find index for file /opt/vep/.vep/Homo_sapiens.GRCh38.97.chr.gtf.gz at /usr/local/lib/x86_64-linux-gnu/perl/5.26.1/Bio/DB/HTS/Tabix.pm line 52. I then have to sort, bgzip and tabix Homo_sapiens.GRCh38.97.chr.gtf.gz. At the the end I created /opt/vep/.vep/Homo_sapiens.GRCh38.97.chr.gtf.gz.tbi. Running the vep command again after creating the tabix file works fine for me.

Was the tabix -p gff Homo_sapiens.GRCh38.97.gtf.gz command successful and did it create Homo_sapiens.GRCh38.97.gtf.gz.tbi?

Which ensembl tag version are you using for your docker image?

Best wishes, Anja

ranijames commented 4 years ago

Dear Anja,

Thanks for the reply and for testing the case.

The docker version I am using is Version: 19.03.5. The Tag for ensemble VEP image is latest. I have the Homo_sapiens.GRCh38.97.gtf.gz.tbi file in my input folder.

The tabix -p gff Homo_sapiens.GRCh38.97.gtf.gz was successful. As it did not throw any sort of error message. However, I can redo it again. I have deleted the .tbi file and re-ran thetabix command line for generating new index file. And tried running VEP in docker. Now, it is throwing a new error,

vep@44487d84ba8e:~/src/ensembl-vep$ ./vep --cache --offline --format vcf --force_overwrite --gtf /opt/vep/.vep/Homo_sapiens.GRCh38.97.gtf --species homo_sapiens --dir_cache /opt/vep/.vep/ --dir_plugins /opt/vep/.vep/Plugins/ --input_file /opt/vep/.vep/ukb_SPB_50k_exome_seq_filtered_for_VEP.vcf.gz --output_file /opt/vep/.vep/my_output.vcf
ERROR: Input file is not bgzipped, cannot use tabix
 at /opt/vep/src/ensembl-vep/Bio/EnsEMBL/IO/TabixParser.pm line 52.
    Bio::EnsEMBL::IO::TabixParser::open("Bio::EnsEMBL::IO::Parser::GTFTabix", "/opt/vep/.vep/Homo_sapiens.GRCh38.97.gtf", "must_parse_metadata", 0) called at /opt/vep/src/ensembl-vep/Bio/EnsEMBL/IO/Parser/GTFTabix.pm line 50
    Bio::EnsEMBL::IO::Parser::GTFTabix::open("Bio::EnsEMBL::IO::Parser::GTFTabix", "/opt/vep/.vep/Homo_sapiens.GRCh38.97.gtf", "must_parse_metadata", 0) called at /opt/vep/src/ensembl-vep/modules/Bio/EnsEMBL/VEP/AnnotationSource/File/GTF.pm line 101
    Bio::EnsEMBL::VEP::AnnotationSource::File::GTF::parser(Bio::EnsEMBL::VEP::AnnotationSource::File::GTF=HASH(0x55a901501530)) called at /opt/vep/src/ensembl-vep/modules/Bio/EnsEMBL/VEP/AnnotationSource/File.pm line 326
    Bio::EnsEMBL::VEP::AnnotationSource::File::valid_chromosomes(Bio::EnsEMBL::VEP::AnnotationSource::File::GTF=HASH(0x55a901501530)) called at /opt/vep/src/ensembl-vep/modules/Bio/EnsEMBL/VEP/BaseRunner.pm line 386
    Bio::EnsEMBL::VEP::BaseRunner::valid_chromosomes(Bio::EnsEMBL::VEP::Runner=HASH(0x55a8fe207a60)) called at /opt/vep/src/ensembl-vep/modules/Bio/EnsEMBL/VEP/Runner.pm line 791
    Bio::EnsEMBL::VEP::Runner::get_Parser(Bio::EnsEMBL::VEP::Runner=HASH(0x55a8fe207a60)) called at /opt/vep/src/ensembl-vep/modules/Bio/EnsEMBL/VEP/Runner.pm line 818
    Bio::EnsEMBL::VEP::Runner::get_InputBuffer(Bio::EnsEMBL::VEP::Runner=HASH(0x55a8fe207a60)) called at /opt/vep/src/ensembl-vep/modules/Bio/EnsEMBL/VEP/Runner.pm line 131
    Bio::EnsEMBL::VEP::Runner::init(Bio::EnsEMBL::VEP::Runner=HASH(0x55a8fe207a60)) called at /opt/vep/src/ensembl-vep/modules/Bio/EnsEMBL/VEP/Runner.pm line 194
    Bio::EnsEMBL::VEP::Runner::run(Bio::EnsEMBL::VEP::Runner=HASH(0x55a8fe207a60)) called at ./vep line 224

Whether or not the issue could be because I am using my own reference annotation file (the gtf file)?

at7 commented 4 years ago

Let's try compressing and indexing Homo_sapiens.GRCh38.97.gtf again. Could you please: gunzip Homo_sapiens.GRCh38.97.gtf.gz try bgzip Homo_sapiens.GRCh38.97.gtf. It should work if Homo_sapiens.GRCh38.97.gtf is sorted. Otherwise you need to sort Homo_sapiens.GRCh38.97.gtf. I ran: mv Homo_sapiens.GRCh38.97.gtf Homo_sapiens.GRCh38.97.unsorted.gtf (grep ^"#" Homo_sapiens.GRCh38.97.unsorted.gtf; grep -v ^"#" Homo_sapiens.GRCh38.97.unsorted.gtf | sort -k1,1 -k4,4n) | bgzip > Homo_sapiens.GRCh38.97.gtf.gz

tabix -p gff Homo_sapiens.GRCh38.97.gtf.gz

Hopefully after those steps the error message should go away. Let me know how you are getting on.

ranijames commented 4 years ago

Hey, I tried all the steps as you mentioned, as follows,

docker run -t -i -v $HOME/vep_data:/opt/vep/.vep ensemblorg/ensembl-vep

Within the docker container, I ran the following one-liner to run the docker image for VEP tool,

./vep --cache --offline --format vcf --force_overwrite --gtf /opt/vep/.vep/Homo_sapiens.GRCh38.97.gtf --species homo_sapiens --dir_cache /opt/vep/.vep/ --dir_plugins /opt/vep/.vep/Plugins/ --input_file /opt/vep/.vep/ukb_SPB_50k_exome_seq_filtered_for_VEP.vcf.gz --output_file /opt/vep/.vep/my_output.vcf

Unfortunately, the tool agagin throws the same error message,

ERROR: Input file is not bgzipped, cannot use tabix
 at /opt/vep/src/ensembl-vep/Bio/EnsEMBL/IO/TabixParser.pm line 52.
    Bio::EnsEMBL::IO::TabixParser::open("Bio::EnsEMBL::IO::Parser::GTFTabix", "/opt/vep/.vep/Homo_sapiens.GRCh38.97.gtf", "must_parse_metadata", 0) called at /opt/vep/src/ensembl-vep/Bio/EnsEMBL/IO/Parser/GTFTabix.pm line 50
    Bio::EnsEMBL::IO::Parser::GTFTabix::open("Bio::EnsEMBL::IO::Parser::GTFTabix", "/opt/vep/.vep/Homo_sapiens.GRCh38.97.gtf", "must_parse_metadata", 0) called at /opt/vep/src/ensembl-vep/modules/Bio/EnsEMBL/VEP/AnnotationSource/File/GTF.pm line 101
    Bio::EnsEMBL::VEP::AnnotationSource::File::GTF::parser(Bio::EnsEMBL::VEP::AnnotationSource::File::GTF=HASH(0x559a99939b50)) called at /opt/vep/src/ensembl-vep/modules/Bio/EnsEMBL/VEP/AnnotationSource/File.pm line 326
    Bio::EnsEMBL::VEP::AnnotationSource::File::valid_chromosomes(Bio::EnsEMBL::VEP::AnnotationSource::File::GTF=HASH(0x559a99939b50)) called at /opt/vep/src/ensembl-vep/modules/Bio/EnsEMBL/VEP/BaseRunner.pm line 386
    Bio::EnsEMBL::VEP::BaseRunner::valid_chromosomes(Bio::EnsEMBL::VEP::Runner=HASH(0x559a9652ed70)) called at /opt/vep/src/ensembl-vep/modules/Bio/EnsEMBL/VEP/Runner.pm line 791
    Bio::EnsEMBL::VEP::Runner::get_Parser(Bio::EnsEMBL::VEP::Runner=HASH(0x559a9652ed70)) called at /opt/vep/src/ensembl-vep/modules/Bio/EnsEMBL/VEP/Runner.pm line 818
    Bio::EnsEMBL::VEP::Runner::get_InputBuffer(Bio::EnsEMBL::VEP::Runner=HASH(0x559a9652ed70)) called at /opt/vep/src/ensembl-vep/modules/Bio/EnsEMBL/VEP/Runner.pm line 131
    Bio::EnsEMBL::VEP::Runner::init(Bio::EnsEMBL::VEP::Runner=HASH(0x559a9652ed70)) called at /opt/vep/src/ensembl-vep/modules/Bio/EnsEMBL/VEP/Runner.pm line 194
    Bio::EnsEMBL::VEP::Runner::run(Bio::EnsEMBL::VEP::Runner=HASH(0x559a9652ed70)) called at ./vep line 224

It would be really great to know if there is any workarounds for this sprt of index errors. Thank you

at7 commented 4 years ago

I just noticed that you are using --gtf /opt/vep/.vep/Homo_sapiens.GRCh38.97.gtf in you vep command. But it must be --gtf /opt/vep/.vep/Homo_sapiens.GRCh38.97.gtf.gz. Could you please update the --gtf option and try running again?

ranijames commented 4 years ago

Yes, I tried as you mentioned, ./vep --cache --offline --format vcf --force_overwrite --gtf /opt/vep/.vep/Homo_sapiens.GRCh38.97.gtf.gz --species homo_sapiens --dir_cache /opt/vep/.vep/ --dir_plugins /opt/vep/.vep/Plugins/ --input_file /opt/vep/.vep/ukb_SPB_50k_exome_seq_filtered_for_VEP.vcf.gz --output_file /opt/vep/.vep/my_output.vcf

It is now throwing the previous error,

[E::hts_open_format] Failed to open file /opt/vep/.vep/Homo_sapiens.GRCh38.97.gtf.gz
Couldn't find index for file /opt/vep/.vep/Homo_sapiens.GRCh38.97.gtf.gz at /usr/local/lib/x86_64-linux-gnu/perl/5.26.1/Bio/DB/HTS/Tabix.pm line 53.
at7 commented 4 years ago

Could you please check if you can read the file from your directory?: vep@44487d84ba8e:~/src/ensembl-vep tabix -H /opt/vep/.vep/Homo_sapiens.GRCh38.97.gtf.gz

ranijames commented 4 years ago

it says Could not read /opt/vep/.vep/Homo_sapiens.GRCh38.97.gtf.gz My bad the file is in the input directory, I changed the path. Sorry about that. Now, there is a new issue.

./vep --cache --offline --format vcf --force_overwrite --gtf /opt/vep/.vep/input/Homo_sapiens.GRCh38.97.gtf.gz --species homo_sapiens --dir_cache /opt/vep/.vep/ --dir_plugins /opt/vep/.vep/Plugins/ --input_file /opt/vep/.vep/input/ukb_SPB_50k_exome_seq_filtered_for_VEP.vcf.gz --output_file /opt/vep/.vep/output/my_output.vcf

Throws the following error,

-------------------- EXCEPTION --------------------
MSG: ERROR: Could not write to stats file /opt/vep/.vep/output/my_output.vcf_summary.html

STACK Bio::EnsEMBL::VEP::BaseRunner::get_stats_file_handle /opt/vep/src/ensembl-vep/modules/Bio/EnsEMBL/VEP/BaseRunner.pm:280
STACK Bio::EnsEMBL::VEP::Runner::init /opt/vep/src/ensembl-vep/modules/Bio/EnsEMBL/VEP/Runner.pm:142
STACK Bio::EnsEMBL::VEP::Runner::run /opt/vep/src/ensembl-vep/modules/Bio/EnsEMBL/VEP/Runner.pm:194
STACK toplevel ./vep:224
Date (localtime)    = Fri Feb 21 16:47:40 2020
Ensembl API version = 99
---------------------------------------------------

If this is a complete new issue I can close this and open a new case. Thanks and I really appreciate your time and help!

at7 commented 4 years ago

Did you follow the steps from our documentation page?

Create a directory on your machine: mkdir $HOME/vep_data Make sure that the created directory on your machine has read and write access granted so the docker container can write in the directory (VEP output): chmod a+rwx $HOME/vep_data
docker run -t -i -v $HOME/vep_data:/opt/vep/.vep ensemblorg/ensembl-vep

What are the file permissions for $HOME/vep_data/Homo_sapiens.GRCh38.97.gtf.gz?

ranijames commented 4 years ago

Yes, I followed the permission granting step,

chmod a+rwx $HOME/vep_data

The directories have read write and access permissions.

at7 commented 4 years ago

I'm glad the tabix index problem is sorted! For the new warning "Could not write to stats file". Have you created the output directory under $HOME/vep_data?

ranijames commented 4 years ago

Yes, I have an output directory within the VEP folder. I see the output directory had no write permission and once I granted that I am able to run the command line. Thank you so much for your patience and support!!!

Now, there is a new issue. This is perhaps to do with the annotation. It would be great if you could guide me here.

WARNING: Parent entries with the following IDs were not found or skipped due to invalid types:

And a lot of ENSGIDs... It just keeps on showing the same warning again and again with new ENGSIDs. I would like to know whether or not I am missing something here in the command line. I am using GRCH38 reference annotation. Do I need to use a FASTA file along with the reference annotation?

at7 commented 4 years ago

Perfect! There are some GTF/GFF format expectations which are documented here Entities are linked by an attribute named for the parent entity type e.g. exon is linked to transcript by transcript_id, transcript is linked to gene by gene_id. Transcript biotypes are defined in attributes named "biotype", "transcript_biotype" or "transcript_type". If none of these exist, VEP will attempt to interpret the source field (2nd column) of the GTF as the biotype. Could you please check if your GTF fulfils those requirements? You can also send me some of the problematic lines of your GTF file and I can try and check.

at7 commented 4 years ago

Dear Alva, I'm going to close this issue. Please open a new issue if you have any questions with regard to the GTF/GFF file formats. Best regards, Anja

ranijames commented 4 years ago

Thanks, Anja for the support. The tool now generates the outputs after applying your comments. I have an additional question, which is not an issue. The output of the file has protein alterations for each transcript/isoform of each gene. I would like to have only the gene-wise alterations. Meaning that for each variant of the gene just specific alteration for that variant position. I have not seen any parameter after a quick end-to-end reading the Manuel. Could you please direct me to the documentation or something if you have any. Thanks

at7 commented 4 years ago

We have an FAQ page and I was wondering if the answer to Q: Why do I see so many lines of output for each variant in my input? might be helpful to address your problem? Let me know if not.