Closed ranijames closed 4 years ago
Dear Alva,
I can reproduce the problem partly. I'm running docker release_97.4. If my index file is missing I get the following message: Couldn't find index for file /opt/vep/.vep/Homo_sapiens.GRCh38.97.chr.gtf.gz at /usr/local/lib/x86_64-linux-gnu/perl/5.26.1/Bio/DB/HTS/Tabix.pm line 52. I then have to sort, bgzip and tabix Homo_sapiens.GRCh38.97.chr.gtf.gz. At the the end I created /opt/vep/.vep/Homo_sapiens.GRCh38.97.chr.gtf.gz.tbi. Running the vep command again after creating the tabix file works fine for me.
Was the tabix -p gff Homo_sapiens.GRCh38.97.gtf.gz command successful and did it create Homo_sapiens.GRCh38.97.gtf.gz.tbi?
Which ensembl tag version are you using for your docker image?
Best wishes, Anja
Dear Anja,
Thanks for the reply and for testing the case.
The docker version I am using is Version: 19.03.5
. The Tag for ensemble VEP image is latest. I have the Homo_sapiens.GRCh38.97.gtf.gz.tbi
file in my input folder.
The tabix -p gff Homo_sapiens.GRCh38.97.gtf.gz
was successful.
As it did not throw any sort of error message. However, I can redo it again.
I have deleted the .tbi
file and re-ran thetabix
command line for generating new index file. And tried running VEP
in docker. Now, it is throwing a new error,
vep@44487d84ba8e:~/src/ensembl-vep$ ./vep --cache --offline --format vcf --force_overwrite --gtf /opt/vep/.vep/Homo_sapiens.GRCh38.97.gtf --species homo_sapiens --dir_cache /opt/vep/.vep/ --dir_plugins /opt/vep/.vep/Plugins/ --input_file /opt/vep/.vep/ukb_SPB_50k_exome_seq_filtered_for_VEP.vcf.gz --output_file /opt/vep/.vep/my_output.vcf
ERROR: Input file is not bgzipped, cannot use tabix
at /opt/vep/src/ensembl-vep/Bio/EnsEMBL/IO/TabixParser.pm line 52.
Bio::EnsEMBL::IO::TabixParser::open("Bio::EnsEMBL::IO::Parser::GTFTabix", "/opt/vep/.vep/Homo_sapiens.GRCh38.97.gtf", "must_parse_metadata", 0) called at /opt/vep/src/ensembl-vep/Bio/EnsEMBL/IO/Parser/GTFTabix.pm line 50
Bio::EnsEMBL::IO::Parser::GTFTabix::open("Bio::EnsEMBL::IO::Parser::GTFTabix", "/opt/vep/.vep/Homo_sapiens.GRCh38.97.gtf", "must_parse_metadata", 0) called at /opt/vep/src/ensembl-vep/modules/Bio/EnsEMBL/VEP/AnnotationSource/File/GTF.pm line 101
Bio::EnsEMBL::VEP::AnnotationSource::File::GTF::parser(Bio::EnsEMBL::VEP::AnnotationSource::File::GTF=HASH(0x55a901501530)) called at /opt/vep/src/ensembl-vep/modules/Bio/EnsEMBL/VEP/AnnotationSource/File.pm line 326
Bio::EnsEMBL::VEP::AnnotationSource::File::valid_chromosomes(Bio::EnsEMBL::VEP::AnnotationSource::File::GTF=HASH(0x55a901501530)) called at /opt/vep/src/ensembl-vep/modules/Bio/EnsEMBL/VEP/BaseRunner.pm line 386
Bio::EnsEMBL::VEP::BaseRunner::valid_chromosomes(Bio::EnsEMBL::VEP::Runner=HASH(0x55a8fe207a60)) called at /opt/vep/src/ensembl-vep/modules/Bio/EnsEMBL/VEP/Runner.pm line 791
Bio::EnsEMBL::VEP::Runner::get_Parser(Bio::EnsEMBL::VEP::Runner=HASH(0x55a8fe207a60)) called at /opt/vep/src/ensembl-vep/modules/Bio/EnsEMBL/VEP/Runner.pm line 818
Bio::EnsEMBL::VEP::Runner::get_InputBuffer(Bio::EnsEMBL::VEP::Runner=HASH(0x55a8fe207a60)) called at /opt/vep/src/ensembl-vep/modules/Bio/EnsEMBL/VEP/Runner.pm line 131
Bio::EnsEMBL::VEP::Runner::init(Bio::EnsEMBL::VEP::Runner=HASH(0x55a8fe207a60)) called at /opt/vep/src/ensembl-vep/modules/Bio/EnsEMBL/VEP/Runner.pm line 194
Bio::EnsEMBL::VEP::Runner::run(Bio::EnsEMBL::VEP::Runner=HASH(0x55a8fe207a60)) called at ./vep line 224
Whether or not the issue could be because I am using my own reference annotation file (the gtf
file)?
Let's try compressing and indexing Homo_sapiens.GRCh38.97.gtf again. Could you please: gunzip Homo_sapiens.GRCh38.97.gtf.gz try bgzip Homo_sapiens.GRCh38.97.gtf. It should work if Homo_sapiens.GRCh38.97.gtf is sorted. Otherwise you need to sort Homo_sapiens.GRCh38.97.gtf. I ran: mv Homo_sapiens.GRCh38.97.gtf Homo_sapiens.GRCh38.97.unsorted.gtf (grep ^"#" Homo_sapiens.GRCh38.97.unsorted.gtf; grep -v ^"#" Homo_sapiens.GRCh38.97.unsorted.gtf | sort -k1,1 -k4,4n) | bgzip > Homo_sapiens.GRCh38.97.gtf.gz
tabix -p gff Homo_sapiens.GRCh38.97.gtf.gz
Hopefully after those steps the error message should go away. Let me know how you are getting on.
Hey, I tried all the steps as you mentioned, as follows,
gunzip Homo_sapiens.GRCh38.97.gtf.gz
bgzip Homo_sapiens.GRCh38.97.gtf
# it worked without any error, as the GTF file was sorted.
tabix -p gff Homo_sapiens.GRCh38.97.gtf.gz
After the above steps, within the docker container after activating the docker using the following one-liner,
docker run -t -i -v $HOME/vep_data:/opt/vep/.vep ensemblorg/ensembl-vep
Within the docker container, I ran the following one-liner to run the docker image for VEP tool,
./vep --cache --offline --format vcf --force_overwrite --gtf /opt/vep/.vep/Homo_sapiens.GRCh38.97.gtf --species homo_sapiens --dir_cache /opt/vep/.vep/ --dir_plugins /opt/vep/.vep/Plugins/ --input_file /opt/vep/.vep/ukb_SPB_50k_exome_seq_filtered_for_VEP.vcf.gz --output_file /opt/vep/.vep/my_output.vcf
Unfortunately, the tool agagin throws the same error message,
ERROR: Input file is not bgzipped, cannot use tabix
at /opt/vep/src/ensembl-vep/Bio/EnsEMBL/IO/TabixParser.pm line 52.
Bio::EnsEMBL::IO::TabixParser::open("Bio::EnsEMBL::IO::Parser::GTFTabix", "/opt/vep/.vep/Homo_sapiens.GRCh38.97.gtf", "must_parse_metadata", 0) called at /opt/vep/src/ensembl-vep/Bio/EnsEMBL/IO/Parser/GTFTabix.pm line 50
Bio::EnsEMBL::IO::Parser::GTFTabix::open("Bio::EnsEMBL::IO::Parser::GTFTabix", "/opt/vep/.vep/Homo_sapiens.GRCh38.97.gtf", "must_parse_metadata", 0) called at /opt/vep/src/ensembl-vep/modules/Bio/EnsEMBL/VEP/AnnotationSource/File/GTF.pm line 101
Bio::EnsEMBL::VEP::AnnotationSource::File::GTF::parser(Bio::EnsEMBL::VEP::AnnotationSource::File::GTF=HASH(0x559a99939b50)) called at /opt/vep/src/ensembl-vep/modules/Bio/EnsEMBL/VEP/AnnotationSource/File.pm line 326
Bio::EnsEMBL::VEP::AnnotationSource::File::valid_chromosomes(Bio::EnsEMBL::VEP::AnnotationSource::File::GTF=HASH(0x559a99939b50)) called at /opt/vep/src/ensembl-vep/modules/Bio/EnsEMBL/VEP/BaseRunner.pm line 386
Bio::EnsEMBL::VEP::BaseRunner::valid_chromosomes(Bio::EnsEMBL::VEP::Runner=HASH(0x559a9652ed70)) called at /opt/vep/src/ensembl-vep/modules/Bio/EnsEMBL/VEP/Runner.pm line 791
Bio::EnsEMBL::VEP::Runner::get_Parser(Bio::EnsEMBL::VEP::Runner=HASH(0x559a9652ed70)) called at /opt/vep/src/ensembl-vep/modules/Bio/EnsEMBL/VEP/Runner.pm line 818
Bio::EnsEMBL::VEP::Runner::get_InputBuffer(Bio::EnsEMBL::VEP::Runner=HASH(0x559a9652ed70)) called at /opt/vep/src/ensembl-vep/modules/Bio/EnsEMBL/VEP/Runner.pm line 131
Bio::EnsEMBL::VEP::Runner::init(Bio::EnsEMBL::VEP::Runner=HASH(0x559a9652ed70)) called at /opt/vep/src/ensembl-vep/modules/Bio/EnsEMBL/VEP/Runner.pm line 194
Bio::EnsEMBL::VEP::Runner::run(Bio::EnsEMBL::VEP::Runner=HASH(0x559a9652ed70)) called at ./vep line 224
It would be really great to know if there is any workarounds for this sprt of index errors. Thank you
I just noticed that you are using --gtf /opt/vep/.vep/Homo_sapiens.GRCh38.97.gtf in you vep command. But it must be --gtf /opt/vep/.vep/Homo_sapiens.GRCh38.97.gtf.gz. Could you please update the --gtf option and try running again?
Yes, I tried as you mentioned,
./vep --cache --offline --format vcf --force_overwrite --gtf /opt/vep/.vep/Homo_sapiens.GRCh38.97.gtf.gz --species homo_sapiens --dir_cache /opt/vep/.vep/ --dir_plugins /opt/vep/.vep/Plugins/ --input_file /opt/vep/.vep/ukb_SPB_50k_exome_seq_filtered_for_VEP.vcf.gz --output_file /opt/vep/.vep/my_output.vcf
It is now throwing the previous error,
[E::hts_open_format] Failed to open file /opt/vep/.vep/Homo_sapiens.GRCh38.97.gtf.gz
Couldn't find index for file /opt/vep/.vep/Homo_sapiens.GRCh38.97.gtf.gz at /usr/local/lib/x86_64-linux-gnu/perl/5.26.1/Bio/DB/HTS/Tabix.pm line 53.
Could you please check if you can read the file from your directory?: vep@44487d84ba8e:~/src/ensembl-vep tabix -H /opt/vep/.vep/Homo_sapiens.GRCh38.97.gtf.gz
it says Could not read /opt/vep/.vep/Homo_sapiens.GRCh38.97.gtf.gz
My bad the file is in the input directory, I changed the path. Sorry about that. Now, there is a new issue.
./vep --cache --offline --format vcf --force_overwrite --gtf /opt/vep/.vep/input/Homo_sapiens.GRCh38.97.gtf.gz --species homo_sapiens --dir_cache /opt/vep/.vep/ --dir_plugins /opt/vep/.vep/Plugins/ --input_file /opt/vep/.vep/input/ukb_SPB_50k_exome_seq_filtered_for_VEP.vcf.gz --output_file /opt/vep/.vep/output/my_output.vcf
Throws the following error,
-------------------- EXCEPTION --------------------
MSG: ERROR: Could not write to stats file /opt/vep/.vep/output/my_output.vcf_summary.html
STACK Bio::EnsEMBL::VEP::BaseRunner::get_stats_file_handle /opt/vep/src/ensembl-vep/modules/Bio/EnsEMBL/VEP/BaseRunner.pm:280
STACK Bio::EnsEMBL::VEP::Runner::init /opt/vep/src/ensembl-vep/modules/Bio/EnsEMBL/VEP/Runner.pm:142
STACK Bio::EnsEMBL::VEP::Runner::run /opt/vep/src/ensembl-vep/modules/Bio/EnsEMBL/VEP/Runner.pm:194
STACK toplevel ./vep:224
Date (localtime) = Fri Feb 21 16:47:40 2020
Ensembl API version = 99
---------------------------------------------------
If this is a complete new issue I can close this and open a new case. Thanks and I really appreciate your time and help!
Did you follow the steps from our documentation page?
Create a directory on your machine:
mkdir $HOME/vep_data
Make sure that the created directory on your machine has read and write access granted so the docker container can write in the directory (VEP output):
chmod a+rwx $HOME/vep_data
docker run -t -i -v $HOME/vep_data:/opt/vep/.vep ensemblorg/ensembl-vep
What are the file permissions for $HOME/vep_data/Homo_sapiens.GRCh38.97.gtf.gz?
Yes, I followed the permission granting step,
chmod a+rwx $HOME/vep_data
The directories have read write and access permissions.
I'm glad the tabix index problem is sorted! For the new warning "Could not write to stats file". Have you created the output directory under $HOME/vep_data?
Yes, I have an output directory within the VEP folder. I see the output directory had no write permission and once I granted that I am able to run the command line. Thank you so much for your patience and support!!!
Now, there is a new issue. This is perhaps to do with the annotation. It would be great if you could guide me here.
WARNING: Parent entries with the following IDs were not found or skipped due to invalid types:
And a lot of ENSGIDs
... It just keeps on showing the same warning again and again with new ENGSIDs
. I would like to know whether or not I am missing something here in the command line.
I am using GRCH38 reference annotation. Do I need to use a FASTA file along with the reference annotation?
Perfect! There are some GTF/GFF format expectations which are documented here Entities are linked by an attribute named for the parent entity type e.g. exon is linked to transcript by transcript_id, transcript is linked to gene by gene_id. Transcript biotypes are defined in attributes named "biotype", "transcript_biotype" or "transcript_type". If none of these exist, VEP will attempt to interpret the source field (2nd column) of the GTF as the biotype. Could you please check if your GTF fulfils those requirements? You can also send me some of the problematic lines of your GTF file and I can try and check.
Dear Alva, I'm going to close this issue. Please open a new issue if you have any questions with regard to the GTF/GFF file formats. Best regards, Anja
Thanks, Anja for the support. The tool now generates the outputs after applying your comments. I have an additional question, which is not an issue. The output of the file has protein alterations for each transcript/isoform of each gene. I would like to have only the gene-wise alterations. Meaning that for each variant of the gene just specific alteration for that variant position. I have not seen any parameter after a quick end-to-end reading the Manuel. Could you please direct me to the documentation or something if you have any. Thanks
Dear All, I am trying to use VEP docker image for predicting protein interactions of my variants(.vcf file). Here is what I tried, within my docker container,
./vep --cache --offline --format vcf --force_overwrite --gtf /opt/vep/.vep/Homo_sapiens.GRCh38.97.gtf.gz --species homo_sapiens --dir_cache /opt/vep/.vep/ --dir_plugins /opt/vep/.vep/Plugins/ --input_file /opt/vep/.vep/ukb_SPB_50k_exome_seq_filtered_for_VEP.vcf.gz --output_file /opt/vep/.vep/my_output.vcf
However, the above command line is throwing the following error,
The GTF file is sorted and gzipped. And, later indexed using
TABIX
using the following line,tabix -p gff Homo_sapiens.GRCh38.97.gtf.gz
I checked whether the compression is correct using,
I would be really thankful if someone please help me find where I am going wrong.
Thank you!