alexdobin / STAR

RNA-seq aligner
MIT License

No exon lines in the GTF file error during index generation #1585

Closed by gcabebe 2 years ago

gcabebe commented 2 years ago

I would like to use STAR to calculate gene expression using RNA-seq data from Pseudomonas putida. The GTF file I'm using has no exon lines so I tried using --sjdbGTFfeatureExon CDS to avoid any "no exon lines in the GTF file" related error.

The GTF file I used can be found here. I checked it and I believe it only contains CDS, ncRNA, tRNA, rRNA features.

Below is how my shell script is formatted to execute this:

#!/bin/bash
#SBATCH -N 2
#SBATCH -n 4
#SBATCH --mem=0

script=$0
FASTA_FILE=$1 # ./Pseudomonas_putida_KT2440_110.fna
GTF_FILE=$2 # ./Pseudomonas_putida_KT2440_110.gtf
INDEX_DIR=$3 # ./index_STAR

STAR --runMode genomeGenerate --runThreadN 1 --genomeDir ${INDEX_DIR} --genomeFastaFiles ${FASTA_FILE} --sjdbGTFfile ${GTF_FILE} --sjdbGTFfeatureExon CDS --sjdbOverhang 49 --quantMode GeneCounts

Below is the output error I get:

        STAR version: 2.7.10a   compiled: 2022-01-14T18:50:00-05:00 :/home/dobin/data/STAR/STARcode/STAR.master/source
Jun 17 13:11:01 ..... started STAR run
Jun 17 13:11:02 ... starting to generate Genome files
Jun 17 13:11:02 ..... processing annotations GTF

Fatal INPUT FILE error, no exon lines in the GTF file: /home/gcabebe/rnaseq/refseqKT2440/gtf/Pseudomonas_putida_KT2440_110.gtf
Solution: check the formatting of the GTF file, it must contain some lines with exon in the 3rd column.
          Make sure the GTF file is unzipped.
          If exons are marked with a different word, use --sjdbGTFfeatureExon .

Jun 17 13:11:02 ...... FATAL ERROR, exiting

Am I better off using a different sequence alignment tool like bwa or bowtie2, or could this be done with STAR? This is my first time building an RNA-seq pipeline, so I'm not entirely sure what the best tools are.

Thank you!

alexdobin commented 2 years ago

Hi Gabrielle,

I think the issue is that the GTF file contains the space-separated string "Pseudomonas Genome DB" in the 2nd column, which confuses STAR. Please replace it with a single word (a name with no spaces) and it will hopefully work.
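A minimal sketch of that fix, assuming the GTF is tab-separated so that only the 2nd (source) column needs to change. The function name `strip_source_spaces` and the output file name in the usage note are hypothetical, not part of STAR.

```shell
# Remove spaces from the 2nd (source) column of a tab-separated GTF.
# Comment lines (starting with '#') and all other columns are left untouched,
# so attribute values such as name "hypothetical protein" keep their spaces.
strip_source_spaces() {
    awk 'BEGIN { FS = OFS = "\t" }
         /^#/    { print; next }
         NF >= 2 { gsub(/ /, "", $2) }
                 { print }' "$1"
}
```

Possible usage: `strip_source_spaces Pseudomonas_putida_KT2440_110.gtf > fixed.gtf`, which would turn "Pseudomonas Genome DB" into "PseudomonasGenomeDB".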

Cheers Alex

gcabebe commented 2 years ago

I renamed all values in the 2nd column without spaces, and I'm now getting the following error message:

Fatal INPUT FILE error, no valid exon lines in the GTF file: /home/gcabebe/rnaseq/refseqKT2440/gtf/Pseudomonas_putida_KT2440_110.gtf 

Solution: check the formatting of the GTF file. One likely cause is the difference in chromosome naming between GTF and FASTA file. 

Jun 21 11:37:22 ...... FATAL ERROR, exiting 

I downloaded the genomic FASTA and GTF file from the same location (here, bottom right).

alexdobin commented 2 years ago

Hi Gabrielle,

Could you please send me the updated GTF, or a few lines from it?

gcabebe commented 2 years ago

Below is a preview of the first few lines, but I also attached the whole updated GTF file as a txt document.

chromosome      PseudomonasGenomeDB     CDS     147     1019    .       -       0       gene_id "PGD38821096"; transcript_id "PGD38821096"; locus_tag "PP_0001"; name "chromosome-partitioning protein"; replicon_xref "NC_002947.4";
chromosome      PseudomonasGenomeDB     CDS     1029    1820    .       -       0       gene_id "PGD38821098"; transcript_id "PGD38821098"; locus_tag "PP_0002"; name "chromosome partition protein"; replicon_xref "NC_002947.4";
chromosome      PseudomonasGenomeDB     CDS     1839    2489    .       -       0       gene_id "PGD38821100"; transcript_id "PGD38821100"; locus_tag "PP_0003"; name "16S RNA methyltransferase"; replicon_xref "NC_002947.4";
chromosome      PseudomonasGenomeDB     CDS     5012    6382    .       -       0       gene_id "PGD38821104"; transcript_id "PGD38821104"; locus_tag "PP_0005"; name "GTPase"; replicon_xref "NC_002947.4";
chromosome      PseudomonasGenomeDB     CDS     6471    8153    .       -       0       gene_id "PGD38821106"; transcript_id "PGD38821106"; locus_tag "PP_0006"; name "membrane protein insertase"; replicon_xref "NC_002947.4";
chromosome      PseudomonasGenomeDB     CDS     8156    8401    .       -       0       gene_id "PGD38821108"; transcript_id "PGD38821108"; locus_tag "PP_0007"; name "hypothetical protein"; replicon_xref "NC_002947.4";
chromosome      PseudomonasGenomeDB     CDS     8812    8946    .       -       0       gene_id "PGD38821112"; transcript_id "PGD38821112"; locus_tag "PP_0009"; name "50S ribosomal protein L34"; replicon_xref "NC_002947.4";
chromosome      PseudomonasGenomeDB     CDS     9542    11062   .       +       0       gene_id "PGD38821114"; transcript_id "PGD38821114"; locus_tag "PP_0010"; name "chromosomal replication initiator protein"; replicon_xref "NC_002947.4";
chromosome      PseudomonasGenomeDB     CDS     11103   12206   .       +       0       gene_id "PGD38821116"; transcript_id "PGD38821116"; locus_tag "PP_0011"; name "DNA polymerase III subunit beta"; replicon_xref "NC_002947.4";
chromosome      PseudomonasGenomeDB     CDS     12222   13325   .       +       0       gene_id "PGD38821118"; transcript_id "PGD38821118"; locus_tag "PP_0012"; name "DNA replication/repair protein"; replicon_xref "NC_002947.4";

PputidaKT2440gtf.txt

alexdobin commented 2 years ago

Thanks! If all of the entries in column 3 are CDS, then you also need --sjdbGTFfeatureExon CDS.
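One way to check which feature types column 3 actually contains before picking a value for --sjdbGTFfeatureExon is to tally them. This is only a sketch that assumes a tab-separated GTF; the function name `gtf_feature_counts` is made up for illustration.

```shell
# Print "count<TAB>feature" for every feature type in column 3 of a GTF,
# most frequent first. Comment lines and lines with fewer than 3 fields
# are skipped.
gtf_feature_counts() {
    awk -F'\t' '
        /^#/ || NF < 3 { next }
        { count[$3]++ }
        END { for (f in count) printf "%d\t%s\n", count[f], f }
    ' "$1" | sort -k1,1nr
}
```

If the output shows no `exon` rows at all, the feature name passed to --sjdbGTFfeatureExon has to be one of the types that are listed.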

gcabebe commented 2 years ago

I have been using that, but I still get the same error message as above. Alternatively, I tried replacing CDS with exon in the third column and removing --sjdbGTFfeatureExon CDS, and ended up with the exact same error message.

gcabebe commented 2 years ago

If it helps, here's an attachment of the FASTA file I was using and a preview of it below. Maybe I need to rename the first column of the GTF to match with the formatting of the FASTA?

>refseq|NC_002947.4|chromosome Pseudomonas putida KT2440 chromosome, complete genome. length=6181873;assembly=GCF_000007565.2
AACTGCTCCTCGGAAGTCGACCAACAAGTCAGCTATGACTTGGCATAATTTGTGCCGACA
AAATGCGCGCAGAGTATAGGGGTGGATTAACCCCTATTCAACTCTTCGGTAGTGATTTCC
GACTTCACGCTACAACAGGAATTGTTTCAGCGGATGTGAGCGAGCACGCCTTGCAACTCG
TCAAGCGAGTTGTAGCGAATAACCAACTGGCCTTTGCCCTTGTTGCCATGACGGATCTGC
ACGGCCGAGCCCAGGCGCTCTGCGAGCCGCTGTTCAAGGCGTGCGATATCCGGATCAGGT
TTGCTCGGTTCGACCGGATCAGGCTTGTCGCTGAGCCACTGACGGACCAGTGCCTCGGTT
TGGCGCACGGTGAGGCCACGTGCGACAACATGACGCGCCCCCTCCTCCTGACGATTTTCG
TCCAGGCCCAGCAATGCACGGGCGTGCCCCATCTCCAGATCACCGTGGGCGAGCATGGTC
TTGATCGCATCGGGCAAGGTGATGAGGCGCAGCAGGTTGGCCACAGTCACCCGCGACTTG

Pseudomonas_putida_KT2440_110.fna.gz

alexdobin commented 2 years ago

Yes, absolutely - the "chromosome names" in the FASTA file (the string after ">" and before the first space) have to match the first-field strings in the GTF.
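A quick way to test this matching rule before running genomeGenerate is to list every GTF sequence name that has no counterpart among the FASTA headers. The sketch below assumes an unzipped FASTA and a tab-separated GTF; `gtf_names_missing_from_fasta` is a hypothetical helper name.

```shell
# List sequence names used in column 1 of the GTF that do not appear as a
# FASTA sequence name (the text after ">" up to the first space or tab).
# Arguments: <fasta> <gtf>. Empty output means every GTF name was found.
gtf_names_missing_from_fasta() {
    awk -F'\t' '
        FNR == NR {                      # first file: the FASTA
            if (/^>/) {
                name = substr($0, 2)
                sub(/[ \t].*$/, "", name)
                fasta[name] = 1
            }
            next
        }
        /^#/ || NF == 0 { next }         # second file: skip comments and blanks
        !($1 in fasta) && !($1 in reported) { reported[$1] = 1; print $1 }
    ' "$1" "$2"
}
```

With the files from this thread, the FASTA name is "refseq|NC_002947.4|chromosome" while the GTF uses "chromosome", so the function would print `chromosome`.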

gcabebe commented 2 years ago

I changed "refseq|NC_002947.4|chromosome" from the FASTA file to "chromosome" to match the first column in the GTF file and it worked! Thanks so much for your help.
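For anyone who prefers to script that rename rather than edit the FASTA by hand, here is a sketch. It assumes headers follow the pipe-delimited pattern shown above and that the last pipe-delimited token is the name used in the GTF; `simplify_fasta_headers` and the output file name are hypothetical.

```shell
# Rewrite FASTA headers of the form ">refseq|NC_002947.4|chromosome ..." so
# that the sequence name is only the last pipe-delimited token (">chromosome ...").
# Headers without a "|" and all sequence lines pass through unchanged.
simplify_fasta_headers() {
    sed -E 's/^>[^[:space:]]*\|([^|[:space:]]+)/>\1/' "$1"
}
```

Possible usage: `simplify_fasta_headers Pseudomonas_putida_KT2440_110.fna > renamed.fna`, after which the index would need to be regenerated from the renamed FASTA.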

Rashmiiee commented 4 months ago

I had the same issue as @gcabebe, but for yeast (S288C_reference_genome_R64-5-1_20240529). My genome FASTA and GTF files look like this. I replaced the first term in the reference genome file with 'chr1' to match the first column of the GTF file and it worked.

[Two screenshots attached, dated 2024-06-15: previews of the genome FASTA and GTF files]