Closed by gcabebe 2 years ago
Hi Gabrielle,
I think the issue is that the GTF file contains the space-separated string "Pseudomonas Genome DB" in the 2nd column, which confuses STAR. Please replace it with a single-word name (no spaces) and it will hopefully work.
Cheers Alex
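A minimal sketch of that fix in Python, assuming the GTF is tab-separated with the usual nine columns. The function name `strip_spaces_in_source` is hypothetical and is not part of STAR.

```python
def strip_spaces_in_source(gtf_line: str) -> str:
    """Remove spaces from the 2nd (source) column of one GTF line.

    Assumes tab-separated columns; comment lines and lines with fewer
    than nine fields are returned unchanged.
    """
    if gtf_line.startswith("#"):
        return gtf_line
    newline = "\n" if gtf_line.endswith("\n") else ""
    fields = gtf_line.rstrip("\n").split("\t")
    if len(fields) < 9:
        return gtf_line
    # "Pseudomonas Genome DB" -> "PseudomonasGenomeDB"
    fields[1] = fields[1].replace(" ", "")
    return "\t".join(fields) + newline
```

Applying this to every line of the GTF and writing the result to a new file leaves the original annotation untouched.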
I renamed all values in the 2nd column without spaces, and I'm now getting the following error message:
Fatal INPUT FILE error, no valid exon lines in the GTF file: /home/gcabebe/rnaseq/refseqKT2440/gtf/Pseudomonas_putida_KT2440_110.gtf
Solution: check the formatting of the GTF file. One likely cause is the difference in chromosome naming between GTF and FASTA file.
Jun 21 11:37:22 ...... FATAL ERROR, exiting
I downloaded the genomic FASTA and GTF file from the same location (here, bottom right).
Hi Gabrielle,
Could you please send me the updated GTF, or a few lines from it?
Below is a preview of the first few lines, but I also attached the whole updated GTF file as a txt document.
chromosome PseudomonasGenomeDB CDS 147 1019 . - 0 gene_id "PGD38821096"; transcript_id "PGD38821096"; locus_tag "PP_0001"; name "chromosome-partitioning protein"; replicon_xref "NC_002947.4";
chromosome PseudomonasGenomeDB CDS 1029 1820 . - 0 gene_id "PGD38821098"; transcript_id "PGD38821098"; locus_tag "PP_0002"; name "chromosome partition protein"; replicon_xref "NC_002947.4";
chromosome PseudomonasGenomeDB CDS 1839 2489 . - 0 gene_id "PGD38821100"; transcript_id "PGD38821100"; locus_tag "PP_0003"; name "16S RNA methyltransferase"; replicon_xref "NC_002947.4";
chromosome PseudomonasGenomeDB CDS 5012 6382 . - 0 gene_id "PGD38821104"; transcript_id "PGD38821104"; locus_tag "PP_0005"; name "GTPase"; replicon_xref "NC_002947.4";
chromosome PseudomonasGenomeDB CDS 6471 8153 . - 0 gene_id "PGD38821106"; transcript_id "PGD38821106"; locus_tag "PP_0006"; name "membrane protein insertase"; replicon_xref "NC_002947.4";
chromosome PseudomonasGenomeDB CDS 8156 8401 . - 0 gene_id "PGD38821108"; transcript_id "PGD38821108"; locus_tag "PP_0007"; name "hypothetical protein"; replicon_xref "NC_002947.4";
chromosome PseudomonasGenomeDB CDS 8812 8946 . - 0 gene_id "PGD38821112"; transcript_id "PGD38821112"; locus_tag "PP_0009"; name "50S ribosomal protein L34"; replicon_xref "NC_002947.4";
chromosome PseudomonasGenomeDB CDS 9542 11062 . + 0 gene_id "PGD38821114"; transcript_id "PGD38821114"; locus_tag "PP_0010"; name "chromosomal replication initiator protein"; replicon_xref "NC_002947.4";
chromosome PseudomonasGenomeDB CDS 11103 12206 . + 0 gene_id "PGD38821116"; transcript_id "PGD38821116"; locus_tag "PP_0011"; name "DNA polymerase III subunit beta"; replicon_xref "NC_002947.4";
chromosome PseudomonasGenomeDB CDS 12222 13325 . + 0 gene_id "PGD38821118"; transcript_id "PGD38821118"; locus_tag "PP_0012"; name "DNA replication/repair protein"; replicon_xref "NC_002947.4";
Thanks!
If all of the entries in column 3 are CDS, then you also need --sjdbGTFfeatureExon CDS.
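One way to check which feature types column 3 actually contains is to tally them. This is a rough sketch, assuming a tab-separated GTF; `count_features` is a hypothetical helper name.

```python
from collections import Counter


def count_features(gtf_lines):
    """Tally the values in the 3rd (feature) column of a GTF.

    Accepts any iterable of lines, for example an open file handle.
    Comment lines and blank lines are skipped.
    """
    counts = Counter()
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3:
            counts[fields[2]] += 1
    return counts
```

If the tally has no `exon` entries, STAR needs to be told which feature to treat as an exon, which is what `--sjdbGTFfeatureExon CDS` does here.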
I have been using that, but I still get the same error message as above. Alternatively, I tried replacing CDS with exon in the third column and removing --sjdbGTFfeatureExon CDS, and ended up with the exact same error message.
If it helps, here's an attachment of the FASTA file I was using and a preview of it below. Maybe I need to rename the first column of the GTF to match with the formatting of the FASTA?
>refseq|NC_002947.4|chromosome Pseudomonas putida KT2440 chromosome, complete genome. length=6181873;assembly=GCF_000007565.2
AACTGCTCCTCGGAAGTCGACCAACAAGTCAGCTATGACTTGGCATAATTTGTGCCGACA
AAATGCGCGCAGAGTATAGGGGTGGATTAACCCCTATTCAACTCTTCGGTAGTGATTTCC
GACTTCACGCTACAACAGGAATTGTTTCAGCGGATGTGAGCGAGCACGCCTTGCAACTCG
TCAAGCGAGTTGTAGCGAATAACCAACTGGCCTTTGCCCTTGTTGCCATGACGGATCTGC
ACGGCCGAGCCCAGGCGCTCTGCGAGCCGCTGTTCAAGGCGTGCGATATCCGGATCAGGT
TTGCTCGGTTCGACCGGATCAGGCTTGTCGCTGAGCCACTGACGGACCAGTGCCTCGGTT
TGGCGCACGGTGAGGCCACGTGCGACAACATGACGCGCCCCCTCCTCCTGACGATTTTCG
TCCAGGCCCAGCAATGCACGGGCGTGCCCCATCTCCAGATCACCGTGGGCGAGCATGGTC
TTGATCGCATCGGGCAAGGTGATGAGGCGCAGCAGGTTGGCCACAGTCACCCGCGACTTG
Yes, absolutely - the "chromosome name" in the FASTA file (the string after ">" and before the first space) has to match the string in the first field of the GTF.
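The rule above can be checked before building the index. The sketch below is only an illustration, assuming a tab-separated GTF; the function names are hypothetical. It collects the FASTA names as STAR reads them (text after ">" up to the first whitespace) and reports GTF sequence names that have no counterpart.

```python
def fasta_names(fasta_lines):
    """Return the set of sequence names: text after '>' up to the first whitespace."""
    names = set()
    for line in fasta_lines:
        if line.startswith(">"):
            parts = line[1:].split()
            if parts:
                names.add(parts[0])
    return names


def gtf_seqnames(gtf_lines):
    """Return the set of values found in the 1st column of a tab-separated GTF."""
    names = set()
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue
        names.add(line.rstrip("\n").split("\t", 1)[0])
    return names


def unmatched_gtf_names(fasta_lines, gtf_lines):
    """GTF sequence names that do not appear among the FASTA names."""
    return gtf_seqnames(gtf_lines) - fasta_names(fasta_lines)
```

With the files from this thread, the FASTA name is `refseq|NC_002947.4|chromosome` while the GTF uses `chromosome`, so the check would report `chromosome` as unmatched.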
I changed "refseq|NC_002947.4|chromosome" from the FASTA file to "chromosome" to match the first column in the GTF file and it worked! Thanks so much for your help.
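For anyone who prefers to script the rename rather than edit the FASTA by hand, here is a small sketch. `rename_fasta_header` and its `mapping` argument are hypothetical names, and the code assumes the sequence name is separated from the description by a space.

```python
def rename_fasta_header(line, mapping):
    """Rename the sequence name on a FASTA header line.

    `mapping` maps old names to new names. Sequence lines and headers
    whose name is not in the mapping are returned unchanged.
    """
    if not line.startswith(">"):
        return line
    newline = "\n" if line.endswith("\n") else ""
    body = line[1:].rstrip("\n")
    # The name is everything before the first space; the rest is the description.
    name, _, rest = body.partition(" ")
    new_name = mapping.get(name, name)
    return ">" + new_name + (" " + rest if rest else "") + newline
```

For this genome the mapping would be `{"refseq|NC_002947.4|chromosome": "chromosome"}`. The STAR genome index has to be regenerated from the edited FASTA afterwards.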
I had the same issue as @gcabebe, but for yeast (S288C_reference_genome_R64-5-1_20240529). My genome FASTA and GTF files look like this. I changed the first term in the reference genome file to 'chr1' to match the first column of the GTF file and it worked.
I would like to use STAR to calculate gene expression using RNA-seq data from Pseudomonas putida. The GTF file I'm using has no exon lines so I tried using --sjdbGTFfeatureExon CDS to avoid any "no exon lines in the GTF file" related error.
The GTF file I used can be found here. I checked it and I believe it only contains CDS, ncRNA, tRNA, rRNA features.
Below is how my shell script is formatted to execute this:
Below is the output error I get:
Am I better off using a different sequence alignment tool like bwa or bowtie2, or can this be done with STAR? This is my first time building an RNA-seq pipeline, so I'm not entirely sure which tools are best to use.
Thank you!