alexdobin / STAR

RNA-seq aligner
MIT License

Selection of GTF file from gencode #2081

Open SueFletcher opened 4 months ago

SueFletcher commented 4 months ago

Hello, I'm a first-year master's student, and I'm attempting to use STAR to index the mouse genome. I'm using the following script:

import os
import subprocess

class STAR:
    def __init__(self, genome_dir, genome_fasta_files, sjdb_gtf_file, runThreadN):
        self.exec_path = "/opt/conda/envs/STAR/bin/STAR"
        self.genome_dir = genome_dir
        self.genome_fasta_files = genome_fasta_files
        self.sjdb_gtf_file = sjdb_gtf_file
        self.runThreadN = runThreadN

    def build_genome_index(self):
        # Create the genome_dir directory if it doesn't exist
        os.makedirs(self.genome_dir, exist_ok=True)

        cmd = [
            self.exec_path,
            "--runMode", "genomeGenerate",
            "--runThreadN", str(self.runThreadN),
            "--genomeChrBinNbits", "12",
            "--limitGenomeGenerateRAM", "60000000000",
            "--genomeDir", self.genome_dir,
            "--genomeFastaFiles", self.genome_fasta_files,
            "--sjdbGTFfile", self.sjdb_gtf_file,
            "--genomeSAsparseD", "3"
        ]
        subprocess.check_call(cmd)

genome_dir = "/desktop/output/mouse_genome_index/"
genome_fasta_files = "/desktop/mouse_input_data/mouse_gencode_transcripts.fa"
sjdb_gtf_file = "/desktop/mouse_input_data/mouse_gencode_annotation.gtf"
runThreadN = 8

star = STAR(genome_dir, genome_fasta_files, sjdb_gtf_file, runThreadN)
star.build_genome_index()

I downloaded the mouse genome FASTA and GTF files from the GENCODE website: https://www.gencodegenes.org/mouse/

I used the following GTF file:

[image]

and this FASTA file:

[image]

However, I encountered an error that I'm having trouble understanding:

/opt/conda/envs/STAR/bin/STAR-avx2 --runMode genomeGenerate --runThreadN 8 --genomeChrBinNbits 12 --limitGenomeGenerateRAM 60000000000 --genomeDir /desktop/output/mouse_genome_index/ --genomeFastaFiles /desktop/mouse_input_data/mouse_gencode_transcripts.fa --sjdbGTFfile /desktop/mouse_input_data/mouse_gencode_annotation.gtf --genomeSAsparseD 3
STAR version: 2.7.11b compiled: 2024-01-29T15:15:38+0000 :/opt/conda/conda-bld/star_1706541070242/work/source
Feb 29 15:51:23 ..... started STAR run
Feb 29 15:51:23 ... starting to generate Genome files
Feb 29 15:51:29 ..... processing annotations GTF

Fatal INPUT FILE error, no valid exon lines in the GTF file: /desktop/mouse_input_data/mouse_gencode_annotation.gtf
Solution: check the formatting of the GTF file. One likely cause is the difference in chromosome naming between GTF and FASTA file.

Feb 29 15:51:32 ...... FATAL ERROR, exiting
Traceback (most recent call last):
  File "mouse_star_index.py", line 39, in <module>
    star.build_genome_index()
  File "mouse_star_index.py", line 31, in build_genome_index
    subprocess.check_call(cmd)
  File "/usr/lib/python3.8/subprocess.py", line 364, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/opt/conda/envs/STAR/bin/STAR', '--runMode', 'genomeGenerate', '--runThreadN', '8', '--genomeChrBinNbits', '12', '--limitGenomeGenerateRAM', '60000000000', '--genomeDir', '/desktop/mouse_genome_index/', '--genomeFastaFiles', '/desktop/mouse_input_data/mouse_gencode_transcripts.fa', '--sjdbGTFfile', '/desktop/mouse_input_data/mouse_gencode_annotation.gtf', '--genomeSAsparseD', '3']' returned non-zero exit status 104.

The error seems to be related to the GTF file, but I don't know which GTF file I should download from GENCODE to pass to --sjdbGTFfile in this case.

alexdobin commented 4 months ago

You need to use the PRI FASTA file (genome sequences, not transcriptome): https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M34/GRCm39.primary_assembly.genome.fa.gz

You can also use the PRI GTF file, which has more comprehensive annotations than the basic one.
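The chromosome-naming hint in the error message can be checked directly before rerunning STAR. The following is a minimal sketch, not part of STAR: the helper names (`fasta_sequence_names`, `gtf_exon_seqnames`, `shared_seqnames`) are hypothetical, and it assumes the sequence name is the FASTA header text up to the first whitespace. With a transcriptome FASTA the headers are transcript IDs rather than chromosome names, so the overlap with the GTF's first column would be expected to be empty.

```python
import gzip


def _open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def fasta_sequence_names(path):
    """Collect sequence names: header text after '>' up to the first whitespace."""
    names = set()
    with _open_text(path) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                if fields:
                    names.add(fields[0])
    return names


def gtf_exon_seqnames(path):
    """Collect the first column (seqname) of every exon line in a GTF file."""
    names = set()
    with _open_text(path) as handle:
        for line in handle:
            if line.startswith("#"):
                continue  # skip header/comment lines
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 3 and fields[2] == "exon":
                names.add(fields[0])
    return names


def shared_seqnames(fasta_path, gtf_path):
    """Names present both in the FASTA headers and on GTF exon lines."""
    return fasta_sequence_names(fasta_path) & gtf_exon_seqnames(gtf_path)
```

An empty result from `shared_seqnames(fasta_path, gtf_path)` would be consistent with the "no valid exon lines" error, since no exon in the GTF refers to a sequence that exists in the FASTA.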

SueFletcher commented 4 months ago

To go with GRCm39.primary_assembly.genome.fa.gz, is this the right GTF file? https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M34/gencode.vM34.primary_assembly.annotation.gtf.gz

alexdobin commented 3 months ago

Yes, correct!
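Putting the answer together, a corrected index build might look like the sketch below. It makes several assumptions: both primary-assembly files were already downloaded, the helper names (`gunzip`, `build_index_cmd`) and local paths are hypothetical, and the files are decompressed first because STAR expects a plain-text (not zipped) genome FASTA; the GTF is decompressed on the same assumption. The tuning options are copied unchanged from the original script.

```python
import gzip
import shutil


def gunzip(src_gz, dst):
    """Decompress a .gz file to dst and return dst."""
    with gzip.open(src_gz, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dst


def build_index_cmd(star_path, genome_dir, genome_fasta, gtf, threads):
    """Build the genomeGenerate argument list for a genome FASTA and matching GTF."""
    return [
        star_path,
        "--runMode", "genomeGenerate",
        "--runThreadN", str(threads),
        # Tuning options copied unchanged from the original script
        "--genomeChrBinNbits", "12",
        "--limitGenomeGenerateRAM", "60000000000",
        "--genomeSAsparseD", "3",
        "--genomeDir", genome_dir,
        "--genomeFastaFiles", genome_fasta,  # genome sequences, not transcripts
        "--sjdbGTFfile", gtf,
    ]


# Example usage (hypothetical local paths; uncomment to run):
# import os, subprocess
# data = "/desktop/mouse_input_data/"
# fasta = gunzip(data + "GRCm39.primary_assembly.genome.fa.gz",
#                data + "GRCm39.primary_assembly.genome.fa")
# gtf = gunzip(data + "gencode.vM34.primary_assembly.annotation.gtf.gz",
#              data + "gencode.vM34.primary_assembly.annotation.gtf")
# os.makedirs("/desktop/output/mouse_genome_index/", exist_ok=True)
# subprocess.check_call(build_index_cmd(
#     "/opt/conda/envs/STAR/bin/STAR",
#     "/desktop/output/mouse_genome_index/", fasta, gtf, 8))
```

Taking the FASTA and the GTF from the same GENCODE release (M34 here) and the same region set (primary assembly) keeps the sequence names in the two files consistent.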