Putnam-Lab / Lab_Management

13 stars 7 forks source link

Issues generating STAR output due to gff naming of transcript IDs #13

Closed JillAshey closed 3 years ago

JillAshey commented 3 years ago

Attempting to build genome index so that genes can be mapped to reference genome for Pdamicornis. Reference genome and gff file from Reef Genomics

module load STAR/2.5.3a-foss-2016b

STAR --runThreadN 10 --runMode genomeGenerate --genomeDir /data/putnamlab/jillashey/Francois_data/output/STAR/GenomeIndex_pdam_rgGFF/ --genomeFastaFiles /data/putnamlab/jillashey/genome/Pdam/pdam_scaffolds.fasta --sjdbGTFfile /data/putnamlab/jillashey/genome/Pdam/pdam_annotation.gff3

Did not work with this gff, gave me this error:

terminate called after throwing an instance of 'std::out_of_range' what(): vector::_M_range_check: __n (which is 0) >= this->size() (which is 0) Aborted 

Put transcript ID in the gff file by taking it from the ID in the attribute column in R

Load libraries
library(tidyverse)

pdam.gff <- read.csv(file="~/Desktop/pdam_annotation.gff3", header=FALSE, sep="\t", skip=2) 

#rename columns
colnames(pdam.gff) <- c("scaffold", "Gene.Predict", "id", "gene.start","gene.stop", "pos1", "pos2","pos3", "gene")

# seeing what kinds of components are in annotation
unique(pdam.gff$id)

# Making ref_gene_id, moving gene id into separate column
pdam.gff$gene_id <- sub(";.*", "", pdam.gff$gene)
pdam.gff$gene_id <- gsub("ID=", "transcript_id=", pdam.gff$gene_id) #remove ID= and replace with transcript_id=
# or could do: pdam.gff$gene_id <- gsub("ID=", "gene_id=", pdam.gff$gene_id) #remove ID= and replace with gene_id=
head(pdam.gff)

# paste gene id back into gene column
pdam.gff$gene <- paste(pdam.gff$gene, pdam.gff$gene_id)
head(pdam.gff$gene)

# remove white spaces
pdam.gff$gene <- gsub(" ", "", pdam.gff$gene, fixed = TRUE) #remove white space
head(pdam.gff)

# Remove last column 
pdam.gff <- pdam.gff[,-10]

#save file
write.table(pdam.gff, file="~/Desktop/pdam_annotation_AddTranscript_id_fixed.gtf", sep="\t", col.names = FALSE, row.names=FALSE, quote=FALSE)

This fixed my issue