Putnam-Lab / Lab_Management

13 stars 7 forks source link

Issues with StringTie STRG gene names when mapping to new version of M. capitata genome #51

Closed AHuffmyer closed 1 year ago

AHuffmyer commented 1 year ago

I am re-running gene expression analysis using the M. capitata reference genome and functional annotation from Stephens et al. 2022 available from Rutgers Cyanophora.

Has anyone else tried aligning and analyzing TagSeq or RNAseq data to this version?

I am having problems with StringTie using STRG identifiers rather than gene names to sequences that are aligned to the genome. I ran the same code on previous versions of the genome (versions 1 and 2) and did not have this problem, so it is localized to this version of the genome.

Any ideas on how to fix this StringTie problem? Perhaps I am using incorrect options for the merging steps?

Here is my full code:

Mcap Early Life Gene Expression

Genome version V3

Genome publication:
https://academic.oup.com/gigascience/article/doi/10.1093/gigascience/giac098/6815755

Download, QC, filtering/cleaning done previously. Starting at the alignment step.

1. HISAT2

Obtain reference genome assembly and gff annotation file

cd sequences/ 

wget http://cyanophora.rutgers.edu/montipora/Montipora_capitata_HIv3.assembly.fasta.gz

wget http://cyanophora.rutgers.edu/montipora/Montipora_capitata_HIv3.genes.gff3.gz

Unzip gff and genome file

gunzip sequences/Montipora_capitata_HIv3.genes.gff3.gz
gunzip Montipora_capitata_HIv3.assembly.fasta.gz
nano /data/putnamlab/ashuffmyer/mcap-2020-tagseq/scripts/align_v3.sh
#!/bin/bash
#SBATCH -t 120:00:00
#SBATCH --nodes=1 --ntasks-per-node=20
#SBATCH --export=NONE
#SBATCH --mem=200GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=ashuffmyer@uri.edu #your email to send notifications
#SBATCH --partition=putnamlab                  
#SBATCH --error="alignV3_error" #if your job fails, the error report will be put in this file
#SBATCH --output="alignV3_output" #once your job is completed, any final job report comments will be put in this file
#SBATCH -D /data/putnamlab/ashuffmyer/mcap-2020-tagseq/sequences

# load modules needed
module load HISAT2/2.2.1-foss-2019b
module load SAMtools/1.9-foss-2018b

#unzip reference genome
#gunzip Montipora_capitata_HIv3.assembly.fasta.gz

# index the reference genome for Montipora capitata output index to working directory
hisat2-build -f /data/putnamlab/ashuffmyer/mcap-2020-tagseq/sequences/Montipora_capitata_HIv3.assembly.fasta ./Mcapitata_ref_v3 # called the reference genome (scaffolds)
echo "Referece genome indexed. Starting alingment" $(date)

# This script exports alignments as bam files
# sorts the bam file because Stringtie takes a sorted file for input (--dta)
# removes the sam file because it is no longer needed
array=($(ls clean*)) # call the clean sequences - make an array to align
for i in ${array[@]}; do
        sample_name=`echo $i| awk -F [.] '{print $2}'`
    hisat2 -p 8 --dta -x Mcapitata_ref_v3 -U ${i} -S ${sample_name}.sam
        samtools sort -@ 8 -o ${sample_name}.bam ${sample_name}.sam
            echo "${i} bam-ified!"
        rm ${sample_name}.sam
done
sbatch /data/putnamlab/ashuffmyer/mcap-2020-tagseq/scripts/align_v3.sh

This creates a .bam file for each sample.

Alignment rates were similar between V1, V2, and V3 (68-72%).

2. Stringtie 2

nano /data/putnamlab/ashuffmyer/mcap-2020-tagseq/scripts/stringtie_v3.sh
#!/bin/bash
#SBATCH -t 120:00:00
#SBATCH --nodes=1 --ntasks-per-node=20
#SBATCH --export=NONE
#SBATCH --mem=200GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=ashuffmyer@uri.edu #your email to send notifications
#SBATCH --account=putnamlab                  
#SBATCH --error="stringtieV3_error" #if your job fails, the error report will be put in this file
#SBATCH --output="stringtieV3_output" #once your job is completed, any final job report comments will be put in this file
#SBATCH -D /data/putnamlab/ashuffmyer/mcap-2020-tagseq/sequences

#load packages
module load StringTie/2.1.4-GCC-9.3.0

#Transcript assembly: StringTie

array=($(ls *.bam)) #Make an array of sequences to assemble

for i in ${array[@]}; do #Running with the -e option to compare output to exclude novel genes. Also output a file with the gene abundances
        sample_name=`echo $i| awk -F [_] '{print $1"_"$2"_"$3}'`
    stringtie -p 8 -e -B -G Montipora_capitata_HIv3.genes.gff3 -A ${sample_name}.gene_abund.tab -o ${sample_name}.gtf ${i}
        echo "StringTie assembly for seq file ${i}" $(date)
done
echo "StringTie assembly COMPLETE, starting assembly analysis" $(date)

-p means number of threads/CPUs to use (8 here)

-e means only estimate abundance of given reference transcripts (only genes from the genome) - dont use if using splice variance aware to merge novel and ref based.

-B means enable output of ballgown table files to be created in same output as GTF

-G means genome reference to be included in the merging

sbatch /data/putnamlab/ashuffmyer/mcap-2020-tagseq/scripts/stringtie_v3.sh

This will make a .gtf file for each sample.

3. Prep DE

nano /data/putnamlab/ashuffmyer/mcap-2020-tagseq/scripts/prepDE_v3.sh
#!/bin/bash
#SBATCH -t 120:00:00
#SBATCH --nodes=1 --ntasks-per-node=20
#SBATCH --export=NONE
#SBATCH --mem=200GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=ashuffmyer@uri.edu #your email to send notifications
#SBATCH --account=putnamlab                  
#SBATCH --error="stringtieV3_error" #if your job fails, the error report will be put in this file
#SBATCH --output="stringtieV3_output" #once your job is completed, any final job report comments will be put in this file
#SBATCH -D /data/putnamlab/ashuffmyer/mcap-2020-tagseq/sequences

#load packages
module load Python/2.7.15-foss-2018b #Python
module load StringTie/2.1.4-GCC-9.3.0

#Transcript assembly: StringTie
module load GffCompare/0.12.1-GCCcore-8.3.0

#Transcript assembly QC: GFFCompare

#make gtf_list.txt file
ls AH*.gtf > gtf_list.txt

stringtie --merge -p 8 -G Montipora_capitata_HIv3.genes.gff3 -o Mcapitata_merged.gtf gtf_list.txt #Merge GTFs to form $
echo "Stringtie merge complete" $(date)

gffcompare -r Montipora_capitata_HIv3.genes.gff3 -G -o merged Mcapitata_merged.gtf #Compute the accuracy and pre$
echo "GFFcompare complete, Starting gene count matrix assembly..." $(date)

#make gtf list text file
for filename in AH*.gtf; do echo $filename $PWD/$filename; done > listGTF.txt

python ../scripts/prepDE.py -g Mcapitata_gene_count_matrix_V3.csv -i listGTF.txt #Compile the gene count matrix
echo "Gene count matrix compiled." $(date)
sbatch /data/putnamlab/ashuffmyer/mcap-2020-tagseq/scripts/prepDE_v3.sh

Rename and move gene count matrix off of server

scp ashuffmyer@ssh3.hac.uri.edu:/data/putnamlab/ashuffmyer/mcap-2020-tagseq/sequences/Mcapitata_gene_count_matrix_V3.csv ~/MyProjects/EarlyLifeHistory_Energetics/Mcap2020/Output/TagSeq
AHuffmyer commented 1 year ago

I made a revised script to alter the format of the GFF file here: https://github.com/AHuffmyer/EarlyLifeHistory_Energetics/blob/master/Mcap2020/Scripts/TagSeq/Genome_V3/fix_gff_format.Rmd. It is likely that the gff format is the problem - we need both a transcript_id and gene_id in the identifier column of the gff file.

AHuffmyer commented 1 year ago

Confirmed that this problem is solved by fixing the GFF file format. Script to do this is here: https://github.com/AHuffmyer/EarlyLifeHistory_Energetics/blob/master/Mcap2020/Scripts/TagSeq/Genome_V3/fix_gff_format.Rmd