Closed AHuffmyer closed 1 year ago
I wrote a script to revise the format of the GFF file: https://github.com/AHuffmyer/EarlyLifeHistory_Energetics/blob/master/Mcap2020/Scripts/TagSeq/Genome_V3/fix_gff_format.Rmd. The GFF format is likely the problem: we need both a transcript_id and a gene_id in the identifier (attributes) column of the GFF file.
Confirmed that this problem is solved by fixing the GFF file format. Script to do this is here: https://github.com/AHuffmyer/EarlyLifeHistory_Energetics/blob/master/Mcap2020/Scripts/TagSeq/Genome_V3/fix_gff_format.Rmd
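For readers who want to see the idea without opening the R Markdown script, here is a minimal Python sketch of adding both identifiers to the attributes (ninth) column. It is not the linked script. It assumes `ID=`/`Parent=` style attributes and a hypothetical `.t<N>` transcript suffix (e.g. `g1.t1` belongs to gene `g1`); the function name `add_ids` is also just illustrative, so adjust the pattern to the naming in your own GFF.

```python
import re

# Hypothetical naming convention: transcript "g1.t1" belongs to gene "g1".
TRANSCRIPT_SUFFIX = re.compile(r"\.t\d+$")


def add_ids(line):
    """Append transcript_id and gene_id to the attributes column of one GFF line."""
    # Leave comments and blank lines untouched.
    if line.startswith("#") or not line.strip():
        return line
    newline = "\n" if line.endswith("\n") else ""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        return line  # not a standard 9-column feature line
    attrs = dict(kv.split("=", 1) for kv in fields[8].split(";") if "=" in kv)
    # Assumption: genes and transcripts are named by ID, while sub-features
    # (exon, CDS, ...) point at their transcript through Parent.
    if fields[2] in ("gene", "mRNA", "transcript"):
        transcript = attrs.get("ID")
    else:
        transcript = attrs.get("Parent")
    if transcript is None:
        return line
    # For gene rows there is no suffix to strip, so both ids are the gene name.
    gene = TRANSCRIPT_SUFFIX.sub("", transcript)
    fields[8] = fields[8].rstrip(";") + f";transcript_id={transcript};gene_id={gene}"
    return "\t".join(fields) + newline
```

Applying `add_ids` to every line of the GFF and writing the result to a new file gives an annotation in which each feature carries both identifiers.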
I am re-running gene expression analysis using the M. capitata reference genome and functional annotation from Stephens et al. 2022 available from Rutgers Cyanophora.
Has anyone else tried aligning and analyzing TagSeq or RNAseq data to this version?
I am having a problem with StringTie: it assigns STRG identifiers rather than gene names to sequences that align to the genome. I ran the same code on previous versions of the genome (versions 1 and 2) and did not have this problem, so it is specific to this version of the genome.
Any ideas on how to fix this StringTie problem? Perhaps I am using incorrect options for the merging steps?
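A quick way to see how widespread the problem is in a given StringTie output is to count how many distinct gene IDs start with `STRG.` (or `MSTRG.` after merging) versus reference names. The sketch below assumes the usual GTF attribute layout (`gene_id "..."; transcript_id "...";`); the function name is illustrative.

```python
import re
from collections import Counter

GENE_ID = re.compile(r'gene_id "([^"]+)"')


def count_gene_id_types(gtf_lines):
    """Count distinct gene_ids that are StringTie-generated vs. reference names."""
    counts = Counter()
    seen = set()
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        match = GENE_ID.search(line)
        if match is None or match.group(1) in seen:
            continue
        seen.add(match.group(1))
        # STRG.* (per-sample assembly) and MSTRG.* (merge mode) are StringTie's own ids.
        is_strg = match.group(1).startswith(("STRG.", "MSTRG."))
        counts["STRG" if is_strg else "reference"] += 1
    return counts
```

Passing an open file handle for one sample's `.gtf` works, since the function only iterates over lines. If almost every gene is counted as `STRG`, the annotation is probably not being matched at all.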
Here is my full code:
Mcap Early Life Gene Expression
Genome version V3
Genome publication:
https://academic.oup.com/gigascience/article/doi/10.1093/gigascience/giac098/6815755
Download, QC, filtering/cleaning done previously. Starting at the alignment step.
1. HISAT2
Obtain reference genome assembly and gff annotation file
Unzip gff and genome file
This creates a .bam file for each sample.
Alignment rates were similar between V1, V2, and V3 (68-72%).
2. StringTie 2
-p means number of threads/CPUs to use (8 here)
-e means only estimate abundance of the given reference transcripts (only genes from the genome); don't use it if you are running a splice-variant-aware analysis that merges novel and reference-based transcripts.
-B means enable output of Ballgown table files, written to the same directory as the output GTF
-G means the reference annotation file (the GFF) to use; with -e, these are the reference transcripts whose abundance is estimated
This will make a .gtf file for each sample.
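Putting the flags above together, a per-sample StringTie call might look like the sketch below. This is not the exact command from my script. The file names in the usage note are placeholders and the helper name `stringtie_command` is illustrative; only the flags described above plus `-o` (output GTF) are used.

```python
def stringtie_command(bam, annotation, out_gtf, threads=8):
    """Build the argument list for one StringTie run (nothing is executed here)."""
    return [
        "stringtie",
        "-p", str(threads),  # number of threads/CPUs
        "-e",                # only estimate abundance of reference transcripts
        "-B",                # also write Ballgown table files next to the GTF
        "-G", annotation,    # reference annotation (GFF/GTF)
        "-o", out_gtf,       # output GTF for this sample
        bam,                 # sorted alignment file from the HISAT2 step
    ]
```

To run it for one sample, pass the list to `subprocess.run(stringtie_command("sample1.bam", "genome_fixed.gff3", "sample1.gtf"), check=True)`, where all three file names are hypothetical.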
3. Prep DE
Rename and move gene count matrix off of server
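For this step, StringTie's `prepDE.py` can take (via `-i`) a text file listing one sample ID and the path to its GTF per line. A minimal sketch for generating those lines follows. It assumes each sample's GTF is named after the sample (e.g. `sample1.gtf`), which may not match your file layout, and the function name is illustrative.

```python
from pathlib import Path


def sample_list_lines(gtf_paths):
    """Return one "sample_id path" line per GTF, using the file stem as sample id."""
    # Assumption: the file name without ".gtf" is the sample id.
    return [f"{Path(p).stem} {p}" for p in sorted(gtf_paths)]
```

Writing `"\n".join(sample_list_lines(paths))` to a text file gives the list that `prepDE.py -i` reads to build the gene count matrix.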