Gaius-Augustus / BRAKER

BRAKER is a pipeline for fully automated prediction of protein coding gene structures with GeneMark-ES/ET/EP/ETP and AUGUSTUS in novel eukaryotic genomes
Other
368 stars 81 forks source link

STAR vs HiSat2 for RNA-seq data in Braker3 #585

Open JohnUrban opened 1 year ago

JohnUrban commented 1 year ago

Hi again,

For Braker1, I had been providing BAM files produced by STAR.

Since you are now using StringTie assembly as part of the Braker pipeline, I was wondering if there are any notable performance differences when providing BAM files mapped with STAR vs BAM files mapped with HiSat2 (or providing the Fastq files instead so Braker can use HiSat2).

I also noticed that when providing the STAR BAMs as --stranded=+,- --bam=${FWD},${REV}, StringTie is used to produce a forward strand assembly and a reverse strand assembly -- and these are subsequently merged. This seems reasonable (although can't you just tell StringTie it is strand-specific). I have a few questions here:

  1. Is there a way to provide strand information when using FASTQs -- e.g. --stranded=+,- --rnaseq_sets_dirs=${FWD},${REV}?

  2. I am using single-end reads (not paired-end -- and I am not sure why the experimenters made this decision, but c'est la vie). Given the strand-specific RNA-seq protocol, read1 is always on the negative strand with respect to the actual RNA sequence. So for my single-end data, that means the reads in FWD and REV map to the FWD and REV strands with respect to the genome sequence, but the RNA sequence/direction is actually on the opposite strands (REV and FWD, respectively). How might this affect the StringTie assembly steps? If the appropriate parameter is passed to StringTie, I know it is able to anticipate and account for this. If StringTie is not given the appropriate parameter for this strand information, then I am wondering if it would be attempting to assembly transcripts on the wrong strand...? Perhaps not if it is using splice-junctions to help learn the gene direction....

Note that I am also wondering if some of my questions above may have relevance to issues we are facing in #577 and #582. E.g. if StringTie is assembling funky transcripts for any reason, perhaps that could cause issues that appear like "not enough data" for the GeneMark errors we are seeing.

Best,

John

JohnUrban commented 1 year ago

Update: providing a raw fastq, instead of a STAR produced BAM, solved GeneMark-ETP issues we were having. See #582.

JohnUrban commented 1 year ago

If Braker3 is only using the RNA-seq data to produce a StringTie assembly, then an option to just pass Braker3 the StringTie assembly (or any reference-guided transcriptome assembly) would be great in my opinion.

That would potentially bypass the "STAR BAM" problem as well as allow users to provide strand-specific transcriptome assemblies -- i.e. those produced by:

hisat2 --rna-strandness RF -p ${THREADS} --dta -x ${HIDX} -1 ${R1} -2 ${R2} -S ${R}.sam
samtools sort -@ ${THREADS} -o ${R}.bam ${R}.sam
stringtie --rf -p ${THREADS} -o ${R}.gtf -l ${R} ${R}.bam
KatharinaHoff commented 1 year ago

This issue affects both BRAKER and GeneMark-ETP. I will not touch it in BRAKER before it's fixed in GeneMark-ETP.