CraigIDent / SpliSER

Bioinformatic tool for Splice site Strength Estimation using RNA-seq
15 stars 5 forks source link

Problems of generating splice junctions files from regtools #6

Open santataRU opened 2 weeks ago

santataRU commented 2 weeks ago

I have RNA-seq data from two conditions (control and splicing factor knockout) and would like to use the SpliSER tool to analyze differential splice site usage between these conditions. To do so, I need to generate splice junction files in .bed format, which will serve as part of the input for the SpliSER tool.

I'm using regtools 0.5.0 for data from HISAT2, and I know the strandedness of my RNA-seq data is fr-firststrand/RF. According to the regtools options, I tried using regtools junctions extract -s 1 indexed_alignments.bam. In the 6th column of the output .bed file, I see some junctions labeled with "+", some with "-", and some with "?". I observe an apparently random alternation of + and - in the 6th column of the bed file.

I also tried using regtools junctions extract -s 0 indexed_alignments.bam, since someone from the forum suggested this option will make regtools use the -XS tag from the bam files of HISAT2 alignment, but I see similar patterns in the 6th column of the output.bed file, with some junctions labeled with "+", some with "-", and some with "?". I'm unsure what the correct approach is in this case.

I appreciate any of your input.

CraigIDent commented 1 week ago

Hi santata, I've had this issue too with some alignments (from Tophat2), where I couldn't get Regtools to play well with stranded paired-end data. I could recommend trying the latest version of Regtools (v1.0.0). Maybe you'll have more luck.

I'll leave this issue open, I've been thinking about alternative pipelines, for times when Regtools doesn't seem to work. Will keep you updated.

santataRU commented 1 week ago

Hi santata, I've had this issue too with some alignments (from Tophat2), where I couldn't get Regtools to play well with stranded paired-end data. I could recommend trying the latest version of Regtools (v1.0.0). Maybe you'll have more luck.

I'll leave this issue open, I've been thinking about alternative pipelines, for times when Regtools doesn't seem to work. Will keep you updated.

Dear CraigIDent,

Thank you for your suggestions. I identified the cause of the issue. I was using the Partek Flow platform to align my RNA-seq reads with the HISAT2 aligner. The default setting in Partek Flow is unstranded, so the output BAM files do not contain the "XS" attributes. There are two options in HISAT2 that generate "XS" attributes in the output BAM files: the --dta option and the --rna-strandness option.

I tried both options, and while both produced BAM files with "XS" attributes, the --dta option resulted in significantly fewer alignments. This aligns with the HISAT2 manual, which notes that the --dta option is tailored for transcript assemblers, such as StringTie. With this option, HISAT2 requires longer anchor lengths for de novo discovery of splice sites, which leads to fewer alignments. In my case, the --rna-strandness option worked well for both versions of Regtools (v1.0.0 and v0.0.5).

For Regtools v1.0.0, the appropriate command is:

regtools junctions extract -s RF sorted_indexed_alignments.bam

whereas, for v0.0.5, the command is:

regtools junctions extract -s 1 sorted_indexed_alignments.bam

Both versions of Regtools produced the same results on my test BAM file.

I did notice that the Regtools junction extract tool (in both versions v0.0.5 and v1.0.0) counts overlapping reads as double support for a junction, which has been discussed in this post: https://github.com/griffithlab/regtools/issues/187. According to the developer's response in that thread, this issue has been addressed in the latest update (v1.1.0). However, I have been unable to find a way to download the latest update of Regtools.

Thank you again for your help.

Best regards, Xiao

santataRU commented 5 days ago

Hi santata, I've had this issue too with some alignments (from Tophat2), where I couldn't get Regtools to play well with stranded paired-end data. I could recommend trying the latest version of Regtools (v1.0.0). Maybe you'll have more luck.

I'll leave this issue open, I've been thinking about alternative pipelines, for times when Regtools doesn't seem to work. Will keep you updated.

Dear CraigIDent,

While Regtools (v1.0.0) seems to work with my test BAM file generated with HISAT2 from a test FASTQ file containing only 10 reads, it malfunctions when processing the actual BAM file from the full FASTQ dataset. The output .bed file contains a total of 606,976 junctions, with 223,254 marked with "-", 231,447 marked with "+", and 152,275 marked with "?".

I am encountering the same issue as I previously posted, so I'll leave this open for now.

Best regards, Xiao