Junction support numbers

KuechlerO commented 2 years ago

Hi, thx again for your great work with Slinker! ^^

I've been playing around with it, and since I have some use-cases, I tried to compare Slinker's output with the IGV-Browser's output. However, I noticed, that for my case-sample, the number of displayed supporting split-reads for its junctions were differening from the numbers displayed with IGV. But for the control-samples everything seemed alright. (It should be 34 for the control, but 31 & 25 for the test sample, instead of 28 & 67, which were produced by Slinker.)

Could you please provide me with some more information, how exactly the junction count numbers are derived? Thank you so much in advance!

Here's the Slinker output: Slinker_result

And here's the IGV-output, whereby the bottom sample is the control sample: IGV_result

I also continued digging into your code, but was not really able to grasp the whole process. However, I noticed some things in tools/Canvas/Canvas/track/junction.py: In the lines following to """ Taken from Alex Dobin's excellent STAR aligner SJ script - reimplemented in Python.""", I think there are some errors:

Variable tis set, but never used

There is one if statement, which is not reachable:

if (op_type == 4) or (op_type == 1):
t += op_length
elif op_type == 4:      # TODO already covered by the above?
    g += op_length

Thanks again for your work & best regards! :)

breons commented 2 years ago

Hi Oliver,

That looks cool! Glad you got it running on your data.

Just messaging to say I've read this and will run some tests to see what's up. But I am just a bit swamped over the next day or two.

In the mean time I'm curious to know if you ran the below whether you get similar results to IGV or Slinker (just feed in the SAM files, i.e. awk -f below-script.awk | sort -V > junctions.txt)

I notice the highlighting is off too - those should be truncated exons. I will look into that given your output. Ideally I should have some unit tests for this. Thanks for sharing!

Finally, I'll say that you have some options to tweak that I think can help remove transcripts with small coverage (likely few of the assembled are of less interest than others). These aren't documented yet.

-p tpm=int and -p c=int will filter out transcripts that do not exceed these set thresholds. They marry up to the parameters set in stringtie2 (https://ccb.jhu.edu/software/stringtie/index.shtml?t=manual). You will need to delete everything and rerun the pipeline to apply these each time at this stage, but it's pretty quick. You might find this removes some redundant transcripts as far as visualising this event is concerned.

Thanks again, shall get to it soon. Breon.

KuechlerO commented 2 years ago

In the mean time I'm curious to know if you ran the below whether you get similar results to IGV or Slinker (just feed in the SAM files, i.e. awk -f below-script.awk | sort -V > junctions.txt)

I have done as you requested for my original 19-5325_1.bam-file, and the produced outputs for both scripts agree with the IGV-output, meaning the two junctions of interest have been recognized successfully with agreeing outputs (with an offset of +1 for the start-coordinate in comparison to IGV):

Normal junction: 31 supporting split reads
Aberrant junction: 25 supporting split reads

breons commented 2 years ago

Hi Oliver, sorry about the delay here.

Thanks for that. Slinker should be trying to replicate the results from the first script listed above. Given you've matched those with your IGV view indicates a problem with Slinker.

As you mentioned, there was one statement that was not reachable. This was an oversight and has been fixed in a recent update. Either pull the latest change (making sure that you don't overwrite your local environment information like in tools.groovy), or simply just change your above snippet to:

if (op_type == 4) or (op_type == 1):
    t += op_length
elif op_type == 2:  
        g += op_length

I don't see how the above will resolve your problem as this has the same effect in the end. In terms of the unused t variable, the script does the same. I kept it in just in case.

I've created a standalone script for testing my implementation of Alex's script.

Basically you run it like: python $PATHTO/junctions_test.py bam_file.bam MSTRG.1 Where MSTRG.1 is the superTranscript reference name that you can get with a samtools idxstats bam_file.bam

It'll output a file with all the junctions that Slinker generates.

One thing that I noticed differed between my implementation and Alex's was a check to see whether a read was adding the same junction twice. I've implemented a solution here which results in fewer junctions than if I run the script above, but also fewer than Alex's scripts.

Keen to see if either script is closer in your scenario.

Oshlack / Slinker

Junction support numbers #5