BrooksLabUCSC / flair

Full-Length Alternative Isoform analysis of RNA
Other
203 stars 69 forks source link

junctions_from_sam creates invalid BEDs that are silently ignored #272

Closed diekhans closed 11 months ago

diekhans commented 11 months ago

junctions_from_sam creates bed file like:

chr1    6419703 6419775 0       1
chr1    8864093 8865284 0       1
chr1    8865483 8866278 0       1
chr1    8866502 8867116 0       1
chr1    8867251 8867987 0       1
chr1    8868058 8871890 0       1

note, these lack strand.

ssCorrect.py then hallucinates STAR junction files because the name is '0' and silently rejects them. In the end, no splice junctions are added as evidence.

This fact that no splice junctions are added is lost in the stream of unhelpful progress messages.

Essentially, if the documentation is followed, short reads junctions don't work

Jeltje commented 11 months ago

It's not trying to create bed but STAR junction format. Looking into why that doesn't get read properly...

diekhans commented 11 months ago

I am so silly, thinking it produced BED, given the name and the documentation. I shouldn't be so naive!

It does "guess" (another ticket) that is a STAR a file, but then discards them all because of something to do with strand.

Jeltje @.***> writes:

It's not trying to create bed but STAR junction format. Looking into why that doesn't get read properly...

-- Reply to this email directly or view it on GitHub: https://github.com/BrooksLabUCSC/flair/issues/272#issuecomment-1692270815 You are receiving this because you authored the thread.

Message ID: @.***> It's not trying to create bed but STAR junction format. Looking into why that doesn't get read properly...

-- Reply to this email directly, [1]view it on GitHub, or [2]unsubscribe. You are receiving this because you authored the thread. Message ID: @.***>

References

  1. https://github.com/BrooksLabUCSC/flair/issues/272#issuecomment-1692270815
  2. https://github.com/notifications/unsubscribe-auth/AAA23ZO7BVZN744IWTYLQILXW6R4BANCNFSM6AAAAAA3R22XVY
diekhans commented 11 months ago

Ah yes, the program name does not claim bed, it is the doc and output file name *_junctions.bed - Junctions in BED format

The reason the result the star junction files are ignored is in ssCorrect.py is in addOtherJuncs(); Here the code examines the strand

            if starOffset:
                if strand == "1": strand = "+"
                elif strand == "2": strand = "-"
                else:
                    continue

and all of the junctions_from_sam output has a strand of `0'

:-(

diekhans commented 11 months ago

The root cause of this problem is that the pipeline that aligned the short reads did not output the XS tag:

STAR --runThreadN 8 --genomeDir genomeDir --readFilesIn H02C_8986_GATCAG_read1.fastq.gz H02C_8986_GATCAG_read2.fastq.gz --readFilesCommand pigz -p4 -dc --outStd BAM_Unsorted --outSAMtype BAM Unsorted --outSAMattributes NH HI AS NM MD --outSAMunmapped Within --outSAMattrRGline ID:H02C_8986_GATCAG PU:H02C_8986_GATCAG SM:H0CRep1 PL:ILLUMINA CN:CRG --outFilterType BySJout --outFilterMultimapNmax 10 --outFilterMismatchNmax 999 -

so junctions_from_sam creates a not-really-a-bed file with no strand and no error and then ssCorrect ignores them all.

Jeltje commented 11 months ago

Should be fixed.