dominikburri commented 4 years ago

When working with stranded, paired end RNA-seq, ALFA does not correctly report reads on categories.

Example

I copied the supplied quick_start.chr_len.txt,quick_start.gtf and quick_start.bam into a new directory.

I adjusted the supplied quick_start.bam for a stranded paired end RNA-seq (firstread antisense) by:

doubling the first 5 reads and removing the last 5
changing SAM flags from 0 to 163 and 83
column 7 from * to =

column 8 from 0 to position of mate and saved it as paired_end.bam. Such a result can be expected when using STAR for mapping:

HWI-1KL110:162:C7268ACXX:8:1105:18693:99774 163 Chr1    300 255 50M =   300 0   ATACCAAACCAGAGAAAACTAATACATAATCGCAGAAATACAGATTACGG  @CCFFFFFHHHHHJJJJJJ+CGHIJJJJJJIJJJIJJJJJJJJJJJIJJJ  NH:i:1  HI:i:1  AS:i:43 nM:i:2
HWI-1KL110:162:C7268ACXX:8:1105:18693:99774 83  Chr1    300 255 50M =   300 0   ATACCAAACCAGAGAAAACTAATACATAATCGCAGAAATACAGATTACGG  @CCFFFFFHHHHHJJJJJJ+CGHIJJJJJJIJJJIJJJJJJJJJJJIJJJ  NH:i:1  HI:i:1  AS:i:43 nM:i:2
HWI-1KL110:162:C7268ACXX:8:1110:3857:68388  163 Chr1    300 255 50M =   300 0   ATACCAAACCAGAGAAAACAAATACATAATCGCAGAAATACAGATTACGG  CCCFFFFFGHHHHJJJJJJJJJJJIIIJJJEHJJJJJJJJIJJJJJIIJJ  NH:i:1  HI:i:1  AS:i:45 nM:i:1
HWI-1KL110:162:C7268ACXX:8:1110:3857:68388  83  Chr1    300 255 50M =   300 0   ATACCAAACCAGAGAAAACAAATACATAATCGCAGAAATACAGATTACGG  CCCFFFFFGHHHHJJJJJJJJJJJIIIJJJEHJJJJJJJJIJJJJJIIJJ  NH:i:1  HI:i:1  AS:i:45 nM:i:1
HWI-1KL110:162:C7268ACXX:8:1206:17943:50608 163 Chr1    300 255 50M =   300 0   ATACCAAACCAGAGAAAACAAATACATAATCGCAGAAATACAGATTACGG  CCCFFFFFHHHHHJJJJJJIJJJJJJJIIJJIJJJJJIJJJJIJJJJIJJ  NH:i:1  HI:i:1  AS:i:45 nM:i:1
HWI-1KL110:162:C7268ACXX:8:1206:17943:50608 83  Chr1    300 255 50M =   300 0   ATACCAAACCAGAGAAAACAAATACATAATCGCAGAAATACAGATTACGG  CCCFFFFFHHHHHJJJJJJIJJJJJJJIIJJIJJJJJIJJJJIJJJJIJJ  NH:i:1  HI:i:1  AS:i:45 nM:i:1
HWI-1KL110:162:C7268ACXX:8:1215:9399:59794  163 Chr1    500 255 50M =   500 0   ATACCAAACCAGAGAAAACAAATACATAATCGCAGAAATACAGATTACGG  @<@DDBDDFBDHHIEHGGEHICHIIIGHHIEG8DG@?FDEGGFHIGHGJJ  NH:i:1  HI:i:1  AS:i:45 nM:i:1
HWI-1KL110:162:C7268ACXX:8:1215:9399:59794  83  Chr1    500 255 50M =   500 0   ATACCAAACCAGAGAAAACAAATACATAATCGCAGAAATACAGATTACGG  @<@DDBDDFBDHHIEHGGEHICHIIIGHHIEG8DG@?FDEGGFHIGHGJJ  NH:i:1  HI:i:1  AS:i:45 nM:i:1
HWI-1KL110:162:C7268ACXX:8:2101:19036:57639 163 Chr1    500 255 50M =   500 0   ATACCAAACCAGAGAAAACAAATACATAATCGCAGAAATACAGATTACGG  CCCFFFFFHHHHHJJJJJJJJJJJJJJJJJIJJJJJJJJJJJJJJJJJJJ  NH:i:1  HI:i:1  AS:i:45 nM:i:1
HWI-1KL110:162:C7268ACXX:8:2101:19036:57639 83  Chr1    500 255 50M =   500 0   ATACCAAACCAGAGAAAACAAATACATAATCGCAGAAATACAGATTACGG  CCCFFFFFHHHHHJJJJJJJJJJJJJJJJJIJJJJJJJJJJJJJJJJJJJ  NH:i:1  HI:i:1  AS:i:45 nM:i:1

I genereted alfa_index as described: alfa -a quick_start.gtf --chr_len quick_start.chr_len.txt

Output with ALFA from .bam

running alfa -g quick_start --bam paired_end.bam paired_end -s reverse produces a plot where half of the reads are considered on the opposite strand: ALFA_plots.Categories.pdf

Generating .bedgraph from STAR

When I generate .bedgraph on my own with STAR --runMode inputAlignmentsFromBAM --inputBAMfile paired_end.bam --outWigType bedGraph --outWigStrand Stranded --outWigNorm None I get 4 files, of which 2 I need.

With this example I get an empty Signal.UniqueMultiple.str1.out.bg which I have to populate with "chr1\t100\t101\t0", so I do not get an error when running ALFA.

Also, I need to rename the files:

mv Signal.UniqueMultiple.str1.out.bg Signal.UniqueMultiple.out.minus.bg
mv Signal.UniqueMultiple.str2.out.bg Signal.UniqueMultiple.out.plus.bg

Output ALFA from .bedgraph

Now running ALFA from bedgraph with alfa -g quick_start --bedgraph Signal.UniqueMultiple.out.plus.bg Signal.UniqueMultiple.out.minus.bg paired_end_STAR -s reverse gives me as result the correct annotation: ALFA_plots.Categories.pdf

Problems

ALFA does not produce the correct proportions of reads when using stranded, paired end data.
The documentation on strandness is very sparse and confusing. In alfa.py when generating bedgraph from bam the parameters for reverse are set such that the minus strand is called plus and vice versa. Why?
Working with ALFA with own .bedgraph is tedious because manual renaming of files is necessary (requirement to have .plus. and .minus. in the filename).

Possible source of problem

ALFA uses basically bedtools genomecov with option -strand to generate .bedgraph. It could be that bedtools does ignore the SAM flags related to mates (bits: 0x40 and 0x80). Therefore read pairs are separated into different strands instead of keeping them on one strand.

mbahin commented 4 years ago

Hi Dominik,

Thanks for your detailed issue.

We have been having this discussion of integrating paired-end data processing and since both of us (authors) weren't very experienced with this kind of data, we decided not to manage it.

I don't have that much time to deal with this now but your opinion about the topic is interesting. At first glance, my intuition about the source of the problem is the same as yours.

I'm still not used to work with paired-end data but, if you are, what do you think about artificially extending the first mate on its strand up to the beginning of the second mate? Thus we could consider this as a single-end read and count its genomic features. Doing this, we would describe what was really present in the sample at the beginning (the fragment marked by the paired-end reads).

Cheers, Mathieu

dominikburri commented 4 years ago

Hi Mathieu,

I would need to think a bit more about this question, but artificially generating coverage does not sound right to me.

The procedure I suggested that uses STAR to generate .bedgraph coverage tracks works. It also has the advantage that coverage on normalised reads (reads per million) can be generated. You could simply recommend this procedure in your documentation.

On another note: I noticed that you can supply multiple samples with

alfa -g quick_start --bedgraph sample1.out.plus.bg sample1.out.minus.bg sample1 sample2.out.plus.bg sample2.out.minus.bg sample2 -s reverse

and it generates plots with both samples. However, the results of sample2 are exactly the same as for sample1. Maybe you could state more explicitly that only one sample can be used when using --bedgraph option? Or return an error if multiple samples are supplied.

Best, Dominik