biocompibens / ALFA

ALFA: Annotation Landscape for Aligned Reads
MIT License
14 stars 2 forks source link

Paired end sequencing does not give correct annotation #3

Open dominikburri opened 4 years ago

dominikburri commented 4 years ago

When working with stranded, paired end RNA-seq, ALFA does not correctly report reads on categories.

Example

I copied the supplied quick_start.chr_len.txt,quick_start.gtf and quick_start.bam into a new directory.

I adjusted the supplied quick_start.bam for a stranded paired end RNA-seq (firstread antisense) by:

I genereted alfa_index as described: alfa -a quick_start.gtf --chr_len quick_start.chr_len.txt

Output with ALFA from .bam

running alfa -g quick_start --bam paired_end.bam paired_end -s reverse produces a plot where half of the reads are considered on the opposite strand: ALFA_plots.Categories.pdf

Generating .bedgraph from STAR

When I generate .bedgraph on my own with STAR --runMode inputAlignmentsFromBAM --inputBAMfile paired_end.bam --outWigType bedGraph --outWigStrand Stranded --outWigNorm None I get 4 files, of which 2 I need.

With this example I get an empty Signal.UniqueMultiple.str1.out.bg which I have to populate with "chr1\t100\t101\t0", so I do not get an error when running ALFA.

Also, I need to rename the files:

mv Signal.UniqueMultiple.str1.out.bg Signal.UniqueMultiple.out.minus.bg
mv Signal.UniqueMultiple.str2.out.bg Signal.UniqueMultiple.out.plus.bg

Output ALFA from .bedgraph

Now running ALFA from bedgraph with alfa -g quick_start --bedgraph Signal.UniqueMultiple.out.plus.bg Signal.UniqueMultiple.out.minus.bg paired_end_STAR -s reverse gives me as result the correct annotation: ALFA_plots.Categories.pdf

Problems

Possible source of problem

ALFA uses basically bedtools genomecov with option -strand to generate .bedgraph. It could be that bedtools does ignore the SAM flags related to mates (bits: 0x40 and 0x80). Therefore read pairs are separated into different strands instead of keeping them on one strand.

mbahin commented 4 years ago

Hi Dominik,

Thanks for your detailed issue.

We have been having this discussion of integrating paired-end data processing and since both of us (authors) weren't very experienced with this kind of data, we decided not to manage it.

I don't have that much time to deal with this now but your opinion about the topic is interesting. At first glance, my intuition about the source of the problem is the same as yours.

I'm still not used to work with paired-end data but, if you are, what do you think about artificially extending the first mate on its strand up to the beginning of the second mate? Thus we could consider this as a single-end read and count its genomic features. Doing this, we would describe what was really present in the sample at the beginning (the fragment marked by the paired-end reads).

Cheers, Mathieu

dominikburri commented 4 years ago

Hi Mathieu,

I would need to think a bit more about this question, but artificially generating coverage does not sound right to me.

The procedure I suggested that uses STAR to generate .bedgraph coverage tracks works. It also has the advantage that coverage on normalised reads (reads per million) can be generated. You could simply recommend this procedure in your documentation.

On another note: I noticed that you can supply multiple samples with

alfa -g quick_start --bedgraph sample1.out.plus.bg sample1.out.minus.bg sample1 sample2.out.plus.bg sample2.out.minus.bg sample2 -s reverse

and it generates plots with both samples. However, the results of sample2 are exactly the same as for sample1. Maybe you could state more explicitly that only one sample can be used when using --bedgraph option? Or return an error if multiple samples are supplied.

Best, Dominik

mbahin commented 4 years ago

Hi Dominik,

Thanks for your suggestions, I'll take care of that when I have time to deal with that. I'll notify here.

Cheers, Mathieu