Alignment of CAGE-seq reads

GrossTor commented 2 years ago

Thank you for all the great work in this repo.

I am trying to reproduce and potentially adapt the pre-processing pipeline for the generation of coverage tracks. I noticed that when using bwa-mem as the alignment method with parameters as suggested in the answer to #88, I can, on average, only align about 15% of the reads in the FANTOM5 CAGE-seq fastq files. In contrast, bowtie2 with default parameters aligns close to 50% (which might still be a low number). Is this expected?

For example, aligning DRR009487.fastq with bwa-mem and then running samtools flagstat on the returned sam files gives:

... 8102937 + 0 mapped (16.58% : N/A) 7169341 + 0 primary mapped (14.96% : N/A) ...

for bowtie2 these numbers are: ... 22687912 + 0 mapped (47.33% : N/A) 22687912 + 0 primary mapped (47.33% : N/A) ...

Also, from a brief look it seems that the aligned reads from bwa-mem are not really a subset of the bowtie2 ones, and that reads that get aligned by both algorithms do not agree in their mapped position. This makes me wonder whether I am using the alignment tools correctly. Do I have to pre-process the fastq files in anyway? (I did notice for example that the CAGE-seq reads are highly enriched with CTAG repeats and I am not sure whether that is a technical artefact that might need to be removed before alignment.)

davek44 commented 2 years ago

Hi, my alignment stats map ~19m reads, so in between those two. My exact bwa arguments were: bwa mem -t 4 -k 17 -w 20 -r 1.2 -c 1000 -h 16 -M -B 3 -O 4 -T 20

GrossTor commented 2 years ago

Thank you very much. I was able to also map ~19m reads with your bwa parameters.

GrossTor commented 2 years ago

Hello @davek44 , my excuses to bother you again with this. I just saw that the cross-species paper mentions you ran fastp prior to alignment with bwa. However, I am running into trouble by using it on fantom5 CAGE-seq data due to read quality scores. From a Fantom5 paper supplement:

Sequenced Heliscope reads have a high sequencing error rate (~5%), vary in length and lack an estimation of base qualities. Combined these factors make the data processing challenging.

In fact, the fantom5 fastq files annotate the same low read quality for all nucleotides, such that all reads get discarded by fastp (using default parameters). Have you or how have you used fastp on the fantom5 fastq files?

Another issues with those fastq files is that the have strange CTAG repeats all over the reads :

I couldn't find any biological nor technical explanation for it. And I am worried that this will make the alignment very difficult (see my initial comment). Do you have an idea where these might come from?

So I am wondering if it would be advantageous to directly use the BAM files provided by fantom5 (e.g. here) instead of bwa-aligning myself. On the downside, muliti-mapped reads will not be annotated (fantom5 BAMs have no "XA" tag) and thus cannot be distributed by the EM algorithm in bam_cov.py. But maybe that is not a big issue if the fantom5-alignment has a better quality as it's been tailored to Heliscope reads. Do you have an opinion on this?

a1ultima commented 2 years ago

Hi,

Thanks for the insights here,

Do you know if it would be possible to adapt this for working with RNA-seq as inputs rather than CAGE-seq?

I'm curious why in the paper there is little mention of RNA-seq as a viable alternative input for alignment.

Cheers,

AB

davek44 commented 2 years ago

GrossTor, looking back at my records, I skipped the fastp step for CAGE and used their files directly. I guess I assumed that they had already QC'd them. The tradeoffs that you suggest for using their BAM files are accurate. I don't have a strong opinion on this. Personally, I looked at their BAM files and felt like there were lots of false positive alignments, but I could be wrong.

AB, I've focused on CAGE-seq because it concentrates the gene expression signal at the TSS where the most informative nucleotides are. This isn't true for RNA-seq, where the aligned reads may be many hundreds of thousands of nucleotides away from the relevant regulatory sequences.

calico / basenji

Alignment of CAGE-seq reads #114