UCSC-LoweLab / tRAX

tRNA Analysis of eXpression
GNU General Public License v3.0
9 stars 5 forks source link

Poor mapping of Cozen data using pipeline #5

Closed emm13 closed 3 years ago

emm13 commented 4 years ago

Dear Andrew,

This one is a weird one. I have been using your pipeline to run samples within the lab. I wanted to see how these compare to your published samples in the Cozen/Lowe paper from 2015 in terms of mappability.

I've downloaded these samples from using this accession in the Sequence read archive SRP056032. I assume these are already processed by SeqPrep as they look like they have been so I thought I'd start with alignments. I chose one WT (SRR1909121.fastq) and one AlkB treated sample (SRR1909118.fastq) and have set up an hg19 database as recommended in the paper.

Command for the AlkB sample is below bowtie2 -x ./hg19-tRNA-db-tRNAgenome -k 100 --very-sensitive --ignore-quals --np 5 --reorder -p 4 -U ./SRR1909118.fastq | ~/00_Software/ARM-Seq/choosemappings.py ./hg19-tRNA-db-trnatable.txt --progname=TRAX --fqname=./SRR1909118.fastq --expname=./samples.txt | samtools sort -T /tmp/GM05372_AlkB_3temp - -o ./BAM/GM05372_AlkB_3.bam

However, the output of this is very strange with hardly any reads aligning (.13-.16%, see below). I feel like I'm missing a trick somewhere - bowtie settings or genome build or fastq files perhaps ?

1241629 reads; of these: 1241629 (100.00%) were unpaired; of these: 1240013 (99.87%) aligned 0 times 1506 (0.12%) aligned exactly 1 time 110 (0.01%) aligned >1 times 0.13% overall alignment rate tRNA Reads with multiple transcripts:0/0 tRNA Reads with multiple anticodons:0/0 tRNA Reads with multiple aminos:0/0 Mappings Removed:417/2459

1300266 reads; of these: 1300266 (100.00%) were unpaired; of these: 1298183 (99.84%) aligned 0 times 1898 (0.15%) aligned exactly 1 time 185 (0.01%) aligned >1 times 0.16% overall alignment rate tRNA Reads with multiple transcripts:0/0 tRNA Reads with multiple anticodons:0/0 tRNA Reads with multiple aminos:0/0 Mappings Removed:502/3090

If you have any suggestions as to what might be the reason for it, please do let me know.

Kind regards, M

andrewdholmes commented 4 years ago

I believe that the ARMSeq data on the SRA has not adapter trimmed, you're going to need to remove them before mapping. We prefer submitting untrimmed reads because the adapter trimming throws some out.

Andrew Holmes

On Fri, Dec 6, 2019 at 7:15 AM Emm13 notifications@github.com wrote:

Dear Andrew,

This one is a weird one. I have been using your pipeline to run samples within the lab. I wanted to see how these compare to your published samples in the Cozen/Lowe paper from 2015 in terms of mappability.

I've downloaded these samples from using this accession in the Sequence read archive SRP056032. I assume these are already processed by SeqPrep as they look like they have been so I thought I'd start with alignments. I chose one WT (SRR1909121.fastq) and one AlkB treated sample (SRR1909118.fastq) and have set up an hg19 database as recommended in the paper.

Command for the AlkB sample is below bowtie2 -x ./hg19-tRNA-db-tRNAgenome -k 100 --very-sensitive --ignore-quals --np 5 --reorder -p 4 -U ./SRR1909118.fastq | ~/00_Software/ARM-Seq/choosemappings.py ./hg19-tRNA-db-trnatable.txt --progname=TRAX --fqname=./SRR1909118.fastq --expname=./samples.txt | samtools sort -T /tmp/GM05372_AlkB_3temp - -o ./BAM/GM05372_AlkB_3.bam

However, the output of this is very strange with hardly any reads aligning (.13-.16%, see below). I feel like I'm missing a trick somewhere - bowtie settings or genome build or fastq files perhaps ?

` 1241629 reads; of these: 1241629 (100.00%) were unpaired; of these: 1240013 (99.87%) aligned 0 times 1506 (0.12%) aligned exactly 1 time 110 (0.01%) aligned >1 times 0.13% overall alignment rate tRNA Reads with multiple transcripts:0/0 tRNA Reads with multiple anticodons:0/0 tRNA Reads with multiple aminos:0/0 Mappings Removed:417/2459

1300266 reads; of these: 1300266 (100.00%) were unpaired; of these: 1298183 (99.84%) aligned 0 times 1898 (0.15%) aligned exactly 1 time 185 (0.01%) aligned >1 times 0.16% overall alignment rate tRNA Reads with multiple transcripts:0/0 tRNA Reads with multiple anticodons:0/0 tRNA Reads with multiple aminos:0/0 Mappings Removed:502/3090 ` If you have any suggestions as to what might be the reason for it, please do let me know.

Kind regards, M

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/UCSC-LoweLab/tRAX/issues/5?email_source=notifications&email_token=AEHEMLWADPSJOYDKJTFFCB3QXJT7PA5CNFSM4JW2BPYKYY3PNVWWK3TUL52HS4DFUVEXG43VMWVGG33NNVSW45C7NFSM4H6VFV6Q, or unsubscribe https://github.com/notifications/unsubscribe-auth/AEHEMLUZUAEHCTE5URBGTLDQXJT7PANCNFSM4JW2BPYA .

emm13 commented 4 years ago

Hi Andrew,

Thank you for writing back. I'm sorry to keep pestering you about this. The reason I assumed the reads on the SRA were trimmed is that there is only one file per sample. Following the analysis outlined in your paper, this would be the case once read1 and read2 files were trimmed and merged using SeqPrep.

However, following on from your previous comment, I tried to adapter trim using the adapters listed in SeqPrep manual. Interestingly, only the adapter for the forward read primer works and that too, not on all files (eg: SRR1909074.fastq) on the SRA. Grep statement and number of hits listed below.

more Input/SRR1909074.fastq | grep -c GATCGGAAGAGC 0 more Input/SRR1909076.fastq | grep -c GATCGGAAGAGC 0 more Input/SRR1909078.fastq | grep -c GATCGGAAGAGC 1154780 more Input/SRR1909107.fastq | grep -c GATCGGAAGAGC 1154295 more Input/SRR1909108.fastq | grep -c GATCGGAAGAGC 700059 more Input/SRR1909113.fastq | grep -c GATCGGAAGAGC 1099459 more Input/SRR1909115.fastq | grep -c GATCGGAAGAGC 1301391 more Input/SRR1909116.fastq | grep -c GATCGGAAGAGC 0 more Input/SRR1909118.fastq | grep -c GATCGGAAGAGC 1241282 more Input/SRR1909119.fastq | grep -c GATCGGAAGAGC 1278277 more Input/SRR1909120.fastq | grep -c GATCGGAAGAGC 1184786 more Input/SRR1909121.fastq | grep -c GATCGGAAGAGC 1198770

My guess is that only the Read1/Forward read files have been uploaded. Additionally, for some of them, these are incorrect (could be mixed up with the yeast ones?). I don't want to spend too much more time on this but if you have any suggestions to proceed from this step, do let me know.

Kind regards,

Manasa