MultiQC / MultiQC

Aggregate results from bioinformatics analyses across many samples into a single report.
http://multiqc.info
GNU General Public License v3.0

Option to not include bowtie logs within tophat folders #188

Closed jfreimer closed 8 years ago

jfreimer commented 8 years ago

Hi, great software. I have one feature suggestion. Right now MultiQC includes all of the bowtie logs within a tophat run folder, whereas I think most people only care about the results of the final tophat log. However, I don't want to exclude the bowtie module entirely, as I use it in other parts of the project. It would be nice to have an option for MultiQC to ignore these logs.

ewels commented 8 years ago

Ah yes - I created an issue for this previously - #122 but was unable to replicate it when I came to fixing the problem. Do you have a log file that shows this effect that I can use for testing?

Bowtie logs are generally really terrible though - they're incredibly minimal and really difficult to find, as well as being embedded everywhere. The module already checks for the word 'bisulfite' and skips that log if found because it was doing the same thing with logs from Bismark. I'll try to find something specific to tophat logs and skip if that string is found.

ewels commented 8 years ago

ps. Glad you like the software! :)

jfreimer commented 8 years ago

Which log files would you like me to send you? If this is possible, an easy fix might be to ignore the bowtie logs if they are in the same folder as tophat.log. Usually the tophat folder contains:

accepted_hits.bam      accepted_hits_refChr.bam      align_summary.txt  insertions.bed  logs             unmapped.bam
accepted_hits.bam.bai  accepted_hits_refChr.bam.bai  deletions.bed      junctions.bed   prep_reads.info

So the final tophat log (align_summary.txt) is in a separate folder from the logs folder which contains all of the intermediate logs.

My logs folder within tophat contains:

bam_merge_um.log            bowtie.left_kept_reads.m2g_um.log       juncs_db.log                  prep_reads.log              reports.samtools_sort.log1  segment_juncs.log
bowtie_build.log            bowtie.left_kept_reads.m2g_um_seg1.log  long_spanning_reads.segs.log  reports.log                 reports.samtools_sort.log2  tophat.log
bowtie_inspect_recons.log   bowtie.left_kept_reads.m2g_um_seg2.log  m2g_left_kept_reads.err       reports.merge_bam.log       reports.samtools_sort.log3
bowtie.left_kept_reads.log  gtf_juncs.log                           m2g_left_kept_reads.out       reports.samtools_sort.log0  run.log

ewels commented 8 years ago

Ah, that would explain it - I must have cleaned up my testing data to only contain the tophat log, hence the problem went away. I assumed that both sets of messages were wrapped up in the same log file.

It would be useful to find which file(s) contain the string "# reads processed:", if that's ok. Then maybe see the full contents of those files to see if they have anything else we can use. Otherwise, as you say, we can look at the context of the file rather than its contents. I have the file path already in hand, so I'll probably just opt for checking if it ends in logs/bowtie_xxx.log and ignore it if so (easier & faster than looking around at the other files in the same folder, though this is obviously possible if required).

Phil
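
The path-based check described above could be sketched roughly as follows. This is only an illustration, not MultiQC's actual code: the function name `is_tophat_bowtie_log` and the example paths are hypothetical, and it assumes POSIX-style paths.

```python
from pathlib import PurePosixPath


def is_tophat_bowtie_log(path: str) -> bool:
    """Return True if the path looks like a bowtie log inside a tophat logs/ folder.

    Hypothetical helper: it only inspects the path, never the file contents.
    """
    p = PurePosixPath(path)
    return (
        p.parent.name == "logs"          # file sits directly in a folder called logs/
        and p.name.startswith("bowtie")  # e.g. bowtie.left_kept_reads.log
        and p.suffix == ".log"
    )
```

Note that a check this broad would also match bowtie_build.log and bowtie_inspect_recons.log in the same folder; whether that matters depends on what those files contain.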

jfreimer commented 8 years ago

[jfreimer@h2 logs]$ grep 'processed:' *
bowtie.left_kept_reads.log:# reads processed: 40245956
bowtie.left_kept_reads.m2g_um.log:# reads processed: 8785725
bowtie.left_kept_reads.m2g_um_seg1.log:# reads processed: 2176857
bowtie.left_kept_reads.m2g_um_seg2.log:# reads processed: 915985

The logs all look like this:

# reads processed: 40245956
# reads with at least one reported alignment: 31460231 (78.17%)
# reads that failed to align: 8386873 (20.84%)
# reads with alignments suppressed due to -m: 398852 (0.99%)
Reported 128873269 alignments to 1 output stream(s)
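
For reference, a log of this shape can be pulled apart with a few regular expressions. The sketch below is an assumption about how one might do it, not the MultiQC bowtie module itself; the dictionary keys and the function name `parse_bowtie_log` are made up for illustration, and the sample text is the log quoted above.

```python
import re

# Sample text copied from the log quoted in this thread.
SAMPLE_LOG = """\
# reads processed: 40245956
# reads with at least one reported alignment: 31460231 (78.17%)
# reads that failed to align: 8386873 (20.84%)
# reads with alignments suppressed due to -m: 398852 (0.99%)
Reported 128873269 alignments to 1 output stream(s)
"""

# Hypothetical key names mapped to a pattern capturing the read count.
PATTERNS = {
    "reads_processed": r"reads processed:\s+(\d+)",
    "reads_aligned": r"reads with at least one reported alignment:\s+(\d+)",
    "not_aligned": r"reads that failed to align:\s+(\d+)",
    "multimapped": r"reads with alignments suppressed due to -m:\s+(\d+)",
}


def parse_bowtie_log(text: str) -> dict:
    """Extract integer read counts from bowtie-style log text."""
    stats = {}
    for key, pattern in PATTERNS.items():
        match = re.search(pattern, text)
        if match:
            stats[key] = int(match.group(1))
    return stats


stats = parse_bowtie_log(SAMPLE_LOG)
```

In this sample the three categories (aligned, failed to align, suppressed) add up to the number of reads processed, which makes a handy sanity check on parsed values.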

ewels commented 8 years ago

Fantastic, four different files with four different numbers - that's sure to confuse some people :)

These filenames seem pretty specific - I think I'll just check for them and skip. Only other thing - is this single end data? Will paired end data also have bowtie.right_kept_reads.log or anything? Apologies for not looking more myself - the pipeline I usually use removes all of these files so I don't have any lying around.

Phil

ewels commented 8 years ago

Ok, just pushed an update - let me know if that fixes it for you.

jfreimer commented 8 years ago

All of mine is single end data, but I believe paired end will have the right data as well.

ewels commented 8 years ago

Ok, added. Unlikely to do any harm anyway.
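
A filename-based skip covering both the left and right variants might look something like the sketch below. The regular expression and the name `should_skip` are assumptions for illustration; the actual change pushed to MultiQC is not shown in this thread and may differ.

```python
import re

# Matches the tophat-generated bowtie logs listed earlier in this thread,
# plus the assumed paired-end "right" equivalents.
TOPHAT_BOWTIE_LOG = re.compile(r"^bowtie\.(left|right)_kept_reads.*\.log$")


def should_skip(filename: str) -> bool:
    """Return True for bowtie logs that tophat writes as intermediate files."""
    return TOPHAT_BOWTIE_LOG.match(filename) is not None
```

Matching on these specific names leaves standalone bowtie logs elsewhere in a project untouched, which was the point of the original request.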

jfreimer commented 8 years ago

Works. Thanks.