Mapping rate no longer reported by any pipeline

IanSudbery commented 5 years ago

Since the mapping pipeline was split into mapping and bamstats, as far as I can tell no pipeline now reports very basic statistics about mapped files, such as % mapping rate, % spliced reads, etc.

By preference I think that the mapping pipeline should report this for two reasons:

I can't imagine anyone ever mapping a set of reads and not wanting to see the mapping rate
The best way to obtain the mapping rate is going to depend on the mapper used. For example, STAR reports it in its output file, whereas for BWA it will need to be calculated from the BAM file.
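
For the BWA case, here is a minimal sketch of what "calculated from the BAM file" could look like, assuming the BAM has already been summarised with samtools flagstat and the text output is at hand. The function name mapping_rate_from_flagstat is hypothetical and not part of cgat-flow, and the line layout it expects ("N + M in total ...", "N + M mapped (...)") is the usual flagstat format as I understand it. Note that flagstat counts alignment records, so secondary and supplementary alignments are included in both numbers.

```python
import re

# One flagstat line looks like: "<QC-passed> + <QC-failed> <label>"
FLAGSTAT_LINE = re.compile(r"^(\d+) \+ (\d+) (.+)$")


def mapping_rate_from_flagstat(text):
    """Return the % of QC-passed records that are mapped.

    `text` is the captured output of `samtools flagstat` (assumed layout).
    """
    total = mapped = None
    for line in text.splitlines():
        match = FLAGSTAT_LINE.match(line.strip())
        if match is None:
            continue
        passed, label = int(match.group(1)), match.group(3)
        if label.startswith("in total"):
            total = passed
        elif label.startswith("mapped ("):
            # "primary mapped (...)" in newer samtools is deliberately skipped
            mapped = passed
    if not total or mapped is None:
        raise ValueError("could not find total/mapped counts in flagstat text")
    return 100.0 * mapped / total
```

Parsing the text keeps the sketch independent of any BAM library; a pipeline task would only need to run flagstat once per file and pass the result through.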

I will try to have a look at this and the mapping tuples/compression option issue (#80) this week if I find time.

Acribbs commented 5 years ago

I now usually use the MultiQC stats for my mapping rate.

IanSudbery commented 5 years ago

Running MultiQC via cgatflow mapping make build_report produces the following error:

Traceback (most recent call last):
      File "/shared/sudlab1/General/apps/conda/conda-install/envs/cgat-f/lib/python3.6/site-packages/ruffus/task.py", line 748, in run_pooled_job_without_exceptions
        register_cleanup, touch_files_only)
      File "/shared/sudlab1/General/apps/conda/conda-install/envs/cgat-f/lib/python3.6/site-packages/ruffus/task.py", line 632, in job_wrapper_output_files
        output_files_only=True)
      File "/shared/sudlab1/General/apps/conda/conda-install/envs/cgat-f/lib/python3.6/site-packages/ruffus/task.py", line 561, in job_wrapper_io_files
        ret_val = user_defined_work_func(*(params[1:]))
      File "/shared/sudlab1/General/apps/conda/cgat-flow-devel/cgatpipelines/tools/pipeline_mapping.py", line 2238, in renderMultiqc
        P.run(statement)
      File "/shared/sudlab1/General/apps/conda/cgat-core/cgatcore/pipeline/execution.py", line 1335, in run
        benchmark_data = r.run(statement_list)
      File "/shared/sudlab1/General/apps/conda/cgat-core/cgatcore/pipeline/execution.py", line 939, in run
        job_path)
      File "/shared/sudlab1/General/apps/conda/cgat-core/cgatcore/pipeline/execution.py", line 866, in collect_single_job_from_cluster
        job_id, retval.exitStatus, "".join(stderr), statement))
    OSError: ---------------------------------------
    Job 3564931 exited with error code 1: 
    The stderr was: 
    /etc/bashrc: line 12: PS1: unbound variable
    [WARNING]         multiqc : MultiQC Version v1.7 now available!
    [INFO   ]         multiqc : This is MultiQC v1.5.dev0
    [INFO   ]         multiqc : Template    : default
    [INFO   ]         multiqc : Searching '.'
    [WARNING]         multiqc : No analysis results found. Cleaning up..
    [INFO   ]         multiqc : MultiQC complete
    mv: cannot stat 'multiqc_report.html': No such file or directory

    export LC_ALL=en_GB.UTF-8 && export LANG=en_GB.UTF-8 && multiqc . -f && mv multiqc_report.html MultiQC_report.dir/

Acribbs commented 5 years ago

Which mapper were you using? I suspect the logs we output do not match the input MultiQC requires for some of our mappers. I know this is the case for salmon in transdiffexpres, and I think possibly for STAR in mapping. I think it is due to the way we redirect the outputs to logs.

IanSudbery commented 5 years ago

We mostly use STAR, Salmon and BWA.

BWA isn't even supported by MultiQC, mostly because I don't think it outputs a log file of any sort.

Acribbs commented 5 years ago

For STAR: This MultiQC module parses summary statistics from the Log.final.out log files. Sample names are taken either from the filename prefix (sampleNameLog.final.out) when set with --outFileNamePrefix in STAR

However, the output of our star mapping produces this.

When I run the pipeline, it generates the appropriate output for both bowtie and STAR (I'm using our pipeline test data), but obviously not BWA. The reason they don't support BWA is that the logs don't produce anything worth parsing, so their idea was to rely on downstream tools. See: https://github.com/ewels/MultiQC/issues/162

Did your mapper fail or is there something else that prevented logs being output from STAR?
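
For reference, the Log.final.out summary mentioned above is a plain "field | value" table, so the pipeline could read the STAR mapping rate directly if the file is present. This is only a sketch: parse_star_final_log and percent are hypothetical helper names, and the field names used in the example ("Uniquely mapped reads %", "% of reads mapped to multiple loci") are the ones I believe STAR writes.

```python
def parse_star_final_log(text):
    """Return {field name: value string} from the text of a STAR Log.final.out."""
    stats = {}
    for line in text.splitlines():
        if "|" not in line:
            continue  # blank lines and section headings such as "UNIQUE READS:"
        key, _, value = line.partition("|")
        stats[key.strip()] = value.strip()
    return stats


def percent(stats, field):
    """Convert a value such as '85.00%' to the float 85.0."""
    return float(stats[field].rstrip("%"))
```

The overall mapped fraction would then be the sum of the unique and multi-mapped percentages, e.g. percent(stats, "Uniquely mapped reads %") + percent(stats, "% of reads mapped to multiple loci").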

IanSudbery commented 5 years ago

The particular example here is BWA (which is probably the mapper we use the most - we do most RNAseq with salmon these days).

We used to actually calculate the mapping rate rather than rely on logs.
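
To illustrate what calculating it directly involves (this is not the code the old pipeline used, just a sketch under my own assumptions), the rate can be derived from the FLAG field of each alignment record. The function below takes SAM-format text lines, for example the output of samtools view, and the name mapping_rate_from_sam is hypothetical. It counts each primary record once, so for paired-end data each mate counts as one read.

```python
UNMAPPED = 0x4         # SAM flag: read is unmapped
SECONDARY = 0x100      # SAM flag: secondary alignment
SUPPLEMENTARY = 0x800  # SAM flag: supplementary alignment


def mapping_rate_from_sam(lines):
    """Return the % of primary records that are mapped."""
    total = mapped = 0
    for line in lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        flag = int(line.split("\t")[1])
        if flag & (SECONDARY | SUPPLEMENTARY):
            continue  # count each read once, via its primary record
        total += 1
        if not flag & UNMAPPED:
            mapped += 1
    if total == 0:
        raise ValueError("no primary alignment records found")
    return 100.0 * mapped / total
```

Unlike the log-based approach, this works the same way for any mapper that writes standard SAM/BAM, which is the main argument for doing it inside the mapping pipeline.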

cgat-developers / cgat-flow

Mapping rate no longer reported by any pipeline #85