cgat-developers / cgat-flow

MIT License

Mapping rate no longer reported by any pipeline #85

Open IanSudbery opened 5 years ago

IanSudbery commented 5 years ago

Since the mapping pipeline was split into mapping and bamstats, as far as I can tell no pipeline now reports very basic statistics about mapped files, such as % mapping rate, % spliced reads, etc.

By preference I think that the mapping pipeline should report this for two reasons:

I will try to have a look at this and the mapping tuples/compression option thing #80 this week if I find time.

Acribbs commented 5 years ago

I now usually use multiqc stats as my mapping rate.

IanSudbery commented 5 years ago

Running MultiQC via cgatflow mapping make build_report produces the following error:

Traceback (most recent call last):
      File "/shared/sudlab1/General/apps/conda/conda-install/envs/cgat-f/lib/python3.6/site-packages/ruffus/task.py", line 748, in run_pooled_job_without_exceptions
        register_cleanup, touch_files_only)
      File "/shared/sudlab1/General/apps/conda/conda-install/envs/cgat-f/lib/python3.6/site-packages/ruffus/task.py", line 632, in job_wrapper_output_files
        output_files_only=True)
      File "/shared/sudlab1/General/apps/conda/conda-install/envs/cgat-f/lib/python3.6/site-packages/ruffus/task.py", line 561, in job_wrapper_io_files
        ret_val = user_defined_work_func(*(params[1:]))
      File "/shared/sudlab1/General/apps/conda/cgat-flow-devel/cgatpipelines/tools/pipeline_mapping.py", line 2238, in renderMultiqc
        P.run(statement)
      File "/shared/sudlab1/General/apps/conda/cgat-core/cgatcore/pipeline/execution.py", line 1335, in run
        benchmark_data = r.run(statement_list)
      File "/shared/sudlab1/General/apps/conda/cgat-core/cgatcore/pipeline/execution.py", line 939, in run
        job_path)
      File "/shared/sudlab1/General/apps/conda/cgat-core/cgatcore/pipeline/execution.py", line 866, in collect_single_job_from_cluster
        job_id, retval.exitStatus, "".join(stderr), statement))
    OSError: ---------------------------------------
    Job 3564931 exited with error code 1: 
    The stderr was: 
    /etc/bashrc: line 12: PS1: unbound variable
    [WARNING]         multiqc : MultiQC Version v1.7 now available!
    [INFO   ]         multiqc : This is MultiQC v1.5.dev0
    [INFO   ]         multiqc : Template    : default
    [INFO   ]         multiqc : Searching '.'
    [WARNING]         multiqc : No analysis results found. Cleaning up..
    [INFO   ]         multiqc : MultiQC complete
    mv: cannot stat 'multiqc_report.html': No such file or directory

    export LC_ALL=en_GB.UTF-8 && export LANG=en_GB.UTF-8 && multiqc . -f && mv multiqc_report.html MultiQC_report.dir/

Acribbs commented 5 years ago

Which mapper were you using? I suspect the outputs of our logs do not match the required input for some of our mappers in MultiQC. I know this is the case for salmon in transdiffexpres, and I think possibly for STAR in mapping. I think it is due to the way we redirect the outputs to logs.

IanSudbery commented 5 years ago

We mostly use STAR, Salmon and BWA.

BWA isn't even supported by MultiQC, mostly because I don't think it outputs a log file of any sort.

Acribbs commented 5 years ago

For STAR, from the MultiQC documentation: "This MultiQC module parses summary statistics from the Log.final.out log files. Sample names are taken either from the filename prefix (sampleNameLog.final.out) when set with --outFileNamePrefix in STAR"

However, the output of our STAR mapping produces this.

When I run the pipeline, it generates the appropriate output for both bowtie and STAR (I'm using our pipeline test data), but obviously not BWA. The reason they don't support BWA is that the logs don't produce anything worth parsing, so their idea was to rely on downstream tools. See: https://github.com/ewels/MultiQC/issues/162

Did your mapper fail or is there something else that prevented logs being output from STAR?
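
One quick way to check whether MultiQC has anything to find for STAR is to search the run directory for files whose names end in Log.final.out, as the quoted documentation describes. The sketch below is only an illustration: find_star_logs is a hypothetical helper, not part of cgat-flow or MultiQC, and falling back to the parent directory name when there is no filename prefix is an assumption about how a sample name might be chosen.

```python
from pathlib import Path

STAR_LOG_SUFFIX = "Log.final.out"


def find_star_logs(root="."):
    """Return (sample_name, path) pairs for STAR final logs found under root.

    Hypothetical helper for diagnosis only; it does not reproduce
    MultiQC's own search or sample-name cleaning logic.
    """
    found = []
    for path in sorted(Path(root).rglob("*" + STAR_LOG_SUFFIX)):
        # Text before "Log.final.out" is the --outFileNamePrefix part, if any.
        prefix = path.name[: -len(STAR_LOG_SUFFIX)]
        # Assumption: with no prefix, use the containing directory's name.
        sample = prefix or path.parent.name
        found.append((sample, path))
    return found
```

An empty list from find_star_logs(".") would be consistent with the "No analysis results found" warning in the traceback above, at least as far as STAR is concerned.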

IanSudbery commented 5 years ago

The particular example here is BWA (which is probably the mapper we use the most - we do most RNAseq with salmon these days).

We used to actually calculate the mapping rate rather than rely on logs.
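
As a rough illustration of calculating these numbers directly rather than parsing mapper logs, the sketch below counts records in SAM text (for example, the output of samtools view on a BAM file). It is not the code the pipeline used to run. The function name mapping_stats and the definitions are assumptions: only primary records are counted (secondary and supplementary alignments are skipped), each mate of a pair counts as one record, and a record is called spliced if its CIGAR string contains an N operation.

```python
# SAM flag bits, as defined in the SAM specification.
UNMAPPED = 0x4
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def mapping_stats(sam_lines):
    """Summarise mapping and splicing rates from an iterable of SAM text lines."""
    total = mapped = spliced = 0
    for line in sam_lines:
        # Skip blank lines and header lines (header lines start with "@").
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        flag, cigar = int(fields[1]), fields[5]
        # Count each read once: ignore secondary and supplementary records.
        if flag & (SECONDARY | SUPPLEMENTARY):
            continue
        total += 1
        if flag & UNMAPPED:
            continue
        mapped += 1
        # An N in the CIGAR marks a skipped reference region (e.g. an intron).
        if "N" in cigar:
            spliced += 1

    def percent(part, whole):
        return 100.0 * part / whole if whole else 0.0

    return {
        "total": total,
        "mapped": mapped,
        "spliced": spliced,
        "pct_mapped": percent(mapped, total),
        "pct_spliced": percent(spliced, mapped),  # as a share of mapped records
    }
```

Because this works from the alignments themselves, it applies equally to BWA output, which has no log for MultiQC to parse.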