Separate stats.yaml vs unified results.yaml

donaldcampbelljr commented 1 year ago

Related to this issue: I am unable to resolve pipestat's `results.yaml` in the looper config file unless I enter it as an absolute path _after_ it has been created by pypiper.

This is because pypiper will create a stats.yaml results file if pipestat_results_file is not given to pypiper as an input parameter.

The current workaround is to add the actual path after the pipeline has run, so that looper report and looper link can use it.

name: PEPATAC_tutorial
pep_config: tutorial_refgenie_project_config.yaml

output_dir: "${TUTORIAL}/processed/"
pipeline_interfaces:
  sample: ["${TUTORIAL}/tools/pepatac/sample_pipeline_interface.yaml"]
  project: ["${TUTORIAL}/tools/pepatac/project_pipeline_interface.yaml"]

pipestat:
  results_file_path: /home/drc/pepatac_tutorial/tools/pepatac/examples/tutorial/home/drc/pepatac_tutorial/processed/results_pipeline/tutorial1/stats.yaml
  #results_file_path: "${TUTORIAL}/processed/results_pipeline/{sample_name}/stats.yaml" # This does not work

Originally posted by @donaldcampbelljr in https://github.com/databio/pepatac/issues/256#issuecomment-1819628165
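
One way to see the mismatch with the commented-out line: if the `{sample_name}` placeholder were filled in once per sample, the template would describe one file per sample rather than a single results file. The sketch below uses only the Python standard library. `expand_template` and the `/data/tutorial` value are hypothetical, and this is not a description of how looper resolves paths.

```python
from string import Template


def expand_template(template: str, sample_name: str, env: dict) -> str:
    """Fill ${VAR} placeholders from env, then {sample_name} (hypothetical helper)."""
    # safe_substitute leaves anything it does not recognise untouched,
    # so the {sample_name} placeholder survives this first pass.
    with_env = Template(template).safe_substitute(env)
    return with_env.format(sample_name=sample_name)


TEMPLATE = "${TUTORIAL}/processed/results_pipeline/{sample_name}/stats.yaml"
ENV = {"TUTORIAL": "/data/tutorial"}  # assumed value, for illustration only

paths = [expand_template(TEMPLATE, name, ENV) for name in ("tutorial1", "tutorial2")]
for path in paths:
    print(path)
```

The two printed paths differ, which is why a per-sample template cannot serve as the single `results_file_path` entry.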

donaldcampbelljr commented 1 year ago

Currently, per the tutorial, pepatac will create a separate stats.yaml for each of the input samples.

results_pipeline
  |__Tutorial1
       |___stats.yaml
  |__Tutorial2
       |___stats.yaml

This is a problem when configuring pipestat in the looper config file, which is required for looper report and looper link, because the looper config currently accepts only one pipestat results file.

Spawning separate stats files is default pypiper behavior that can be overridden using the pipestat_results_file parameter.

This allows specifying a single results file for the pipeline output:

PEPATAC:
  project: {}
  sample:
    tutorial1:
      File_mb: 27
      pipestat_created_time: '2023-11-20 16:56:32'
      pipestat_modified_time: '2023-11-20 16:56:44'
      Read_type: paired
      Genome: hg38
      Raw_reads: '1000000'
      Fastq_reads: 1000000
      Trimmed_reads: 1000000
      FastQC report r1:
        path: /home/drc/pepatac_tutorial/tools/pepatac/examples/tutorial/home/drc/pepatac_tutorial/processed/results_pipeline/tutorial1/fastq/tutorial1_R1_trim_fastqc.html
        thumbnail_path: null
        title: FastQC report r1
        annotation: PEPATAC
      FastQC report r2:
        path: /home/drc/pepatac_tutorial/tools/pepatac/examples/tutorial/home/drc/pepatac_tutorial/processed/results_pipeline/tutorial1/fastq/tutorial1_R2_trim_fastqc.html
        thumbnail_path: null
        title: FastQC report r2
        annotation: PEPATAC
      Aligned_reads_rCRSd: 99360.0
      Alignment_rate_rCRSd: 9.94
    tutorial2:
      File_mb: 27
      pipestat_created_time: '2023-11-20 16:58:02'
      pipestat_modified_time: '2023-11-20 16:58:12'
      Read_type: paired
      Genome: hg38
      Raw_reads: '1000000'
      Fastq_reads: 1000000
      Trimmed_reads: 1000000
      FastQC report r1:
        path: /home/drc/pepatac_tutorial/tools/pepatac/examples/tutorial/home/drc/pepatac_tutorial/processed/results_pipeline/tutorial2/fastq/tutorial2_R1_trim_fastqc.html
        thumbnail_path: null
        title: FastQC report r1
        annotation: PEPATAC
      FastQC report r2:
        path: /home/drc/pepatac_tutorial/tools/pepatac/examples/tutorial/home/drc/pepatac_tutorial/processed/results_pipeline/tutorial2/fastq/tutorial2_R2_trim_fastqc.html
        thumbnail_path: null
        title: FastQC report r2
        annotation: PEPATAC
      Aligned_reads_rCRSd: 100556.0
      Alignment_rate_rCRSd: 10.06

This works well until the pipeline tries to retrieve a stat via pm.get_stat. When it tries to retrieve a result from a file that contains more than one sample, it fails:

Missing stat 'Raw_reads'
Traceback (most recent call last):
  File "/home/drc/pepatac_tutorial//tools/pepatac/pipelines/pepatac.py", line 2784, in <module>
    sys.exit(main())
  File "/home/drc/pepatac_tutorial//tools/pepatac/pipelines/pepatac.py", line 1117, in main
    pm.run([cmd, cmd2], rmdup_bam, follow=check_alignment_genome)
  File "/home/drc/GITHUB/pepatac/pepatac/venv/lib/python3.10/site-packages/pypiper/manager.py", line 1093, in run
    call_follow()
  File "/home/drc/GITHUB/pepatac/pepatac/venv/lib/python3.10/site-packages/pypiper/manager.py", line 947, in call_follow
    follow()
  File "/home/drc/pepatac_tutorial//tools/pepatac/pipelines/pepatac.py", line 1106, in check_alignment_genome
    rr = float(pm.get_stat("Raw_reads"))
TypeError: float() argument must be a string or a real number, not 'NoneType'

I believe the solution is to have pypiper instead use pipestat's retrieve_one. Perhaps get_stat can be a wrapper for this.
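
To make the idea concrete, here is a minimal sketch using plain dictionaries that mirror the unified file above. It is not pypiper's or pipestat's code: `flat_get_stat` is an assumed stand-in for a lookup that does not know which sample it is working on, and `record_get_stat` is a hypothetical record-aware lookup in the spirit of retrieve_one.

```python
from typing import Any, Optional

# Unified results, trimmed from the listing above.
RESULTS = {
    "PEPATAC": {
        "project": {},
        "sample": {
            "tutorial1": {"Raw_reads": "1000000", "Aligned_reads_rCRSd": 99360.0},
            "tutorial2": {"Raw_reads": "1000000", "Aligned_reads_rCRSd": 100556.0},
        },
    }
}


def flat_get_stat(results: dict, key: str) -> Optional[Any]:
    """Look for the key at the top level only (assumed failure mode)."""
    return results.get(key)


def record_get_stat(results: dict, pipeline: str, record: str, key: str) -> Optional[Any]:
    """Look the key up under one named sample record (hypothetical helper)."""
    samples = results.get(pipeline, {}).get("sample", {})
    return samples.get(record, {}).get(key)


# The flat lookup finds nothing, so a later float() call would raise TypeError.
print(flat_get_stat(RESULTS, "Raw_reads"))

# Naming the record returns the stored value, which float() accepts.
print(float(record_get_stat(RESULTS, "PEPATAC", "tutorial1", "Raw_reads")))
```

If get_stat wrapped a record-aware lookup like this, the sample name would have to be available to the pipeline manager at call time.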

donaldcampbelljr commented 12 months ago

Solution was implemented in pypiper: https://github.com/databio/pypiper/issues/202#issuecomment-1828469016

Relatedly, pipestat was modified to create any missing subdirectories when the file at results_file_path is created: https://github.com/pepkit/pipestat/commit/76d79d915ab90ab763c58d9ac74c80dcdfb0d74d
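
For readers unfamiliar with the pattern, creating the parent directories before writing a results file can be sketched with the standard library as below. This is a generic illustration, not the code from the linked commit; `ensure_parent_dir` is a hypothetical name.

```python
import os
import tempfile


def ensure_parent_dir(file_path: str) -> str:
    """Create the parent directories of file_path if they are missing."""
    parent = os.path.dirname(os.path.abspath(file_path))
    # exist_ok=True makes the call safe to repeat for every sample.
    os.makedirs(parent, exist_ok=True)
    return parent


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "results_pipeline", "tutorial1", "stats.yaml")
    created = ensure_parent_dir(target)
    parent_exists = os.path.isdir(created)  # directories now exist
    file_exists = os.path.exists(target)    # the file itself is not created
    print(parent_exists, file_exists)
```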

Separate stats.yaml vs unified results.yaml #257