databio / pypiper

Python toolkit for building restartable pipelines
http://pypiper.databio.org
BSD 2-Clause "Simplified" License

get_stat should use pipestat's retrieve_one #202

Closed donaldcampbelljr closed 9 months ago

donaldcampbelljr commented 9 months ago
Currently, per the tutorial, pepatac creates a separate `stats.yaml` for each input sample:
results_pipeline
  |__Tutorial1
       |___stats.yaml
  |__Tutorial2
       |___stats.yaml

This is a problem when using pipestat in the looper_config file, which is necessary for looper report and looper link, because the looper config currently accepts only one pipestat results file.

Spawning separate stats files is the default pypiper behavior; it can be overridden with the pipestat_results_file parameter.

This allows specifying a single results file for the pipeline output:

PEPATAC:
  project: {}
  sample:
    tutorial1:
      File_mb: 27
      pipestat_created_time: '2023-11-20 16:56:32'
      pipestat_modified_time: '2023-11-20 16:56:44'
      Read_type: paired
      Genome: hg38
      Raw_reads: '1000000'
      Fastq_reads: 1000000
      Trimmed_reads: 1000000
      FastQC report r1:
        path: /home/drc/pepatac_tutorial/tools/pepatac/examples/tutorial/home/drc/pepatac_tutorial/processed/results_pipeline/tutorial1/fastq/tutorial1_R1_trim_fastqc.html
        thumbnail_path: null
        title: FastQC report r1
        annotation: PEPATAC
      FastQC report r2:
        path: /home/drc/pepatac_tutorial/tools/pepatac/examples/tutorial/home/drc/pepatac_tutorial/processed/results_pipeline/tutorial1/fastq/tutorial1_R2_trim_fastqc.html
        thumbnail_path: null
        title: FastQC report r2
        annotation: PEPATAC
      Aligned_reads_rCRSd: 99360.0
      Alignment_rate_rCRSd: 9.94
    tutorial2:
      File_mb: 27
      pipestat_created_time: '2023-11-20 16:58:02'
      pipestat_modified_time: '2023-11-20 16:58:12'
      Read_type: paired
      Genome: hg38
      Raw_reads: '1000000'
      Fastq_reads: 1000000
      Trimmed_reads: 1000000
      FastQC report r1:
        path: /home/drc/pepatac_tutorial/tools/pepatac/examples/tutorial/home/drc/pepatac_tutorial/processed/results_pipeline/tutorial2/fastq/tutorial2_R1_trim_fastqc.html
        thumbnail_path: null
        title: FastQC report r1
        annotation: PEPATAC
      FastQC report r2:
        path: /home/drc/pepatac_tutorial/tools/pepatac/examples/tutorial/home/drc/pepatac_tutorial/processed/results_pipeline/tutorial2/fastq/tutorial2_R2_trim_fastqc.html
        thumbnail_path: null
        title: FastQC report r2
        annotation: PEPATAC
      Aligned_reads_rCRSd: 100556.0
      Alignment_rate_rCRSd: 10.06

This works well until the pipeline attempts to retrieve a stat via pm.get_stat. When it tries to retrieve a result from a file that contains more than one sample, it raises an error:

Missing stat 'Raw_reads'
Traceback (most recent call last):
  File "/home/drc/pepatac_tutorial//tools/pepatac/pipelines/pepatac.py", line 2784, in <module>
    sys.exit(main())
  File "/home/drc/pepatac_tutorial//tools/pepatac/pipelines/pepatac.py", line 1117, in main
    pm.run([cmd, cmd2], rmdup_bam, follow=check_alignment_genome)
  File "/home/drc/GITHUB/pepatac/pepatac/venv/lib/python3.10/site-packages/pypiper/manager.py", line 1093, in run
    call_follow()
  File "/home/drc/GITHUB/pepatac/pepatac/venv/lib/python3.10/site-packages/pypiper/manager.py", line 947, in call_follow
    follow()
  File "/home/drc/pepatac_tutorial//tools/pepatac/pipelines/pepatac.py", line 1106, in check_alignment_genome
    rr = float(pm.get_stat("Raw_reads"))
TypeError: float() argument must be a string or a real number, not 'NoneType'

I believe the solution is to have pypiper instead use pipestat's retrieve_one. Perhaps get_stat can be a wrapper for this.
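As a rough illustration of the wrapper idea, here is a minimal sketch. `FakePipestatManager` is a hypothetical stand-in, not pipestat's real class, and its `retrieve_one(record_identifier, result_identifier)` signature and `KeyError` behavior are assumptions about the real method. The `get_stat` function is illustrative, not pypiper's actual code.

```python
from typing import Any, Dict, Optional


class FakePipestatManager:
    """Hypothetical stand-in for a pipestat manager backed by one shared results file."""

    def __init__(self, records: Dict[str, Dict[str, Any]], record_identifier: str):
        self.records = records
        self.record_identifier = record_identifier

    def retrieve_one(self, record_identifier: str, result_identifier: str) -> Any:
        # Assumed behavior: raise KeyError when the record or the result is absent.
        return self.records[record_identifier][result_identifier]


def get_stat(psm: FakePipestatManager, key: str) -> Optional[Any]:
    """Look up one result for the manager's own record, regardless of file order."""
    try:
        return psm.retrieve_one(
            record_identifier=psm.record_identifier, result_identifier=key
        )
    except KeyError:
        print(f"Missing stat '{key}'")
        return None


# Two samples share one results store, as in the single-file layout above.
shared_results = {
    "tutorial1": {"Raw_reads": "1000000", "Aligned_reads_rCRSd": 99360.0},
    "tutorial2": {"Raw_reads": "1000000", "Aligned_reads_rCRSd": 100556.0},
}

psm = FakePipestatManager(shared_results, record_identifier="tutorial2")
raw_reads = float(get_stat(psm, "Raw_reads"))
aligned = get_stat(psm, "Aligned_reads_rCRSd")
```

Because the lookup is keyed on the manager's record_identifier, the second sample resolves its own values and `float(...)` no longer receives `None` for a stat that exists.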

Originally posted by @donaldcampbelljr in https://github.com/databio/pepatac/issues/257#issuecomment-1819897535

donaldcampbelljr commented 9 months ago

Actually, I was able to modify get_stat so that it no longer checks only the first sample in the results file against the pipestat manager's record_identifier: 2de4e842d6ba555cc1f18473403c04085dc586b2
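The actual change lives in the commit referenced above. The sketch below only illustrates the kind of difference described, using a plain dict shaped like the single results file in the original report. Both function names are hypothetical and do not come from pypiper.

```python
from typing import Any, Dict, Optional

# Trimmed-down copy of the single results file structure from the report.
results: Dict[str, Any] = {
    "PEPATAC": {
        "project": {},
        "sample": {
            "tutorial1": {"Raw_reads": "1000000", "Alignment_rate_rCRSd": 9.94},
            "tutorial2": {"Raw_reads": "1000000", "Alignment_rate_rCRSd": 10.06},
        },
    }
}


def get_stat_first_only(
    results: Dict[str, Any], pipeline: str, record_identifier: str, key: str
) -> Optional[Any]:
    """Problematic pattern: only the first sample entry is compared to the record."""
    samples = results[pipeline]["sample"]
    first_id = next(iter(samples), None)
    if first_id == record_identifier:
        return samples[first_id].get(key)
    return None  # every sample after the first looks like it has no stats


def get_stat_by_record(
    results: Dict[str, Any], pipeline: str, record_identifier: str, key: str
) -> Optional[Any]:
    """Adjusted pattern: find the entry matching record_identifier, wherever it sits."""
    samples = results[pipeline]["sample"]
    return samples.get(record_identifier, {}).get(key)


missing = get_stat_first_only(results, "PEPATAC", "tutorial2", "Raw_reads")
found = get_stat_by_record(results, "PEPATAC", "tutorial2", "Raw_reads")
```

With the first-entry check, `tutorial2` gets `None` back even though its stats are present in the file, which is consistent with the `float()` TypeError in the traceback.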