databio / pepatac

A modular, containerized pipeline for ATAC-seq data processing
BSD 2-Clause "Simplified" License

question about setting number of cores #194

Closed pegahtak closed 3 years ago

pegahtak commented 3 years ago

Hello, and thank you for the great work! I am trying to run the pipeline on my dataset with the following config file:

name: Neutrophil
pep_version: 2.0.0
sample_table: Neut_annotation.csv
subsample_table: sub_Neut.csv

looper:
  output_dir: "pepatac_Neutrophil/"
  pipeline_interfaces: "../../project_pipeline_interface.yaml"

sample_modifiers:
  append:
    pipeline_interfaces: "../../sample_pipeline_interface.yaml"
  derive:
    attributes: [read1, read2]
    sources:
      R1: "Neutrophil/{sample_name}/{technical_replicate}_1.fastq.gz"
      R2: "Neutrophil/{sample_name}/{technical_replicate}_2.fastq.gz"
  imply:
    - if:
        organism: ["human", "Homo sapiens", "Human", "Homo_sapiens"]
      then:
        genome: hg38
        single_or_paired: paired
        cores: 5
        aligner: bowtie2
        peak_caller: macs2
        trimmer: skewer
        prealignments: rCRSd human_repeats
        deduplicator: picard
        blacklist: "/media/mehrmohammadi_hdd/taklifi/hg38_encode_exclusion.bed"
        peak_type: fixed
        extend: "250"
        firp_ref_peaks: "/media/kavousi/eaf2d15a-4cb1-4dee-ade8-6954bdc813e1/Taklifi/Open_Panel/TCGA_peaks.bed"

As you can see, I tried to use only 5 cores; however, the pipeline is using 32 cores:

*           `TSS_name`:  `None`
*            `aligner`:  `bowtie2`
*          `anno_name`:  `None`
*          `blacklist`:  `/media/mehrmohammadi_hdd/taklifi/hg38_encode_exclusion.bed`
*        `config_file`:  `pepatac.yaml`
*              `cores`:  `32`
*       `deduplicator`:  `picard`
*              `dirty`:  `False`
*             `extend`:  `250`
*       `force_follow`:  `False`
*     `frip_ref_peaks`:  `None`
*    `genome_assembly`:  `hg38`
*        `genome_size`:  `2.7e9`
*              `input`:  `['Neutrophil/N1/SRR11909926_1.fastq.gz', 'Neutrophil/N1/SRR11909927_1.fastq.gz']`
*             `input2`:  `['Neutrophil/N1/SRR11909926_2.fastq.gz', 'Neutrophil/N1/SRR11909927_2.fastq.gz']`
*               `keep`:  `False`
*               `lite`:  `False`
*             `logdev`:  `False`
*                `mem`:  `24000`
*              `motif`:  `False`
*          `new_start`:  `False`
*            `no_fifo`:  `False`
*           `no_scale`:  `False`
*      `output_parent`:  `pepatac_Neutrophil/results_pipeline`
*         `paired_end`:  `True`
*        `peak_caller`:  `macs2`
*          `peak_type`:  `fixed`
*      `prealignments`:  `['rCRSd', 'human_repeats']`
*         `prioritize`:  `False`
*            `recover`:  `False`
*        `sample_name`:  `N1`
*             `silent`:  `False`
*   `single_or_paired`:  `paired`
*             `skipqc`:  `False`
*                `sob`:  `False`
*           `testmode`:  `False`
*            `trimmer`:  `skewer`
*          `verbosity`:  `None`

How can I set the number of cores the pipeline uses? Thank you.

PEPATAC_N1.log

nsheff commented 3 years ago

I believe cores has to be set via the compute namespace in looper, rather than as a sample attribute, because it's a looper parameter.

Can you try using `looper run ... --compute cores=5`, or adding this to your project config:

    cores: 5

Since you already have a looper section in the config, it would look like this:

  output_dir: "pepatac_Neutrophil/"
  pipeline_interfaces: "../../project_pipeline_interface.yaml"
    cores: 5

pegahtak commented 3 years ago

@nsheff thank you for your response. When I added

    cores: 5

to my config file, I got the following error:

Looper version: 1.3.0
Command: run
Traceback (most recent call last):
  File "/home/ptaklifi/.local/bin/looper", line 8, in <module>
  File "/home/ptaklifi/.local/lib/python3.6/site-packages/looper/", line 742, in main
    compute_kwargs = _proc_resources_spec(args)
  File "/home/ptaklifi/.local/lib/python3.6/site-packages/looper/", line 653, in _proc_resources_spec
    "Correct format: " + EXAMPLE_COMPUTE_SPEC_FMT)
ValueError: Could not correctly parse itemized compute specification. Correct format: k1=v1 k2=v2

But with `looper run ... --compute cores=5`, the pipeline works correctly with the specified number of cores.

Thank you.

nsheff commented 3 years ago

Looks like the correct syntax for the config file is this:

    - "cores=5"
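
To illustrate why a `key=value` string is accepted where a `cores: 5` mapping is not, here is a minimal Python sketch of a parser for the `k1=v1 k2=v2` format named in the error message above. The function `parse_compute_spec` is hypothetical and is not looper's actual code; it only mirrors the format the error message describes.

```python
def parse_compute_spec(items):
    """Turn items such as ["cores=5", "mem=8000"] into a dict of settings.

    Hypothetical helper for illustration only; not part of looper.
    """
    if isinstance(items, str):
        # Allow a single space-separated string like "cores=5 mem=8000".
        items = items.split()
    settings = {}
    for item in items:
        key, sep, value = str(item).partition("=")
        if not sep or not key or not value:
            raise ValueError(
                "Could not parse compute specification. "
                "Correct format: k1=v1 k2=v2 (got %r)" % (item,)
            )
        settings[key] = value
    return settings


print(parse_compute_spec(["cores=5"]))  # {'cores': '5'}
```

Under this sketch, passing a mapping such as `{"cores": 5}` fails because iterating over it yields the bare key `cores`, which contains no `=`.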

A better way to do this is to edit the resources-sample.tsv file, which adjusts the cores based on input file size.

You should find that file in your pepatac folder; that's where the 32 comes from. You must have a very large input file. In that file you can adjust the cores for different file sizes.
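
As a rough illustration of how a size-based resources table can end up selecting 32 cores, here is a Python sketch of the lookup. The column names (`max_file_size`, `cores`, `mem`), the use of `inf` for the last row, and all values are assumptions for illustration; check the actual resources-sample.tsv in your pepatac folder for its real layout and numbers.

```python
import csv
import io

# Hypothetical table: sizes in GB, values illustrative only.
EXAMPLE_TSV = (
    "max_file_size\tcores\tmem\n"
    "1\t4\t8000\n"
    "10\t16\t16000\n"
    "inf\t32\t24000\n"
)


def cores_for_size(tsv_text, input_size_gb):
    """Return the cores of the first row whose max_file_size covers the input."""
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    # Sort ascending so the smallest sufficient size class wins.
    rows.sort(key=lambda row: float(row["max_file_size"]))
    for row in rows:
        if input_size_gb <= float(row["max_file_size"]):
            return int(row["cores"])
    # Fall back to the largest size class if nothing matched.
    return int(rows[-1]["cores"])


print(cores_for_size(EXAMPLE_TSV, 0.5))  # 4
print(cores_for_size(EXAMPLE_TSV, 50))   # 32
```

With a table like this, lowering the `cores` value in the row for large files would cap what the pipeline requests for large inputs.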

pegahtak commented 3 years ago

Thank you, @nsheff. This works correctly.