databio / pepatac

A modular, containerized pipeline for ATAC-seq data processing
http://pepatac.databio.org
BSD 2-Clause "Simplified" License

question about setting number of cores #194

Closed pegahtak closed 3 years ago

pegahtak commented 3 years ago

Hello, and thank you for the great work! I am trying to run the pipeline on my dataset with the following config file:

name: Neutrophil
pep_version: 2.0.0
sample_table: Neut_annotation.csv
subsample_table: sub_Neut.csv

looper:
  output_dir: "pepatac_Neutrophil/"
  pipeline_interfaces: "../../project_pipeline_interface.yaml"

sample_modifiers:
  append:
    pipeline_interfaces: "../../sample_pipeline_interface.yaml"
  derive:
    attributes: [read1, read2]
    sources:
      R1: "Neutrophil/{sample_name}/{technical_replicate}_1.fastq.gz"
      R2: "Neutrophil/{sample_name}/{technical_replicate}_2.fastq.gz"
  imply:
    - if:
        organism: ["human", "Homo sapiens", "Human", "Homo_sapiens"]
      then:
        genome: hg38
        single_or_paired: paired
        cores: 5
        aligner: bowtie2
        peak_caller: macs2
        trimmer: skewer
        prealignments: rCRSd human_repeats
        deduplicator: picard
        blacklist: "/media/mehrmohammadi_hdd/taklifi/hg38_encode_exclusion.bed"
        peak_type: fixed
        extend: "250"
        firp_ref_peaks: "/media/kavousi/eaf2d15a-4cb1-4dee-ade8-6954bdc813e1/Taklifi/Open_Panel/TCGA_peaks.bed"

As you can see, I tried to use only 5 cores via the `cores` attribute; however, the pipeline is using 32 cores:


*           `TSS_name`:  `None`
*            `aligner`:  `bowtie2`
*          `anno_name`:  `None`
*          `blacklist`:  `/media/mehrmohammadi_hdd/taklifi/hg38_encode_exclusion.bed`
*        `config_file`:  `pepatac.yaml`
*              `cores`:  `32`
*       `deduplicator`:  `picard`
*              `dirty`:  `False`
*             `extend`:  `250`
*       `force_follow`:  `False`
*     `frip_ref_peaks`:  `None`
*    `genome_assembly`:  `hg38`
*        `genome_size`:  `2.7e9`
*              `input`:  `['Neutrophil/N1/SRR11909926_1.fastq.gz', 'Neutrophil/N1/SRR11909927_1.fastq.gz']`
*             `input2`:  `['Neutrophil/N1/SRR11909926_2.fastq.gz', 'Neutrophil/N1/SRR11909927_2.fastq.gz']`
*               `keep`:  `False`
*               `lite`:  `False`
*             `logdev`:  `False`
*                `mem`:  `24000`
*              `motif`:  `False`
*          `new_start`:  `False`
*            `no_fifo`:  `False`
*           `no_scale`:  `False`
*      `output_parent`:  `pepatac_Neutrophil/results_pipeline`
*         `paired_end`:  `True`
*        `peak_caller`:  `macs2`
*          `peak_type`:  `fixed`
*      `prealignments`:  `['rCRSd', 'human_repeats']`
*         `prioritize`:  `False`
*            `recover`:  `False`
*        `sample_name`:  `N1`
*             `silent`:  `False`
*   `single_or_paired`:  `paired`
*             `skipqc`:  `False`
*                `sob`:  `False`
*           `testmode`:  `False`
*            `trimmer`:  `skewer`
*          `verbosity`:  `None`

How can I set the number of cores I want the pipeline to use? Thank you. (Attached log: PEPATAC_N1.log)

nsheff commented 3 years ago

I believe `cores` has to be set via the compute namespace in looper, rather than as a sample attribute, because it's a looper parameter.

Can you try using `looper run ... --compute cores=5`, or adding this to your project config:

looper:
  compute:
    cores: 5

Since you already have a looper section in the config, it would look like this:

looper:
  output_dir: "pepatac_Neutrophil/"
  pipeline_interfaces: "../../project_pipeline_interface.yaml"
  compute:
    cores: 5

pegahtak commented 3 years ago

@nsheff Thank you for your response. When I added

  compute:
    cores: 5

to the looper section of my config file, I got the following error:

Looper version: 1.3.0
Command: run
Traceback (most recent call last):
  File "/home/ptaklifi/.local/bin/looper", line 8, in <module>
    sys.exit(main())
  File "/home/ptaklifi/.local/lib/python3.6/site-packages/looper/looper.py", line 742, in main
    compute_kwargs = _proc_resources_spec(args)
  File "/home/ptaklifi/.local/lib/python3.6/site-packages/looper/looper.py", line 653, in _proc_resources_spec
    "Correct format: " + EXAMPLE_COMPUTE_SPEC_FMT)
ValueError: Could not correctly parse itemized compute specification. Correct format: k1=v1 k2=v2

However, with `looper run ... --compute cores=5` the pipeline works correctly and uses the specified number of cores.

Thank you.

nsheff commented 3 years ago

Looks like the correct syntax for the config file is this:

  compute:
    - "cores=5"
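
To illustrate why the list-of-strings form parses while the mapping form does not, here is a minimal sketch of a `k1=v1 k2=v2` parser. This is not looper's actual code: `parse_compute_spec` is a hypothetical helper, written only on the assumption (suggested by the error message above) that each compute entry is expected to be a `key=value` string.

```python
def parse_compute_spec(items):
    """Parse items like ["cores=5", "mem=8000"] into a dict of strings.

    Hypothetical helper for illustration; not part of looper.
    """
    settings = {}
    for item in items:
        # partition() returns (key, "=", value); sep is "" when "=" is absent
        key, sep, value = str(item).partition("=")
        if not sep or not key or not value:
            raise ValueError(
                "Could not parse compute specification item "
                "{!r}. Correct format: k1=v1 k2=v2".format(item)
            )
        settings[key] = value
    return settings


# List form, as in `compute: ["cores=5"]`: every item contains "="
print(parse_compute_spec(["cores=5"]))

# Mapping form, as in `compute: {cores: 5}`: iterating a dict yields
# only its keys, so the item "cores" has no "=" and parsing fails
try:
    parse_compute_spec({"cores": 5})
except ValueError as err:
    print("rejected:", err)
```

Under that assumption, the YAML mapping never reaches the parser as `cores=5`, which would be consistent with the ValueError reported above.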

A better way to do this is to edit the resources-sample.tsv file, which sets the cores based on input file size.

https://github.com/databio/pepatac/blob/master/resources-sample.tsv

You should find that file in your pepatac folder. That's where the 32 comes from; you must have a very large input file. There you can adjust the cores for different file sizes.
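
As a rough sketch of how a size-based resources table can map an input file to a core count, the example below picks the first row whose size limit covers the file. Everything here is an assumption for illustration: the column names (`max_file_size`, `cores`, `mem`), the `NaN` catch-all row, the GB unit, and all table values are hypothetical and are not copied from the real resources-sample.tsv; `load_packages` and `select_package` are made-up helper names.

```python
import csv
import io
import math

# Hypothetical table contents; check your own resources-sample.tsv
EXAMPLE_TSV = (
    "max_file_size\tcores\tmem\n"
    "1\t4\t8000\n"
    "10\t8\t16000\n"
    "NaN\t32\t24000\n"
)


def load_packages(tsv_text):
    """Read resource rows and sort them by ascending size limit."""
    packages = []
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        limit = float(row["max_file_size"])
        if math.isnan(limit):
            # Assumption: NaN marks the catch-all row with no upper limit
            limit = math.inf
        packages.append(
            {"max_file_size": limit,
             "cores": int(row["cores"]),
             "mem": int(row["mem"])}
        )
    return sorted(packages, key=lambda p: p["max_file_size"])


def select_package(packages, file_size_gb):
    """Return the first package whose limit is >= the input size."""
    for package in packages:
        if file_size_gb <= package["max_file_size"]:
            return package
    raise ValueError("No resource package covers this file size")


packages = load_packages(EXAMPLE_TSV)
small = select_package(packages, 0.5)
large = select_package(packages, 25)
print(small["cores"], large["cores"])
```

With a table shaped like this, lowering the `cores` value in the last row would cap what very large inputs request.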

pegahtak commented 3 years ago

Thank you, @nsheff. This works correctly.