CCBR / RENEE

A comprehensive quality-control and quantification RNA-seq pipeline
https://CCBR.github.io/RENEE/
MIT License

ResourceWarning in v2.6 #160

Open kelly-sovacool opened 2 months ago

kelly-sovacool commented 2 months ago

warning message with v2.6:

cd /data/CCBR_Pipeliner/Pipelines/RENEE
./v2.6/bin/renee --version
Python version: 3.11.3
/data/CCBR_Pipeliner/db/PipeDB/Conda/envs/py311/lib/python3.11/glob.py:176: ResourceWarning: unclosed <socket.socket fd=5, family=2, type=1, proto=0, laddr=('0.0.0.0', 0)>
  with contextlib.closing(_iterdir(dirname, dir_fd, dironly)) as it:
ResourceWarning: Enable tracemalloc to get the object allocation traceback
renee v2.6.0-dev

this doesn't happen with v2.5:

./v2.5/bin/renee --version
[+] Loading singularity  4.0.3  on cn2066 
renee v2.5.12

It seems a socket is opened but never closed? Possibly related: https://stackoverflow.com/a/61373209
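A minimal sketch of how this class of warning arises (hypothetical code, not RENEE's actual source): CPython emits a `ResourceWarning` when a socket is garbage-collected without being closed, and wrapping the socket in a `with` block guarantees `close()` is called.

```python
import gc
import socket
import warnings

def leak_socket():
    # Hypothetical: open a socket and let it fall out of scope unclosed.
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always", ResourceWarning)
    leak_socket()
    gc.collect()  # make collection (and thus the warning) deterministic
assert any(issubclass(w.category, ResourceWarning) for w in caught)

# The fix: socket.socket is a context manager, so close() is guaranteed.
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    pass  # s.close() runs automatically on exit
```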

kelly-sovacool commented 3 weeks ago

As of v2.6.1, this no longer occurs with `--version`, but it still does with `run`.

renee run \
    --input /data/CCBR_Pipeliner/Pipelines/RENEE/develop/.tests/*.R?.fastq.gz \
    --output /data/$USER/renee_test_rel-7 \
    --genome hg38_30 \
    --mode slurm \
    --sif-cache /data/CCBR_Pipeliner/SIFS
[+] Loading singularity  4.1.5  on cn4312
[+] Loading snakemake  7.32.4
Python version: 3.11.3
RENEE (v2.6.1)
Thank you for running RENEE on BIOWULF!
Generating config file in '/data/sovacoolkl/renee_test_rel-7/config.json'... Done!
/data/sovacoolkl/renee_test_rel-7/resources/runner slurm -j pl:renee -b /gpfs/gsfs10/users/CCBR_Pipeliner,/data/CCBR_Pipeliner,/gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/RENEE/develop/.tests,/data/sovacoolkl/renee_test_rel-7,/lscratch -o /data/sovacoolkl/renee_test_rel-7 -c /data/sovacoolkl/renee_test_rel-7/.singularity -t /lscratch/$SLURM_JOBID -n biowulf
Successfully submitted master job: 39979651
sys:1: ResourceWarning: unclosed <socket.socket fd=4, family=2, type=1, proto=0, laddr=('0.0.0.0', 0)>
ResourceWarning: Enable tracemalloc to get the object allocation traceback
kopardev commented 1 day ago

@kelly-sovacool it happens with help commands as well:

renee run --help
[+] Loading singularity  4.1.5  on cn4294
[+] Loading snakemake  7.32.4
Python version: 3.11.3
renee run:  Runs the data-processing and quality-control pipeline.

Synopsis:
  $ renee run [--help] \
                      [--small-rna] [--star-2-pass-basic] \
                      [--dry-run] [--mode {slurm, local}] \
                      [--shared-resources SHARED_RESOURCES] \
                      [--singularity-cache SINGULARITY_CACHE] \
                      [--sif-cache SIF_CACHE] \
                      [--tmp-dir TMP_DIR] \
                      [--wait] \
                      [--create-nidap-folder] \
                      [--threads THREADS] \
                      --input INPUT [INPUT ...] \
                      --output OUTPUT \
                      --genome {hg38_36, mm10_M21, custom.json, ...}

Description:
  To run the pipeline with your data, please provide a space-separated
list of FastQs (globbing is supported), an output directory to store results,
and a reference genome.

  Optional arguments are shown in square brackets above. Please visit our docs
at "https://CCBR.github.io/RENEE/" for more information, examples, and
guides.

Required arguments:
  --input INPUT [INPUT ...]
                        Input FastQ file(s) to process. One or more FastQ files
                        can be provided. The pipeline supports single-end and
                        paired-end RNA-seq data.
                          Example: --input .tests/*.R?.fastq.gz

  --output OUTPUT
                        Path to an output directory. This location is where
                        the pipeline will create all of its output files, also
                        known as the pipeline's working directory. If the
                        user-provided working directory has not been
                        initialized, it will be created automatically.
                          Example: --output /data/$USER/RNA_hg38

  --genome {hg38_36,mm10_M21,custom.json,...}
                        Reference genome. This option defines the reference
                        genome of the samples; the default is hg38_36 if not
                        specified. RENEE on Biowulf comes bundled with
                        pre-built reference files for human and mouse samples.
                        Run `renee --help` to view the current list of
                        pre-built genomes. A custom reference genome created
                        with the build sub command can also be provided. The
                        name of this custom reference JSON file depends on the
                        values provided to the renee build args '--ref-name
                        REF_NAME --gtf-ver GTF_VER': the output file uses the
                        naming convention '{REF_NAME}_{GTF_VER}.json'.
                          Example: --genome hg38_36

Analysis options:
  --small-rna           Uses ENCODE's recommendations for small RNA. This
                        option should be used with small RNA libraries. These
                        are rRNA-depleted libraries that have been size
                        selected to be shorter than 200bp. Size selection
                        enriches for small RNA species such as miRNAs, siRNAs,
                        or piRNAs. This option is only supported with single-
                        end data. This option should not be combined with the
                        star 2-pass basic option.
                          Example: --small-rna

  --star-2-pass-basic   Run STAR in per sample 2-pass mapping mode. It is
                        recommended to use this option when processing a set
                        of unrelated samples. It is not advised to use this
                        option for a study with multiple related samples. By
                        default, the pipeline utilizes a multi-sample 2-pass
                        mapping approach where the set of splice junctions
                        detected across all samples are provided to the second
                        pass of STAR. This option overrides the default
                        behavior so each sample will be processed in a per
                        sample two-pass basic mode. This option should not be
                        combined with the small RNA option.
                          Example: --star-2-pass-basic

Orchestration options:
  --dry-run             Does not execute anything. Only displays what steps in
                        the pipeline remain or will be run.
                          Example: --dry-run

  --mode {slurm,local}
                        Method of execution. Valid options for this mode
                        include: local or slurm (default: slurm). Additional
                        modes of execution are coming soon.
                        Here is a brief description of each mode:
                           • local: uses the local method of execution. Local
                        runs execute serially on the current compute instance.
                        This is useful for testing, debugging, or when a user
                        does not have access to a high performance computing
                        environment.
                           • slurm: uses the slurm execution backend. This
                        method submits jobs to a cluster using sbatch. Running
                        the pipeline in this mode is recommended as it will be
                        significantly faster.
                          Example: --mode slurm

  --shared-resources SHARED_RESOURCES
                        Local path to shared resources. The pipeline uses a set
                        of shared reference files that can be re-used across
                        reference genomes. These currently include reference files
                        for kraken and FQScreen. These reference files can be
                        downloaded with the build sub command's --shared-resources
                        option. These files only need to be downloaded once. If
                        you are running the pipeline on Biowulf, you do NOT need
                        to download these reference files! They already exist on
                        the filesystem in a location that anyone can access. If
                        you are running the pipeline on another cluster or target
                        system, you will need to download the shared resources
                        with the build sub command, and you will need to provide
                        this option to the run sub command every time. Please
                        provide the same path that was provided to the build sub
                        command's --shared-resources option.
                          Example: --shared-resources /data/shared/renee

  --singularity-cache SINGULARITY_CACHE
                        Overrides the $SINGULARITY_CACHEDIR variable. Images
                        from remote registries are cached locally on the file
                        system. By default, the singularity cache is set to:
                        '/path/to/output/directory/.singularity/'. Please note
                        that this cache cannot be shared across users.
                          Example: --singularity-cache /data/$USER

  --sif-cache SIF_CACHE
                        Path where a local cache of SIFs is stored. This cache
                        can be shared across users if permissions are properly
                        set up. If a SIF does not exist in the SIF cache, the
                        image will be pulled from Docker Hub. The renee cache
                        sub command can be used to create a local SIF cache.
                        Please see renee cache for more information.
                           Example: --sif-cache /data/$USER/sifs/

  --wait
                        Wait until the master job completes. This is required
                        if the job is submitted using the HPC API. If it is
                        not provided, the API may interpret submission of the
                        master job as completion of the pipeline!

  --create-nidap-folder
                        Create a folder called "NIDAP" with files to be moved
                        back to NIDAP. This makes it convenient to move only
                        this folder and its contents back to NIDAP, rather
                        than the entire pipeline output folder.

  --tmp-dir TMP_DIR
                        Path on the file system for writing temporary output
                        files. By default, the temporary directory is set to
                        '/lscratch/$SLURM_JOBID' on NIH's Biowulf cluster and
                        'OUTPUT' on the FRCE cluster.
                        However, if you are running the pipeline on another cluster,
                        this option will need to be specified.
                        Ideally, this path should point to a dedicated location on
                        the filesystem for writing tmp files.
                        On many systems, this location is
                        set to somewhere in /scratch. If you need to inject a
                        variable into this string that should NOT be expanded,
                        please quote this option's value in single quotes.
                          Example: --tmp-dir '/cluster_scratch/$USER/'
  --threads THREADS
                        Max number of threads for local processes. It is
                        recommended to set this value to the maximum number
                        of CPUs available on the host machine, default: 2.
                          Example: --threads 16

Misc Options:
  -h, --help            Show usage information, help message, and exit.
                          Example: --help

Example:
  # Step 1.) Grab an interactive node,
  # do not run on head node and add
  # required dependencies to $PATH
  srun -N 1 -n 1 --time=1:00:00 --mem=8gb  --cpus-per-task=2 --pty bash
  module purge
  module load singularity snakemake

  # Step 2A.) Dry run pipeline with provided test data
  ./renee run --input .tests/*.R?.fastq.gz \
                 --output /data/$USER/RNA_hg38 \
                 --genome hg38_36 \
                 --mode slurm \
                 --dry-run

  # Step 2B.) Run RENEE pipeline
  # The slurm mode will submit jobs to the cluster.
  # It is recommended running renee in this mode.
  ./renee run --input .tests/*.R?.fastq.gz \
                 --output /data/$USER/RNA_hg38 \
                 --genome hg38_36 \
                 --mode slurm

Ver:
  v2.6.2

Prebuilt genome+annotation combos:
  ['hg19_19', 'hg19_36', 'hg38_30', 'hg38_34', 'hg38_36', 'hg38_38', 'hg38_41', 'hg38_45', 'mm10_M21', 'mm10_M23', 'mm10_M25', 'mmul10_mmul10_108']
sys:1: ResourceWarning: unclosed <socket.socket fd=4, family=2, type=1, proto=0, laddr=('0.0.0.0', 0)>
ResourceWarning: Enable tracemalloc to get the object allocation traceback
kopardev commented 1 day ago

@kelly-sovacool can you try adding the following in Python to track down the source of this issue:

import tracemalloc
import warnings

warnings.simplefilter("default", ResourceWarning)
tracemalloc.start()
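For reference, a self-contained sketch of that suggestion (the socket here is a stand-in, not the one RENEE actually opens): once `tracemalloc` is running, `tracemalloc.get_object_traceback()` reports where a leaked object was allocated, which is exactly the traceback the `ResourceWarning` message says is missing.

```python
import socket
import tracemalloc
import warnings

# Show every ResourceWarning and record allocation tracebacks.
warnings.simplefilter("default", ResourceWarning)
tracemalloc.start()

def open_and_leak():
    # Stand-in for whatever code path in renee opens the socket.
    return socket.socket(socket.AF_INET, socket.SOCK_STREAM)

s = open_and_leak()
tb = tracemalloc.get_object_traceback(s)
print(tb)  # frames pointing at open_and_leak, i.e. the allocation site
s.close()
tracemalloc.stop()
```

With these two lines placed at the top of the entry point, the warning printed at exit should include an "Object allocated at" traceback identifying the line that opened the socket.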