gaolabtools / scNanoGPS

Single cell Nanopore sequencing data for Genotype and Phenotype
Other
34 stars 1 forks source link

bioconda recipe? #36

Open nick-youngblut opened 3 weeks ago

nick-youngblut commented 3 weeks ago

conda is leveraged for installing (some) dependencies (at least, the recommended install method), but conda is not fully leveraged via a bioconda recipe for installing scNanoGPS.

Are there plans on creating a bioconda recipe for scNanoGPS in order to simplify the install to just:

mamba install bioconda::scnanogps

A template created via ChatGPT:

package:
  name: scNanoGPS
  version: "1.0.0"

source:
  git_url: https://github.com/gaolabtools/scNanoGPS.git
  git_rev: main

build:
  number: 0
  script: "{{ PYTHON }} -m pip install . --no-deps --ignore-installed -vv"

requirements:
  build:
    - {{ compiler('c') }}
    - {{ compiler('cxx') }}
    - python
    - pip

  host:
    - python >=3.7
    - biopython >=1.79
    - distance >=0.1.3
    - matplotlib >=3.5.2
    - pandas >=1.4.2
    - pysam >=0.19.0
    - seaborn >=0.11.2
    - liqa

  run:
    - python >=3.7
    - biopython >=1.79
    - distance >=0.1.3
    - matplotlib >=3.5.2
    - pandas >=1.4.2
    - pysam >=0.19.0
    - seaborn >=0.11.2
    - liqa

test:
  imports:
    - scNanoGPS
  commands:
    - scNanoGPS --help

about:
  home: https://github.com/gaolabtools/scNanoGPS
  license: MIT
  summary: "A pipeline for single-cell nanopore long-read sequencing data analysis."
  description: "scNanoGPS is a pipeline designed for the analysis of single-cell nanopore long-read sequencing data, focusing on mapping, collapsing reads, and summarizing gene expression, isoforms, and SNVs."

extra:
  recipe-maintainers:
    - your-github-username
nick-youngblut commented 3 weeks ago

It would be helpful if the requirements.txt file listed supported versions.

As for Install other essential tools, I'm curious why those tools are not just listed in the conda create -n scNanoGPS command:

mamba create -n scNanoGPS python=3 numpy scipy \
  minimap2 samtools tabix spoa subread liqa longshot bcftools qualimap gffread

Can you please list the supported versions for each of these dependencies?

Note: there is a bioconda recipe for gffread.

Is ANNOVAR fully necessary, given that it is not (and likely will never be) available via conda? Maybe VEP, GEMINI, or VAAST could be used instead?

nick-youngblut commented 3 weeks ago

To be clear, you will likely have to convert the collection of scripts is this repo to a python package, given that you are currently just importing based on fixed, relative paths (e.g., scanner.py), but if you accept PRs, I can help with this.

A couple of other notes

Option parsing

At least with argparse, you don't have to redundantly set the default twice. For example, in:

parser.add_option("--tmp_dir",       dest = "tmp_dir",       nargs = 1, default = "tmp",
                  help = "Temporary folder name. "
                         "Default: tmp")

... with argparse, you don't need to set "Default: tmp" in the help text, since it will automatically be added.

Option parsing checks

At least with argparse, you can auto-check whether the file exists:

parser.add_argument('input_file', type=argparse.FileType('r'), help='Input file')

...so then no need for:

if not os.path.isfile(options.CB_list):

os.system

os.system is not a robust method of calling subprocess (example in your code.

An example of a more robust method using subprocess:

import subprocess

def run_subprocess(command, timeout=None):
    """
    Run a subprocess command and return the output.

    Args:
        command (list): The command to run as a list of strings.
        timeout (int, optional): The timeout for the command in seconds.

    Returns:
        str: The standard output of the command.

    Raises:
        subprocess.CalledProcessError: If the command exits with a non-zero status.
        subprocess.TimeoutExpired: If the command times out.
    """
    try:
        result = subprocess.run(
            command,
            check=True,  # Raise CalledProcessError for non-zero exit codes
            stdout=subprocess.PIPE,  # Capture standard output
            stderr=subprocess.PIPE,  # Capture standard error
            timeout=timeout,  # Timeout for the command
            text=True  # Return output as a string
        )
        return result.stdout
    except subprocess.CalledProcessError as e:
        print(f"Command '{e.cmd}' returned non-zero exit status {e.returncode}.")
        print(f"Standard output: {e.stdout}")
        print(f"Standard error: {e.stderr}")
        raise
    except subprocess.TimeoutExpired as e:
        print(f"Command '{e.cmd}' timed out after {timeout} seconds.")
        print(f"Standard output: {e.stdout}")
        print(f"Standard error: {e.stderr}")
        raise

run_scNanoGPS.sh

There is no || exit 1 or && \ at the end of each command, so all commands will run even if some fail.

It wouldn't be too hard to create a simple Nextflow pipeline to run the python scripts in a more robust manner. Let me know if you'd like some help.