10XGenomics / cellranger

10x Genomics Single Cell Analysis
https://www.10xgenomics.com/support/software/cell-ranger

Feature Request -- for Cellranger count -- specify output directory with `--dir` #82

Closed J-Moravec closed 10 months ago

J-Moravec commented 4 years ago

Currently, cellranger count uses --id both as the name of the output directory and as a prefix for some files (and possibly other things?).

So when running cellranger count --id=myid, I get:

/current_dir/myid/
/current_dir/myid/myid.mri.tgz
...

If I want to create myid somewhere other than the current directory, such as mydir/myid, I can't, because myid is also used as a prefix for files and this breaks paths. This means that cellranger count must be run from the directory where the output should be created. This is not a problem with other 10x tools, however, such as bamtofastq or cellranger mkref, where an output path can be provided.

An easy fix would be to add an option, --outdir. If --outdir is not specified, --id is used to create ./id/. If --outdir is specified, it is used instead of id, and id is used only as a prefix for files and for other things, but not as the name of the folder.
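
The requested behaviour can be sketched in a few lines. This is only an illustration of the proposal, not Cell Ranger code: `resolve_pipestance_dir` and `mri_archive` are hypothetical helper names, and the `.mri.tgz` file is taken from the listing above.

```python
from pathlib import Path
from typing import Optional


def resolve_pipestance_dir(run_id: str, outdir: Optional[str] = None) -> Path:
    """Return the directory a run would write to under the proposal.

    Hypothetical helper: without outdir the folder is ./<id>/ as today;
    with outdir the folder is outdir and the id is only a file prefix.
    """
    if outdir is None:
        return Path(".") / run_id
    return Path(outdir)


def mri_archive(run_id: str, outdir: Optional[str] = None) -> Path:
    # The id remains the file prefix in both cases (e.g. myid.mri.tgz).
    return resolve_pipestance_dir(run_id, outdir) / f"{run_id}.mri.tgz"
```

With these assumptions, `mri_archive("myid")` gives `myid/myid.mri.tgz`, while `mri_archive("myid", "project/results/sample1/count")` gives `project/results/sample1/count/myid.mri.tgz`.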

Use case: I am remapping some older BAM files with a newer version of the reference. I would like to have:

project/data/bam/ # sample1.bam sample2.bam ... sampleN.bam
project/results/ref/
project/results/sample1/fastq/
project/results/sample1/count/
...
project/results/sampleN/fastq/
project/results/sampleN/count/

which would be easy to do with --outdir, but it's annoying to do otherwise. Being able to specify --outdir is also good practice.

A similar problem exists with cellranger mkref: --genome is both a path and a name.

WellJoea commented 3 years ago

You can change the path in a shell script:

cd output_dir
cellranger count --

sartmeier commented 2 years ago

Changing the folder is not feasible when running cellranger count inside a Docker container, for example when a container is used to submit a cellranger job on a (private) cloud or HPC cluster. In this case, the command passed to the container has to be a one-liner, and both cd output_dir && cellranger count and cd output_dir; cellranger count lead to the following error: /usr/bin/cd: line 2: cd: too many arguments. It would really help to have the same functionality as cellranger mkfastq, where an output directory can be defined.
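
The error message suggests the whole string was handed to /usr/bin/cd as an argument list rather than to a shell; that is an assumption about how the container was invoked. Since `cd` and `&&` are shell features, one common workaround is to pass the one-liner to `sh -c`. The sketch below builds such an argument list; `wrap_in_shell` is a hypothetical helper, and the container needs to ship a POSIX `sh` for this to work.

```python
import shlex
from typing import List


def wrap_in_shell(workdir: str, argv: List[str]) -> List[str]:
    """Build an argv that runs `argv` from `workdir` through `sh -c`."""
    # Quote every piece so spaces or special characters survive the shell.
    command = " ".join(shlex.quote(arg) for arg in argv)
    return ["sh", "-c", f"cd {shlex.quote(workdir)} && {command}"]
```

For example, `wrap_in_shell("output_dir", ["cellranger", "count", "--id=myid"])` returns `["sh", "-c", "cd output_dir && cellranger count --id=myid"]`, which can be passed as the container command in place of the bare one-liner.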

nigiord commented 1 year ago

you can change the path in shell script:

cd output_dir 
cellranger count --

This does not work either when running cellranger count in a Snakemake workflow. Changing directory inside scripts is generally considered bad practice: it can impede portability and reproducibility, since paths defined at a higher level in the workflow might be relative to the pipeline workdir.

Even though the pipestance folder is always in the current dir, cellranger mkfastq has at least the --output-dir argument (inherited from bcl2fastq I think) so at least the outputs can be put somewhere else. Something similar for count would be really useful.
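
When the launcher is itself a Python script, one way to avoid a global `cd` is to set the working directory for the child process only. The sketch below is a generic illustration using the standard library, not something from the Cell Ranger or Snakemake documentation; `run_in_dir` is a hypothetical name.

```python
import os
import subprocess
from typing import List


def run_in_dir(argv: List[str], workdir: str) -> subprocess.CompletedProcess:
    """Run argv with workdir as the child's working directory.

    Only the child process starts in workdir; the caller's own working
    directory, and any relative paths it relies on, are left untouched.
    """
    os.makedirs(workdir, exist_ok=True)
    return subprocess.run(argv, cwd=workdir, check=True,
                          capture_output=True, text=True)
```

A call such as `run_in_dir(["cellranger", "count", "--id=myid"], "project/results/sample1")` would create the pipestance under that directory. Relative input paths in the command are resolved by the child from `workdir`, so inputs are safer passed as absolute paths.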

nigiord commented 1 year ago

Apparently this is an option in the mrp command from Martian, which is used by cellranger; I’m not sure if there’s a way to exploit this somehow.

https://github.com/martian-lang/martian/blob/668f7296ac21646bfe0160cc0e9d3a2763c99638/cmd/mrp/configure.go#L160-L161

    --psdir=PATH        The path to the pipestance directory.  The default is
                        to use <pipestance_name>.

If someone finds a solution, please don’t hesitate to post here!

DaliBAmor commented 1 year ago

Hello, did you find a solution for this, please?

nigiord commented 1 year ago

Hi @DaliBAmor, no I ended up creating my outputs locally then exporting them once the job is finished. I also compress and move the pipestance directory.

    shell:
        # --id cannot be a path so the pipestance is 
        # created locally and has to be compressed and moved afterwards.
        "cellranger-arc count "
        "--id=Pipestance-count-{wildcards.sample} "
        "--reference={params.refgenomedir} "
        "--libraries={input.libraries} "
        "--jobmode=local "
        "--localcores={threads} "
        "--localmem={resources.mem_gb} "
        "--disable-ui "  # Disable web interface
        "&> {log} "
        "&& mv Pipestance-count-{wildcards.sample}/outs/* {params.outputdir}/ "
        "&& tar --remove-files -czf {output.pstance} {params.countdir} "
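
For readers who would rather do the move-and-compress step outside the shell, here is a rough Python equivalent of the last two commands. It is a sketch under assumptions: `export_pipestance` is a hypothetical name, and unlike `mv outs/*` it also moves hidden files.

```python
import shutil
import tarfile
from pathlib import Path


def export_pipestance(pipestance: str, dest_dir: str, archive: str) -> None:
    """Move <pipestance>/outs/* to dest_dir, then archive and remove the rest."""
    src = Path(pipestance)
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)

    # Roughly: mv <pipestance>/outs/* <dest_dir>/
    for item in sorted((src / "outs").iterdir()):
        shutil.move(str(item), str(dest / item.name))

    # Roughly: tar --remove-files -czf <archive> <pipestance>
    Path(archive).parent.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(str(src), arcname=src.name)
    shutil.rmtree(src)
```

The archive keeps the pipestance logs and metadata next to the exported outputs, which matches the intent of the rule above.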

DaliBAmor commented 1 year ago

thank you

benduc commented 1 year ago

Hi @nigiord , Thank you for sharing this! Could you please paste the complete rule? This would be very helpful!

nigiord commented 1 year ago

Hi @benduc , sorry for the late reply (parental leave). Here you go, hope this helps.

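import os

import pandas as pd

# This part is about preprocessing the sequencing data,
# from raw runs up to the gex and atac BAM files.
# libdir, maindir, bamdir and refgenomedir are assumed to be defined
# earlier in the Snakefile.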
def get_count_inputs(wildcards):
    # libraries files to get sample locations
    libraries = os.path.join(libdir, wildcards.sample + ".libraries.csv")
    # directories containing fastqs, in absolute path
    lib_df = pd.read_csv(libraries)
    # convert to maindir-related path
    # Remove duplicated entries and Data/ subdirectory
    fastqdirs = [
        os.path.relpath(d.removesuffix("/Data"), start=maindir)
        for d in lib_df.fastqs.unique()
    ]
    return {"libraries": libraries, "fastqdirs": fastqdirs}

rule count:
    """ Generate BAM and feature-barcode matrix from the libraries file provided.
    This rule takes the fastq directories as inputs.

    Whole directory is added as output to ensure all files are protected.
    This is just a sloppy way of not listing all cellranger-arc count outputs.
    This also ensures that, once write protection is removed, the whole directory 
    is wiped out before trying to re-run cellranger-arc count.

    {output.pstance} contains useful information, like the exact command as it was
    run on the cluster, including the path to the reference genome that was used.

    """
    output:
        featurebarcode=expand(
            os.path.join(bamdir, "{{sample}}", "filtered_feature_bc_matrix", "{name}"),
            name=["barcodes.tsv.gz", "features.tsv.gz", "matrix.mtx.gz"],
        ),
        atacfrg=os.path.join(bamdir, "{sample}", "atac_fragments.tsv.gz"),
        atacbam=os.path.join(bamdir, "{sample}", "atac_possorted_bam.bam"),
        gexxbam=os.path.join(bamdir, "{sample}", "gex_possorted_bam.bam"),
        pstance=os.path.join(bamdir, "{sample}", "pipestance-count.tar.gz"),
        outpdir=protected(directory(os.path.join(bamdir, "{sample}"))),
    input:
        # Inputs: libraries, fastqdirs
        unpack(get_count_inputs)
    log:
        os.path.join(bamdir, "Logs", "{sample}.cellranger-arc-count.log")
    params:
        refgenomedir=refgenomedir,
        outputdir=os.path.join(bamdir, "{sample}"),
        countdir="Pipestance-count-{sample}"
    threads: 8
    resources:
        # observed max-vmem was 33 to 38 GB
        mem="50G",
        mem_gb=50,  # for cellranger-arc
    shell:
        # Snakemake takes care of the cluster submission.
        # As for demultiplexing, --id cannot be a path so the pipestance is 
        # created locally and has to be compressed and moved afterwards.
        "cellranger-arc count "
        "--id={params.countdir} "
        "--reference={params.refgenomedir} "
        "--libraries={input.libraries} "
        "--jobmode=local "
        "--localcores={threads} "
        "--localmem={resources.mem_gb} "
        "--disable-ui "  # Disable web interface
        "&> {log} "
        "&& mv {params.countdir}/outs/* {params.outputdir}/ "
        "&& tar --remove-files -czf {output.pstance} {params.countdir} "
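
To make the path handling in get_count_inputs easier to follow, here is the same transformation on a plain list instead of a pandas column. It is a sketch: `fastq_dirs_relative` is a hypothetical name, the example paths are made up, and `str.removesuffix` needs Python 3.9 or later.

```python
import os
from typing import Iterable, List


def fastq_dirs_relative(fastq_paths: Iterable[str], maindir: str) -> List[str]:
    """Deduplicate paths, drop a trailing /Data, and make them relative to maindir."""
    unique = list(dict.fromkeys(fastq_paths))  # keeps first-seen order
    return [
        os.path.relpath(path.removesuffix("/Data"), start=maindir)
        for path in unique
    ]
```

On a POSIX system, `fastq_dirs_relative(["/proj/runs/gex/Data", "/proj/runs/gex/Data", "/proj/runs/atac"], "/proj")` returns `["runs/gex", "runs/atac"]`.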

makrez commented 11 months ago

Is there any update on this? It would be really handy to have an --out-dir variable.

DaliBAmor commented 10 months ago

Problem solved, thank you very much, and sorry for the delay. Your messages went to my spam folder.

deto commented 10 months ago

This would be useful to have. I ran into this issue myself writing a Snakemake workflow today.

benduc commented 10 months ago

Hi all, if anyone is running into the same issue, I found a workaround using Snakemake's shadow rules:

import pandas as pd
import re

configfile: "config/config.yaml"

# Samples to process
samplesData = pd.read_csv(config["SAMPLES_CSV_PATH"])

SAMPLES = samplesData["sample"].tolist()
CELLRANGER_PATH = config["CELLRANGER_PATH"]
TRANSCRIPTOME_DIR = config["TRANSCRIPTOME_DIR"]

rule all:
    input: expand("output/{sample}/filtered_feature_bc_matrix.h5", sample=SAMPLES)

rule count:
    output:
        filtered_matrix = "output/{sample}/filtered_feature_bc_matrix.h5",
        raw_matrix = "output/{sample}/raw_feature_bc_matrix.h5",
        web_summary = "output/{sample}/web_summary.html",
        metrics_summary = "output/{sample}/metrics_summary.csv",
        bam = "output/{sample}/possorted_genome_bam.bam",
        bai = "output/{sample}/possorted_genome_bam.bam.bai",
    input:
        fastq_dir = "input/fastq/{sample}"
    shadow:
        "shallow"
    params:
        transcriptome = TRANSCRIPTOME_DIR,
        cellranger_path = CELLRANGER_PATH,
        chemistry = lambda wildcards: samplesData.loc[samplesData['sample'] == wildcards.sample, 'chemistry'].iloc[0],
    shell:
        """
        {params.cellranger_path} count \
        --id={wildcards.sample} \
        --transcriptome={params.transcriptome} \
        --fastqs="{input.fastq_dir}" \
        --sample={wildcards.sample} \
        --nosecondary \
        --chemistry={params.chemistry} \
        && \
        cp -p {wildcards.sample}/outs/filtered_feature_bc_matrix.h5 {output.filtered_matrix} \
        && \
        cp -p {wildcards.sample}/outs/raw_feature_bc_matrix.h5 {output.raw_matrix} \
        && \
        cp -p {wildcards.sample}/outs/web_summary.html {output.web_summary} \
        && \
        cp -p {wildcards.sample}/outs/metrics_summary.csv {output.metrics_summary} \
        && \
        cp -p {wildcards.sample}/outs/possorted_genome_bam.bam {output.bam} \
        && \
        cp -p {wildcards.sample}/outs/possorted_genome_bam.bam.bai {output.bai}
        """

This works with a CSV file describing the samples, in this kind of format:

sample,dataset,chemistry
sample_1,dataset_1,auto
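
The `chemistry` lambda in the rule looks up one value per sample in that table. A pandas-free sketch of the same lookup, using only the standard library, might look like this; `chemistry_for` and `SAMPLES_CSV` are hypothetical names, and the data is the example row above.

```python
import csv
import io

SAMPLES_CSV = """sample,dataset,chemistry
sample_1,dataset_1,auto
"""


def chemistry_for(sample: str, csv_text: str) -> str:
    """Return the chemistry value of the first row whose sample column matches."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["sample"] == sample:
            return row["chemistry"]
    raise KeyError(f"sample not found: {sample}")
```

Here `chemistry_for("sample_1", SAMPLES_CSV)` returns `"auto"`, and an unknown sample name raises a `KeyError` instead of failing later inside the shell command.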

Snakemake creates a temporary directory from which you can export the files you are interested in, and then it automatically removes this temporary directory.

Hope this helps!
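
The idea behind the shadow workaround can be illustrated without Snakemake: run the tool in a throwaway directory, copy out the files you want, and discard everything else. This is a rough emulation of the concept, not Snakemake's actual implementation; `run_in_scratch` and its parameters are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable, Dict


def run_in_scratch(produce: Callable[[Path], None], wanted: Dict[str, str]) -> None:
    """Run `produce` in a throwaway directory and keep only selected files.

    `wanted` maps paths relative to the scratch directory to final destinations.
    """
    with tempfile.TemporaryDirectory() as scratch:
        scratch_dir = Path(scratch)
        produce(scratch_dir)  # e.g. launch the tool with its cwd set to scratch_dir
        for rel_path, final_path in wanted.items():
            Path(final_path).parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(scratch_dir / rel_path, final_path)
    # The scratch directory and everything left in it has been removed here.
```

In the rule above, the `cp -p` lines play the role of `wanted`, and Snakemake takes care of creating and deleting the scratch directory.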

evolvedmicrobe commented 10 months ago

Cell Ranger now comes with an --output-dir option to enable this, more details are available here
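
As a sketch of how this could be used from a workflow script, the helper below assembles a command line with the `--output-dir` flag named in this comment. The other flags are the ones used earlier in this thread; `build_count_command` is a hypothetical name, and the exact behaviour of `--output-dir`, the versions that support it, and any additional required options should be checked against the Cell Ranger documentation.

```python
from typing import List


def build_count_command(sample: str, transcriptome: str,
                        fastqs: str, output_dir: str) -> List[str]:
    """Assemble a `cellranger count` argv that writes to output_dir."""
    return [
        "cellranger", "count",
        f"--id={sample}",
        f"--transcriptome={transcriptome}",
        f"--fastqs={fastqs}",
        f"--sample={sample}",
        f"--output-dir={output_dir}",  # flag name as given in the comment above
    ]
```

The resulting list can be passed to `subprocess.run` or joined into a Snakemake `shell:` string, which would remove the need for the move or shadow workarounds discussed earlier.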

nigiord commented 6 months ago

Cell Ranger now comes with an --output-dir option to enable this, more details are available here

Hi @evolvedmicrobe, thank you very much! (for people wondering, the new link is here).

Is there any plan to also add this to cellranger-arc count? That would be really useful for people working on scMultiome.