HorvathLab / NGS

Next-Gen Sequencing tools from the Horvath Lab
https://horvathlab.github.io/NGS/
MIT License
39 stars 16 forks source link

Problem when .gz output file is present with the same name #12

Closed nigiord closed 1 year ago

nigiord commented 1 year ago

version: screadcounts-1.3.1

Hi and thank you for providing scReadCounts,

I integrated a scReadCounts step into a larger pipeline and I’m having an issue with the way scReadCounts choose how to save its outputs. Currently, it seems that for some reason, scReadCounts decides to save the output as a gzipped file if a file with the same name and the .gzextension already exists (see below at line 344 of the src/datasets.py file, unfortunately not present in this github repository).

image

Hence if this {outputname}.gz file already exists, the readCounts step overwrites it. The problem is that the next steps are looking for the filename without the .gz extension. An error is then thrown. This file exists in my pipeline because I have a step that gzip all my outputs at the end, so the issue actually occurs on subsequent runs of the pipeline (since the old outputs are still present). I also cannot directly ask scReadCounts to create .gz files since they are not defined as acceptable output extensions (even though a large part of the code seems dedicated to handling gzip files).

I’m not sure why line 344 exists and if it’s necessary. An easy fix would be to remove it. It would also be nice if gzipped files where an acceptable output type, since the objects defined in datasets.py seems to already support that.

Cheers, −Nils

edwardsnj commented 1 year ago

Hmmm. Yes, it probably shouldn’t do that. I’ll take a look today.

edwardsnj commented 1 year ago

Actually, the source dataset.py is in the common/src folder (many of the projects in this repository use source code from this folder), but that's neither here nor there.

I won't get into whether it makes sense to have a file you want to keep, with a .gz extension in the same folder as the intended output file without... shrug

Clearly correct behavior is not to write to a filename not supplied. There is utility to reading from the .gz file if the filename specified is not there and the .gz file is, so I'll leave this in for reading.

Not yet sure how I feel about directly writing to a .gz file. If set up it should be explicit, I guess, but these output files are not that big, not sure there is clear win here.

nigiord commented 1 year ago

Hi @edwardsnj, thank you for spending time on this.

To give a little more information: I use the bioconda screadcounts package version 1.3.1 and I have the following directory structure inside it:

screadcounts-1.3.1-0
├── bin
│   ├── readCounts
│   ├── readCountsMatrix
│   ├── scReadCounts
│   ├── scVarLoci
│   └── varLoci
├── build_env_setup.sh
├── conda_build.sh
├── docs
│   ├── Discovery.md
│   ├── Examples.md
│   ├── Filtering.md
│   ├── Grouping.md
│   ├── InputFiles.md
│   ├── Installation.md
│   ├── MultiInp.md
│   ├── OutputFiles.md
│   ├── SCReadCounts_GUI_re-run.png
│   ├── SCReadCounts_advanced.png
│   ├── SCReadCounts_basic.png
│   ├── SCReadCounts_filter.png
│   ├── SCReadCounts_output_options.png
│   └── Usage.md
├── metadata_conda_debug.yaml
├── scripts
│   ├── multi_inputs.sh
│   └── sc_multi_inputs.sh
└── src
    ├── __pycache__
    │   ├── chromreg.cpython-37.pyc
    │   ├── execute.cpython-37.pyc
    │   ├── fisher.cpython-37.pyc
    │   ├── pileups.cpython-37.pyc
    │   ├── pysamimport.cpython-37.pyc
    │   ├── release.cpython-37.pyc
    │   └── util.cpython-37.pyc
    ├── chromreg.py
    ├── dataset.py
    ├── execute.py
    ├── exonicFilter.py
    ├── filter.ini
    ├── fisher.py
    ├── group.ini
    ├── intervals.py
    ├── intervalscan.py
    ├── optional_requirements.txt
    ├── optparse_gui
    │   ├── __init__.py
    │   ├── __pycache__
    │   │   ├── __init__.cpython-37.pyc
    │   │   └── needswx.cpython-37.pyc
    │   └── needswx.py
    ├── phasedReadCounts.py
    ├── pileups.py
    ├── pysamimport.py
    ├── readCounts.py
    ├── readCountsMatrix.py
    ├── refnames.py
    ├── release.py
    ├── requirements.txt
    ├── scReadCounts.py
    ├── scVarLoci.py
    ├── split.py
    ├── util.py
    ├── varLoci.py
    └── variantloci.py

8 directories, 59 files

...hence why I was referencing to src/dataset.py

I won't get into whether it makes sense to have a file you want to keep, with a .gz extension in the same folder as the intended output file without... shrug

I find it convenient to just have a rule like this in my Snakemake pipeline to handle the gzipping:

rule gzip_tsv:
    output: "{filename}.tsv.gz"
    input: "{filename}.tsv"
    shell: "gzip {input}"

When the temporary tsv is not in the same directory, the rule becomes less straightforward. In any case as you mention I think the correct behavior to avoid corner cases is to only read/write to filenames supplied. Having files inadvertently gzipped can be frequent, especially on a shared filesystem.

Not yet sure how I feel about directly writing to a .gz file. If set up it should be explicit, I guess, but these output files are not that big, not sure there is clear win here.

I agree that allowing for .gz as outputs would be a nice touch but is not that important :) , since it’s rather easy to had a step that gzip the outputs.

Cheers, −Nils

edwardsnj commented 1 year ago

Hi @nigiord, I made a new release that fixes this behavior. It is currently just waiting for a review and merge in the bioconda repository, hopefully this will happen shortly.

nigiord commented 1 year ago

Thank you very much @edwardsnj !

edwardsnj commented 1 year ago

@nigiord latest bioconda version of screadcounts now matches the latest release SCReadCounts-1.3.2. Cheers!

AneliaHorvath commented 1 year ago

Thank you!

On Fri, Jan 27, 2023 at 2:02 PM Nathan Edwards @.***> wrote:

@nigiord https://github.com/nigiord latest bioconda version of screadcounts now matches the latest release SCReadCounts-1.3.2 https://github.com/HorvathLab/NGS/releases/tag/SCReadCounts-1.3.2. Cheers!

— Reply to this email directly, view it on GitHub https://github.com/HorvathLab/NGS/issues/12#issuecomment-1406952338, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADUOVDO6OY4YYXWMCD5AOTLWUQLUNANCNFSM6AAAAAAUAMJ5AI . You are receiving this because you are subscribed to this thread.Message ID: @.***>

-- Anelia Horvath, PhD

Director, McCormick Genomic and Proteomic CenterThe George Washington University2300 Eye Street NW, Suite 731Washington, DC 20037