aidenlab / juicer

A One-Click System for Analyzing Loop-Resolution Hi-C Experiments
http://aidenlab.org
MIT License

CPU/common/cleanup.sh can unexpectedly modify INPUT fastq files #239

Open rsharris opened 3 years ago

rsharris commented 3 years ago

Are you sure this is a bug? Yes, it is a bug.

Describe the bug CPU/common/cleanup.sh, lines 34-41 gzip the INPUT fastq files (unless they were already zipped).

This is extremely unusual and presumes that juicer is the only thing the user is going to do with those files.

To Reproduce Steps to reproduce the behavior:

  1. Set up a directory to run juicer.sh, including a fastq subdirectory
  2. Run juicer.sh
  3. After juicer.sh finishes, run cleanup.sh
  4. If your fastq files weren't already zipped, it's gonna zip them behind your back.
  5. If you didn't have write access, cleanup.sh will (presumably) fail, and won't complete the rest of the cleanup, leaving your aligned subdirectory in a half-cleaned state.
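The effect in step 4 can be seen without running juicer at all. The following is a minimal sketch, not the actual contents of cleanup.sh: the temporary directory and the file name `sample_R1.fastq` are made up for the demonstration. It only shows what plain `gzip` does to an uncompressed file, namely replace it with a `.gz` file of the same base name.

```shell
#!/usr/bin/env bash
# Sketch only: the directory layout and file name are invented for this demo.
workdir=$(mktemp -d)
mkdir "$workdir/fastq"
printf '@read1\nACGT\n+\nIIII\n' > "$workdir/fastq/sample_R1.fastq"

# gzip compresses in place: it writes sample_R1.fastq.gz into the same
# directory and then removes the original sample_R1.fastq.
gzip "$workdir/fastq/sample_R1.fastq"

# Only the .gz file is left in the directory.
ls "$workdir/fastq"
```

Any other pipeline that was pointed at the uncompressed path would no longer find its input after this step.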

Expected behavior I don't expect tools to modify my input files.

Screenshots (no screenshot)

Desktop (please complete the following information): (not relevant)

Additional context (no other context)

sa501428 commented 3 years ago

Hey @rsharris, just to clarify, the concern is that the cleanup.sh script is compressing the fastq files? Are you requesting a flag for that script to not gzip the fastqs but do the rest of the cleanup? (maybe it's a script naming issue, but it's intentional that cleanup.sh compresses the input fastq files)

rsharris commented 3 years ago

I have two concerns. One is that the script modifies my input files, which is something I wouldn't have any reason to expect.

The other concern is that if my input files are write-protected, the script could fail, leaving my aligned/ subdirectory in a half-cleaned state from which (I presume) there's no automated recovery. By which I mean, of the seven files cleanup.sh intends to zip (other than the fastq files), six would be left unzipped.
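One way to avoid the half-cleaned state would be to compress each file independently and report failures at the end, rather than stopping at the first one. The following is a rough sketch of that idea, not code from juicer: the helper name `gzip_each` is hypothetical, and I haven't checked how cleanup.sh actually sequences its gzip calls.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of juicer: try to compress every file given
# on the command line, and keep going when one of them fails.
gzip_each() {
    local failed=0 f
    for f in "$@"; do
        # Skip anything that is missing or is not a regular file.
        [ -f "$f" ] || continue
        # Skip files that already carry a .gz suffix.
        case "$f" in *.gz) continue ;; esac
        if ! gzip "$f" < /dev/null; then
            echo "warning: could not compress $f" >&2
            failed=$((failed + 1))
        fi
    done
    # Succeed only if every attempted compression succeeded.
    [ "$failed" -eq 0 ]
}
```

Called as, say, `gzip_each aligned/merged_nodups.txt fastq/*.fastq`, a failure on a write-protected fastq file would produce a warning and a non-zero return status, while the files in aligned/ would still be compressed.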

Be aware that I'm new to juicer, and this is/was my first attempt at running it on a full genome. (I previously ran it on a small toy dataset.) So I don't really know the specifics of how the contents of the aligned directory are going to be used downstream, or whether there are downstream tools that won't work if, e.g., merged_nodups.txt isn't zipped.

To me the solution is to remove the fastq zipping lines from that script.

rsharris commented 3 years ago

To clarify, by "if my input files are write-protected", I mean the following. cleanup.sh assumes that (a) it is able to write a file into the fastq directory, and (b) it is able to delete files from that directory.

In my case, the fastq 'directory' is a symlink to the real data directory, which is a shared directory where many users have read access to the dataset.
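Both assumptions could be checked up front. Creating and deleting entries in a directory requires write and search (execute) permission on the directory itself, and the shell's `-d`, `-w` and `-x` tests follow symlinks, so they report on the real data directory behind the link. The following is a sketch of such a guard; the helper name `can_modify_dir` and the default `fastq` path are my own assumptions, not something taken from cleanup.sh.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of juicer: succeed only if the directory
# (or the target of a symlink to a directory) allows creating and deleting
# entries for the current user.
can_modify_dir() {
    local dir=$1
    [ -d "$dir" ] && [ -w "$dir" ] && [ -x "$dir" ]
}

# Assumed layout: a fastq/ subdirectory or symlink in the working directory.
fastq_dir=${1:-fastq}
if can_modify_dir "$fastq_dir"; then
    echo "would compress fastq files in $fastq_dir"
else
    echo "skipping fastq compression: cannot modify $fastq_dir" >&2
fi
```

Note that these tests answer for the current user only, and a privileged user will typically pass them regardless of the permission bits.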