Open loukesio opened 3 years ago
I often use bash for loops to rename the 'output.gd' files based on what I set the output fold as when I started the run. So for the breseq commands:
breseq -r Reference.gbk -o Output_folder/Sample1 fastq1.R1.fastq fastq1.R2.fastq
breseq -r Reference.gbk -o Output_folder/Sample2 fastq2.R1.fastq fastq2.R2.fastq
After the breseq runs are complete, i would execute the following command inside the Output_folder:
for sample in *; do cp $sample/output/output.gd $sample.gd; done
I actually tend to get more elaborate putting this as part of a small bash script i have named "genome_diff_collection.sh"
#! /bin/bash
for sample in *; do cp $sample/output/output.gd $sample.gd; cp $sample/output/evidence/annotated.gd $sample.annotated.gd; done
mkdir all_gds_annotated
mv *.annotated.gd all_gds_annotated
mkdir all_gds
mv *.gd all_gds
If you aren't using unique output directories at the breseq command line, I have some ideas of how you could use grep to capture the "#=READSEQ" lines and edit them in a text editor or excel to give you a list of copy commands that you would paste onto the prompt to rename everything for you.
Thank you Daniel for sharing your code. I think thats the way to go
To add a couple of other ways of accomplishing workflows that involve many samples to the thread...
1) Use gdtools
utility commands with GenomeDiff files describing the input/output
Start by creating GenomeDiff files with #=TITLE
, #=READSEQ
and #=REFSEQ
information (and optionally #=POPULATION
, #=TREATMENT
, and #=TIME
). For example,....
#=GENOME_DIFF 1.0
#=TITLE Ara-1_2000gen_1164B
#=REFSEQ https://raw.githubusercontent.com/barricklab/LTEE/7da91974eafac0c5a8f903ae57275795d4395af2/reference/REL606.gbk
#=READSEQ ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR051/ERR051713/ERR051713.fastq.gz
#=TIME 2000
#=POPULATION Ara-1
#=TREATMENT LTEE
#=CLONE B
Then, name these like sample1.gd
, 'sample2.gd, etc. and put them in a folder named
01_Datain your directory you will use for running _breseq_. Plop all of the read and reference sequences into a folder called
02_Downloadsin the same directory. (If you use URLs in the location of these files as in the example, you can use
gdtools DOWNLOAD` to retrieve all of the files to that directory. But, it is also OK to copy them there yourself. As long as the filenames at the end of the paths match what is in the GD file at the end of the path, it will detect and us them fine.)
Now, you can use gdtools RUNFILE
to generate a file that has breseq
commands with the READ and REFERENCE files set correctly for each sample. Use the option --options <your_breseq_options>
to change the options that are added to the breseq commands. You can run this script, or farm it out to a cluster.
After the output is present for all samples, commands like this should now work for comparing everything, because each output.gd
file inherits the metadata (title) from the original GD file you made:
gdtools COMPARE -r 02_Downloads/<your_references> 03_Output/*/output/output.gd
With this method, you still have to create these initial GenomeDiff files that describe each sample. I think @danieldeatherage has a script for creating these from a folder of FASTQ and some metadata.
2) Use the utility script batch_run.pl
to automate commands being run on sets of directories
This is a script I created for the common situation of needing to do something once for each directory inside of the current directory or once for each file inside of the current directory. It accomplishes what @danieldeatherage's bash script does, but with some other functionality (by invoking a Perl script).
Download this and put it in your $PATH
https://github.com/barricklab/barricklab/blob/master/batch_run.pl
To get the full help, use:
batch_run.pl --man
I always recommend running things in test mode first to print what commands it will run if you are not in test mode first (so you don't do something crazy by mistake).
batch_run.pl -t ...
Now, if you are in a folder that contains many folders of breseq output for different samples named sample1
, sample2
, sample3
, etc., then you can use this command to copy all output.gd
files and rename them to sample1.gd
, sample2.gd
, sample3.gd
, etc.
mkdir gd
batch_run.pl "cp output/output.gd ../gd"
What's happening is the script enters each directory and runs this command, replacing #d
with the name of the directory. The ../gd
copies back up a level to the gd
directory we made outside of directory it is in.
You can use the -0
flag to operate on all files in the current directory (instead of on all directories). In this case you can use #e
in the command to substitute the name of the file with the extension removed.
Dear all,
Does anybody have an idea on how to rename hundreds output.gd files with the name of the fastq file they are coming from? So far I am doing it manually.
Thank you for your time