barricklab / breseq

breseq is a computational pipeline for finding mutations relative to a reference sequence in short-read DNA resequencing data. It is intended for haploid microbial genomes (<20 Mb). breseq is a command line tool implemented in C++ and R.
http://barricklab.org/breseq
GNU General Public License v2.0
149 stars 21 forks source link

Rename output.gd file with the name of the fastq file or part of it #275

Open loukesio opened 3 years ago

loukesio commented 3 years ago

Dear all,

Does anybody have an idea on how to rename hundreds output.gd files with the name of the fastq file they are coming from? So far I am doing it manually.

Thank you for your time

danieldeatherage commented 3 years ago

I often use bash for loops to rename the 'output.gd' files based on what I set the output fold as when I started the run. So for the breseq commands: breseq -r Reference.gbk -o Output_folder/Sample1 fastq1.R1.fastq fastq1.R2.fastq breseq -r Reference.gbk -o Output_folder/Sample2 fastq2.R1.fastq fastq2.R2.fastq

After the breseq runs are complete, i would execute the following command inside the Output_folder: for sample in *; do cp $sample/output/output.gd $sample.gd; done

I actually tend to get more elaborate putting this as part of a small bash script i have named "genome_diff_collection.sh"

#! /bin/bash

for sample in *; do cp $sample/output/output.gd $sample.gd; cp $sample/output/evidence/annotated.gd $sample.annotated.gd; done mkdir all_gds_annotated mv *.annotated.gd all_gds_annotated mkdir all_gds mv *.gd all_gds

If you aren't using unique output directories at the breseq command line, I have some ideas of how you could use grep to capture the "#=READSEQ" lines and edit them in a text editor or excel to give you a list of copy commands that you would paste onto the prompt to rename everything for you.

loukesio commented 3 years ago

Thank you Daniel for sharing your code. I think thats the way to go

jeffreybarrick commented 3 years ago

To add a couple of other ways of accomplishing workflows that involve many samples to the thread...

1) Use gdtools utility commands with GenomeDiff files describing the input/output

Start by creating GenomeDiff files with #=TITLE, #=READSEQ and #=REFSEQ information (and optionally #=POPULATION, #=TREATMENT, and #=TIME). For example,....

#=GENOME_DIFF 1.0
#=TITLE Ara-1_2000gen_1164B
#=REFSEQ    https://raw.githubusercontent.com/barricklab/LTEE/7da91974eafac0c5a8f903ae57275795d4395af2/reference/REL606.gbk
#=READSEQ   ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR051/ERR051713/ERR051713.fastq.gz
#=TIME  2000
#=POPULATION    Ara-1
#=TREATMENT LTEE
#=CLONE B

Then, name these like sample1.gd, 'sample2.gd, etc. and put them in a folder named01_Datain your directory you will use for running _breseq_. Plop all of the read and reference sequences into a folder called02_Downloadsin the same directory. (If you use URLs in the location of these files as in the example, you can usegdtools DOWNLOAD` to retrieve all of the files to that directory. But, it is also OK to copy them there yourself. As long as the filenames at the end of the paths match what is in the GD file at the end of the path, it will detect and us them fine.)

Now, you can use gdtools RUNFILE to generate a file that has breseq commands with the READ and REFERENCE files set correctly for each sample. Use the option --options <your_breseq_options> to change the options that are added to the breseq commands. You can run this script, or farm it out to a cluster.

After the output is present for all samples, commands like this should now work for comparing everything, because each output.gd file inherits the metadata (title) from the original GD file you made:

gdtools COMPARE -r 02_Downloads/<your_references> 03_Output/*/output/output.gd

With this method, you still have to create these initial GenomeDiff files that describe each sample. I think @danieldeatherage has a script for creating these from a folder of FASTQ and some metadata.

2) Use the utility script batch_run.pl to automate commands being run on sets of directories

This is a script I created for the common situation of needing to do something once for each directory inside of the current directory or once for each file inside of the current directory. It accomplishes what @danieldeatherage's bash script does, but with some other functionality (by invoking a Perl script).

Download this and put it in your $PATH https://github.com/barricklab/barricklab/blob/master/batch_run.pl

To get the full help, use:

batch_run.pl --man

I always recommend running things in test mode first to print what commands it will run if you are not in test mode first (so you don't do something crazy by mistake).

batch_run.pl  -t  ...

Now, if you are in a folder that contains many folders of breseq output for different samples named sample1, sample2, sample3, etc., then you can use this command to copy all output.gd files and rename them to sample1.gd, sample2.gd, sample3.gd, etc.

mkdir gd
batch_run.pl "cp output/output.gd ../gd"

What's happening is the script enters each directory and runs this command, replacing #d with the name of the directory. The ../gd copies back up a level to the gd directory we made outside of directory it is in.

You can use the -0 flag to operate on all files in the current directory (instead of on all directories). In this case you can use #e in the command to substitute the name of the file with the extension removed.