barricklab / breseq

breseq is a computational pipeline for finding mutations relative to a reference sequence in short-read DNA resequencing data. It is intended for haploid microbial genomes (<20 Mb). breseq is a command line tool implemented in C++ and R.
http://barricklab.org/breseq
GNU General Public License v2.0
149 stars 21 forks source link

Using -s option for reference file #301

Closed hm22rice closed 2 years ago

hm22rice commented 2 years ago

Hello, I am trying to understand the function of the -s option and if it will be applicable for me.

I am using a reference strain "A" whose complete genome is available on Genbank. I have modified strain "A" by chromosomally inserting a GFP gene in it. However, I do not know the insertion site of GFP.

If I want to use breseq to call the junction created by the GFP gene in my reference, would it be suitable to use the -s option? If so, would this usage be correct: breseq -r A.gbk -s GFP.fasta short_reads.fq.gz

Once I have identified the insertion site, I would like to create a modified reference sequence which includes the insertion. Would I be able to use gdtools to do that? Or is there another way to do so.

Thank you so much, Heer

danieldeatherage commented 2 years ago

Hi Heer,

You are correct using a junction only reference, and gdtools apply function can give you the reference file you want with slight additional work and details.

  1. Your breseq command is correct.

I expect you will find 2 unassigned new junctions towards the bottom of the mutations prediction page. 1 will correspond to the 5` end of the GFP insert, and the other to the 3` end of the insert. To get a reference file with the GFP sequence int the genome, you will need to create a new genome diff file with the following 2 lines: #=GENOME_DIFF<TAB>1.0 INS<TAB>.<TAB>.<TAB><Seq ID><TAB><5 prime location in genome where GFP junction was found><tab><GFP sequence to insert>

Notes on the above:

Next you will use gdtools apply as follows to generate your new reference: gdtools apply -r A.gbk -f gff3 -o A.updated.gff3 <new genome diff file you just created>

Strongly recommend you verify correct insertion by then rerunning breseq: breseq -r A.updated.gff3 short_reads.fq.gz

Assuming you do not see any new unassigned junctions (particularly corresponding to the genomic position you listed in your genome diff file) the reference file is accurate with resepect to where the GFP sequence integrated.

Additional steps you may be interested in:

  1. adding an annotation to the gff3 file listing the a portioon of the inserted GFP sequence as a CDS and any other associated annotations you wish to keep track of.
  2. A second gdtools apply to update the reference file you got from genbank to more accurately describe your specific strain as in my experience, there are always several SNPS at a minimum and more commonly multiple indels/deletions/amps/etc. To do this i would recommend using the genome diff file breseq generates (found in the output directory as output.gd)

Best of luck, Dan

hm22rice commented 2 years ago

Hi Dan, Thank you so much for the quick response. I will give this a try and let you know if I have any other questions. I appreciate your help.

Heer

jeffreybarrick commented 2 years ago

What @danieldeatherage said, and some other info that might be helpful to you or someone else who comes across this...

1/ There's also an INT mutation type that you can use for copying something over from one sequence to another with gdtools APPLY. It's more powerful than an INS for this case. It will let you reverse complement the copy if it is on the other strand and copy over the CDS/gene annotations if you used an annotated GFP sequence file that was in GenBank/GFF3 format.

The line would read something like:

INT<TAB>.<TAB>.<TAB><Seq ID><TAB><position where GFP was inserted><tab><number of bases replaced><tab>GFP:1-700

If the GFP sequence was 700 bases long, names GFP in GFP.fasta and in the forward direction. GFP:700-1 if it was in the reverse direction.

More info: https://barricklab.org/twiki/pub/Lab/ToolsBacterialGenomeResequencing/documentation/gd_format.html#int-integration-mutation

2/ If you inserted more than just GFP (some flanking sequences or it's in a transposon), be sure that you make the GFP sequence go all the way up to those ends that you know. breseq will have trouble finding junctions if your sequence and the genome sequence don't "meet" or at least come close (within something like <20 base pairs).

hm22rice commented 2 years ago

Thank you so much for the help. I was able to integrate the exogenous sequence into the genome of strain A.

Just a couple of things I did differently (putting this out here for anybody else who might want to do something similar):

  1. I used the INT mutation type in the new genome diff file. The line needed 7 tab separated columns instead of 6 so I modified it to look like this: INT..GFP:1-700

In my case, number of bases to replace in reference genome = 0

  1. After creating the new genome diff file, I ran gdtools using the following command line: gdtools APPLY -r A.gbk -r GFP.fasta -f gff3 -o A.updated.gff3 new_file.gd

I have two reference sequences since the new genome diff file only has the nucleotide positions of the integrated sequence but not the actual sequence.

Thank you, Heer

jeffreybarrick commented 2 years ago

Thanks for posting the correction! I edited the code above to add the missing size column that indicates how many bases from the original sequence at that position should be removed before inserting the integrated sequence.