barricklab / breseq

breseq is a computational pipeline for finding mutations relative to a reference sequence in short-read DNA resequencing data. It is intended for haploid microbial genomes (<20 Mb). breseq is a command line tool implemented in C++ and R.
http://barricklab.org/breseq
GNU General Public License v2.0
144 stars 21 forks source link

Detect antibiotic resistance marker in a WGS of a single colony #277

Closed loukesio closed 2 years ago

loukesio commented 3 years ago

Dear all,

I have constructed a bacterial colony with an antibiotic resistance marker which is 2000bp long. For each of my colonies I have done Sanger - Sequencing of the area to confirm its existence. Overall my sequencing went well. I have enough data and of good quality. I see this issue in multiple samples and numerous bacteria species. I do not see the insertion neither on the Unassigned missing coverage evidence nor on the Unassigned new junction evidence area.

When I do WGS and run breseq I can not detect the insertion of this area? Do you have any idea why? Am I doing something wrong with the parameters I set? Shall I change my reference and add the antibiotic resistance marker? Any help is appreciated.

this is my command btw

/data/biosoftware/breseq/breseq -j 8 -r .../REFSBW25.gbk   /.../test/12_4_c1_leaky_S1P_1.fastq /.../test/12_4_c1_leaky_S1P_2.fastq
jeffreybarrick commented 3 years ago

Yes, this is expected. breseq performs reference-based mapping and analysis. This means that it only maps reads to the reference sequences that you provide. If there is a "novel" insertion in a genome, such as your antibiotic resistance cassette, it doesn't know about that. The cassette's presence might reveal itself as some unassigned missing coverage at the insertion site (if some bases of the genome were deleted during the insertion). However, the reads that map there will be thrown into data/*unmatched.fastq files. This file could be used to assemble any novel sequences that are present, but that's not the best way to get the output that you want.

Solution

For genomes that include an insertion of a cassette of some sort, you should be able to determine where the cassette was inserted by providing the sequence of the cassette itself or a plasmid that contains the cassette as an additional reference argument. Then, breseq will output unassigned junctions showing you where they are connected (based on split-read mapping). It's best to provide the sequence of the cassette as another -r argument.

So, your command would become something like this

/data/biosoftware/breseq/breseq -j 8 -r .../REFSBW25.gbk  -r .../YOURCASSETTE.gbk /.../test/12_4_c1_leaky_S1P_1.fastq /.../test/12_4_c1_leaky_S1P_2.fastq

If you want to have breseq resolve these further and make your life easier, you can also add a repeat_region annotation that matches the range of bases in the cassette (or from the plasmid) that gets inserted by your modification procedure. Then, breseq would output a MOB type mutation with whatever the name of that cassette is. That is, it will resolve the two new unassigned junctions to one fully predicted mutation.

I've been thinking about making an example of this procedure. Maybe that would help explain things? We use it ourselves fairly often...

loukesio commented 3 years ago

Dear Jeff Barrick, Thank you for the prompt reply I highly appreciate. Its great and you saved the Friday. An example of this procedure indeed would be great

thank you for your time

loukesio commented 2 years ago

Dear all,

I have another question on the topic. In order to detect mutations, in a casstte, I have edited my genome in geneious, then save it and export is as a gb file. The annotations, the gff file, all looks good, but when I run Breseq I get this error

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!> FATAL ERROR <!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! Error in reference file. Feature on sequence REFSBW25_barcoded has coordinates (6699684-6701516) that are both outside of the circular sequence bounds (1-6699662). FILE: reference_sequence.cpp LINE: 1764

Are you aware of how can someone edit the circular sequence bounds?

thank you for your time and I am looking forward to your reply

Any comment or direction are highly appreciated

jeffreybarrick commented 2 years ago

That's odd that Geneious apparently saves a GenBank file with a feature with coordinates fully outside of the sequence bounds. If you could share the file with me by email, I can take a look to see if this is a problem with the file or possibly with breseq misinterpreting it.

A Genbank file records the length of the sequence in several places. One the first LOCUS line, as the source feature, and (implicitly) as the length of the nucleotide sequence at the end of the file. breseq will usually give you a warning if these disagree and tell you which one it is using. I don't suppose this happened before the FATAL ERROR?

You can see these in the example GenBank record here.

loukesio commented 2 years ago

Dear Jeff,

Thank you so much for the reply, and for sending me the example. Indeed its really odd to me as well. I just you an email with my modified reference genome. We highly appreciate that you take the time

jeffreybarrick commented 2 years ago

OK. Two observations from the file you sent.

First, in breseq v0.36.1, this input runs file with a warning because we added some stricter checks and robust fallback procedures when different sequence lengths in the GenBank file don't agree. Yay!

Second, the five offending lines in your GenBank file are these:

     source          6697563..6699662
     source          <6697563..>6699662
     source          complement(<6697564..6698026)

and later in the file

     source          complement(6698027..6699293)

and later in the file

     source          complement(6699294..>6699662)

These are additional copies of what should be a special feature that occurs once per GenBank file. They've become duplicated and mangled sometime during editing in Geneious (probably from pasting together different sequences).

(Basically, there should only be one source feature at the top of the file, and its coordinates should match the length of the entire sequence, which is 1..6724639 in this case.)

If I delete these lines, it runs fine with earlier breseq version too.

loukesio commented 2 years ago

Thank you! It works perfectly!