barricklab / breseq

breseq is a computational pipeline for finding mutations relative to a reference sequence in short-read DNA resequencing data. It is intended for haploid microbial genomes (<20 Mb). breseq is a command line tool implemented in C++ and R.
http://barricklab.org/breseq
GNU General Public License v2.0
137 stars 21 forks source link

Including Junctions and Missing Coverage information when building a phylogenetic tree #364

Open RayaRomm opened 5 months ago

RayaRomm commented 5 months ago

Dear Prof. Barrick,

I regularly use breseq, finding it to be very powerful and valuable tool for my research.

I recently used gdtools PHYLOGENY to get the phylogenetic tree for my clonal samples. I wanted to ask, when creating a phylogenetic tree with gdtools PHYLOGENY, is there an option to take into account the junctions evidence and the missing coverage evidence in addition to the predicted mutations?

Thank you very much in advance. Best, Raya

jeffreybarrick commented 5 months ago

No, it's not possible to use evidence for gdtools PHLYOGENY. It will only operate on mutations. To get nice trees, we will manually resolve MC and JC items into DEL, MOB, etc., run gdtools APPLY to generate the hypothesized evolved genome, and then run breseq with the original reads against the hypothesized genome in order to make sure we entered the right mutation. (It should come back "clean" with no mutations or junctions or missing coverage predicted.)

It is possible to classify JC items as being identical/equivalent between samples, so theoretically it would be possible to create an option to allow those to be included, though counting two junctions in place of one IS element insertion would not be great in terms of building a parsimony tree. For missing coverage, often the ends are a little different from sample to sample, so it would take some fuzzy matching to specify what is the same. These hacks might be useful for clustering samples, but I'd not recommend them for making final trees.

Some questions I would have for you that might help: are many of your JC are due to mobile element insertions and are you using you have a reference genome that allows breseq to fully predict those as MOBs? See Appendix "Reference Sequence Format" > Section "Adding IS Element Annotations" in the newest breseq v0.38.2 manual if you do not have these annotated in your reference sequence. (Our website is unfortunately down right now as far as viewing that or posting a direct link, but it's included in the release downloads.)

RayaRomm commented 5 months ago

Dear Prof. Barrick,

Thank you very much for your quick and detailed response.

Thank you for suggesting using gdtools APPLY. I wish to compare whole-genome sequencing of 20 clonal samples. Therefore, I wish to compare all samples to a specific genome rather than to update the genome with the mutations in a specific sample because I want these mutations to be apparent in the tree. In some of the samples several genes were deleted and they appear in breseq as missing coverage. I guess the best for me is to add them manually to the .gd file of the relevant sample and then run gdtools PHLYOGENY.

Does a region appear as missing coverage and not as a deletion because it is hard to determine its ends?

As far as I see, the JC that I got are not due to mobile element insertions. Perhaps they are related to the fact that I am using a genome that was de-novo assembled and there may be some uncertainties. I will try the "Adding IS Element Annotations" option to the breseq run.

Best, Raya

jeffreybarrick commented 5 months ago

Does a region appear as missing coverage and not as a deletion because it is hard to determine its ends?

Yes. breseq will only predict deletions for which it can determine to perfect nucleotide resolution the ends from split-read mapping.

As far as I see, the JC that I got are not due to mobile element insertions. Perhaps they are related to the fact that I am using a genome that was de-novo assembled and there may be some uncertainties.

There is an option to use a "contig" reference that will prevent predicting mutations near the end of each contig, which can include junctions between them. You could also try this. Instead of using -r in front of your reference, use -c. This assumes the copy numbers of all the contigs in the file are equal, so if you have plasmid contigs, split them out into separate files and pass them as a 2nd, 3rd, etc. -c or -r so their coverage is allowed to vary from that of the chromosome.

From the breseq command line help:

   -c,--contig-reference <arg>     File containing reference sequences in GenBank, GFF3, or FASTA format. The same coverage distribution will be fit to all of the
                                   reference sequences in this file simultaneously. This is appropriate when they are all contigs from a genome that should be present
                                   with the same copy number. Use of this option will improve performance when there are many contigs and especially when some are very
                                   short (≤1,000 bases).
RayaRomm commented 5 months ago

Thank you very much for your help. I was not familiar with the -c option. Raya