barricklab / breseq

breseq is a computational pipeline for finding mutations relative to a reference sequence in short-read DNA resequencing data. It is intended for haploid microbial genomes (<20 Mb). breseq is a command line tool implemented in C++ and R.
http://barricklab.org/breseq
GNU General Public License v2.0

Making custom GenBank reference files #187

Closed barrel0luck closed 5 years ago

barrel0luck commented 5 years ago

How does one go about making custom GenBank files? For instance, if I've made large rearrangements of genes or genomic regions, how can I make a custom GenBank file for use with breseq?

jeffreybarrick commented 5 years ago

To create a custom reference genome for breseq that has gene annotations, we typically use a GFF3 file instead of a GenBank file. This format is much easier to edit by hand. You can also use the gdtools APPLY command to apply the mutations listed in a GD file (for example, a list of mutations found across all of your samples that were clearly present in the ancestor) to the reference and generate the mutant genome. To create a starting GFF3 file from a GenBank file, you can use breseq CONVERT-REFERENCE -f GFF3 input.gbk.

If you want to edit the GenBank file directly, you could use programs like Benchling or Geneious or (I think?) NCBI's Sequin tool.

barrel0luck commented 5 years ago

Thanks for the quick response. That's great if GFF3 can be used. I was not aware of that; I thought that only GenBank files could be used. I should be able to edit GFF3s easily. Would it also need an accompanying FASTA file for the change in nucleotide sequence? How are base changes specified? The command you've given seems to need an input.gbk file (sorry if this is confusing).

I've tried Benchling before, but it's a bit slow for making large-scale changes in a big genome file. Geneious doesn't work on Linux (I think) and is not open source (I think). I thought Sequin was for making submissions to NCBI...

Can you point me to the relevant link in your manual regarding this? Currently I'm looking at this one: http://barricklab.org/twiki/pub/Lab/ToolsBacterialGenomeResequencing/documentation/test_drive.html#reference-sequence And I don't seem to find it in the command line help output (it's possible that I'm just blind)...

jeffreybarrick commented 5 years ago

All of the breseq and gdtools commands will take FASTA, GFF3, or GenBank reference files as input. They will only output GFF3 or FASTA formatted files.

You can provide the DNA sequence of the reference as a FASTA section in the GFF3 file itself. http://gmod.org/wiki/GFF3
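As a rough illustration of what such a combined file looks like, here is a minimal Python sketch that writes feature lines followed by a ##FASTA section, as described in the GFF3 format. The function name build_gff3_with_fasta, the seq_id my_genome, the "custom" source label, the toy sequence, and the attribute values are all made up for illustration. Which attributes breseq actually reads is not covered in this thread, so treat the output as a starting point rather than a guaranteed-valid breseq reference.

```python
def build_gff3_with_fasta(seq_id, sequence, features, line_width=60):
    """Return GFF3 text: feature lines followed by a ##FASTA section.

    features is an iterable of (type, start, end, strand, attributes)
    tuples using 1-based, inclusive coordinates, as GFF3 expects.
    """
    lines = [
        "##gff-version 3",
        f"##sequence-region {seq_id} 1 {len(sequence)}",
    ]
    for ftype, start, end, strand, attrs in features:
        # Nine tab-separated columns: seqid, source, type, start, end,
        # score, strand, phase, attributes. "." means "not applicable".
        # (CDS features would need a real phase value instead of ".")
        lines.append("\t".join(
            [seq_id, "custom", ftype, str(start), str(end),
             ".", strand, ".", attrs]
        ))
    # Everything after the ##FASTA directive is ordinary FASTA.
    lines.append("##FASTA")
    lines.append(f">{seq_id}")
    for i in range(0, len(sequence), line_width):
        lines.append(sequence[i:i + line_width])
    return "\n".join(lines) + "\n"


# Hypothetical toy data, not a real genome.
toy_sequence = "ATGAAACGCATTAGCACCACCATTACCACCACCATCACCATTACCACAGGTAACGGTGCGGGCTGA"
gff3_text = build_gff3_with_fasta(
    "my_genome",
    toy_sequence,
    [("gene", 1, len(toy_sequence), "+", "ID=gene0001;Name=toyA")],
)
print(gff3_text)
```

The key point is that the FASTA header (>my_genome) uses the same identifier as the first column of the feature lines, so the features and the sequence can be paired up.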

Or, you can provide two files by using the -r option twice, like -r reference_features.gff -r reference_sequence.fa, to provide the features in one file and the sequence in another, as long as the seq_ids match up between the files.
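Since the two-file approach depends on the seq_ids agreeing, it can help to check them before running anything. The following is a small Python sketch of such a check; the helper names (gff3_seq_ids, fasta_seq_ids, mismatched_seq_ids) and the example data are hypothetical, and it assumes the FASTA identifier is the first whitespace-delimited token after ">". It is not part of breseq.

```python
def gff3_seq_ids(gff3_text):
    """Collect the seq_ids (column 1) from the feature lines of a GFF3 file."""
    ids = set()
    for line in gff3_text.splitlines():
        # Stop at an embedded sequence section, if there is one.
        if line.startswith("##FASTA") or line.startswith(">"):
            break
        # Skip blank lines, directives, and comments.
        if not line.strip() or line.startswith("#"):
            continue
        ids.add(line.split("\t", 1)[0])
    return ids


def fasta_seq_ids(fasta_text):
    """Collect sequence identifiers from FASTA header lines."""
    ids = set()
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                ids.add(tokens[0])
    return ids


def mismatched_seq_ids(gff3_text, fasta_text):
    """Return (ids only in the GFF3, ids only in the FASTA)."""
    gff_ids = gff3_seq_ids(gff3_text)
    fa_ids = fasta_seq_ids(fasta_text)
    return gff_ids - fa_ids, fa_ids - gff_ids


# Hypothetical example data.
features = "##gff-version 3\nchr1\tcustom\tgene\t1\t12\t.\t+\t.\tID=g1\n"
good_fasta = ">chr1 toy chromosome\nATGAAACGCTGA\n"
bad_fasta = ">chromosome_1\nATGAAACGCTGA\n"

print(mismatched_seq_ids(features, good_fasta))
print(mismatched_seq_ids(features, bad_fasta))
```

Two empty sets mean every seq_id appears in both files; anything else names the identifiers that need to be renamed in one file or the other.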

barrel0luck commented 5 years ago

Got it, thanks!!!