Open anyusernamenottaken opened 11 years ago
The template available at https://github.com/bg7/BG7/blob/master/executionsTemplate.xml includes the correct parameters for bacteria:
dif-span is a parameter that controls how different HSPs are joined to build a gene from coherent fragments that probably belong to the same gene (see CDS Definition at http://bg7.ohnosequences.com/docs/how-it-works/). For two HSPs to be joined, it sets the maximum allowed difference between the distance separating them in the reference protein and the distance separating the corresponding aligned fragments in the sequence under analysis.
<execution>
<class_full_name>com.era7.bioinfo.annotation.PredictGenes</class_full_name>
<arguments>
<argument>XX_proteins_tBLASTn.xml</argument>
<argument>XX_sequences_header_fixed.fna</argument>
<argument>XX_PredictedGenes.xml</argument>
<argument>400</argument>
<argument>false</argument>
<argument>30</argument>
</arguments>
</execution>
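To make the dif-span rule above more concrete, here is a minimal sketch of the kind of check it describes. This is not the BG7 source: the class name `DifSpanSketch`, the method `canJoin`, and the conversion of the protein distance to nucleotides (multiplying by 3) are all assumptions made for illustration, and the threshold value used in `main` is arbitrary.

```java
// Hypothetical illustration of a dif-span style check; not taken from BG7.
final class DifSpanSketch {

    /**
     * Decides whether two HSPs could be joined into one gene.
     *
     * @param proteinGapAa  distance between the two HSPs in the reference protein (amino acids)
     * @param sequenceGapNt distance between the aligned fragments in the analysed sequence (nucleotides)
     * @param difSpan       maximum allowed difference between the two distances
     */
    static boolean canJoin(int proteinGapAa, int sequenceGapNt, int difSpan) {
        // Assumption: distances are compared in nucleotides, so 1 amino acid = 3 nt.
        int expectedGapNt = proteinGapAa * 3;
        return Math.abs(sequenceGapNt - expectedGapNt) <= difSpan;
    }

    public static void main(String[] args) {
        int difSpan = 30; // arbitrary example threshold
        System.out.println(canJoin(10, 45, difSpan));  // gaps differ by 15 nt
        System.out.println(canJoin(10, 500, difSpan)); // gaps differ by 470 nt
    }
}
```

Under these assumptions, a larger dif-span lets fragments be joined across bigger insertions or deletions relative to the reference protein, at the cost of a higher risk of merging fragments that belong to different genes.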
These are the start codons that we use for bacteria (when the virus flag argument is false):
public static final String[] START_CODONS = {"ATG","CTG","GTG","TTG"};
We search for any of these codons upstream of the most upstream BLAST HSP detected with the protein responsible for the prediction of that gene. In many cases the codon selected as the start codon for the newly predicted gene is the first codon of the first HSP (considered in the orientation of the protein).
If you are missing some GTG starts, it could be related to the proteins used as reference.
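The upstream search described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the BG7 implementation: the class `StartCodonSketch` and method `findStart` are hypothetical, the sketch handles the forward strand only, it expects an upper-case sequence, it returns the first in-frame start codon found at or upstream of the HSP start, and it gives up at an in-frame stop codon. How BG7 actually chooses among candidate codons is not stated here.

```java
// Hypothetical illustration of an in-frame upstream start codon search; not taken from BG7.
final class StartCodonSketch {

    static final String[] START_CODONS = {"ATG", "CTG", "GTG", "TTG"};
    // Assumption: the standard stop codons end the search.
    static final String[] STOP_CODONS = {"TAA", "TAG", "TGA"};

    private static boolean contains(String[] codons, String codon) {
        for (String c : codons) {
            if (c.equals(codon)) {
                return true;
            }
        }
        return false;
    }

    /**
     * Walks upstream, codon by codon, from the first codon of the most upstream HSP.
     *
     * @param seq      upper-case nucleotide sequence (forward strand)
     * @param hspStart 0-based index of the first nucleotide of the most upstream HSP
     * @return 0-based index of the chosen start codon, or -1 if none is found
     */
    static int findStart(String seq, int hspStart) {
        if (hspStart < 0 || hspStart + 3 > seq.length()) {
            return -1;
        }
        for (int pos = hspStart; pos >= 0; pos -= 3) {
            String codon = seq.substring(pos, pos + 3);
            if (contains(START_CODONS, codon)) {
                return pos; // the first codon of the HSP itself is often the hit
            }
            if (contains(STOP_CODONS, codon)) {
                return -1; // an in-frame stop means no start further upstream in this frame
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // The HSP starts at index 9; GTG at index 3 is the nearest in-frame start codon.
        System.out.println(findStart("TAAGTGAAACCC", 9));
    }
}
```

In a search of this shape, all four codons are treated equally, so the result depends on where the first HSP begins; that is consistent with the point that the reference proteins influence which starts are found.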
We have in mind some improvements for gene start and end prediction for the next version of BG7.
Hi team,
I've been trying to get BG7 working for a while (annotating a ~5MB Streptomyces genome), and now think I've found a couple of bugs and workarounds. I'm working with the latest version as of July/August 2013:
Finally, assuming my latest (long) runs end up working, I'm curious about improving the predict genes portion of the algorithm: Streptomycetes use GTG a lot as an alternative start codon; RAST catches this, but most of the genes output in the pared-down test runs of my genome with BG7 forced nearby ATG codons for the start of genes. Of course I don't know it's wrong, but given other published genomes, I'd suspect the RAST starts are closer to the truth. Is there any way to alter the BG7 code to bias gene calling towards a known organism-specific codon usage?
Thanks, Drew