bg7 / BG7

bacterial genome annotation system
bg7.ohnosequences.com

arguments #40

Open anyusernamenottaken opened 11 years ago

anyusernamenottaken commented 11 years ago

Hi team,

I've been trying to get BG7 working for a while (annotating a ~5MB Streptomyces genome), and now think I've found a couple of bugs and workarounds. I'm working with the latest version as of July/August 2013:

  1. When running the /bin/bg7 script, at line 396, when the program goes to make a copy of bg7.jar in the output directory, it can't locate the original. I noticed that the script is looking for "/jar/bg7.jar" and the directory structure for the file is "/jars/bg7.jar", or vice versa. But correcting this still didn't lead the script to find it, so I just wrote the entire path into this line.
  2. Once the program got as far as the PredictGenes script, it returned an error about expecting 6 inputs, but the output of the /bin/bg7 wrapper script only writes 5 arguments. I had to add a line to also write the Dif_span:30 argument so PredictGenes could proceed. Also: the default values for the last 3 arguments input to the PredictGenes.jar were 400, true, and 30 (I took the 30 value for dif_span from code buried within the default PredictGenes.jar file). Are these values appropriate for most microbes? Is there any documentation about what the arguments mean, in bioinformatic terms? It looked like the boolean value was describing whether the genome was viral, which seems like an odd default value to choose.
  3. I've run into other java errors about heap space and array index being out of bounds, but your earlier responses about these problems have helped me work around them, I think. I've gotten the test data you included with the code to work, as well as test runs of my genome when I pare down the reference protein or RNA set.

Finally, assuming my latest (long) runs end up working, I'm curious about improving the predict genes portion of the algorithm: Streptomycetes use GTG a lot as an alternative start codon; RAST catches this, but most of the genes output in the pared-down test runs of my genome with BG7 forced nearby ATG codons for the start of genes. Of course I don't know it's wrong, but given other published genomes, I'd suspect the RAST starts are closer to the truth. Is there any way to alter the BG7 code to bias gene calling towards a known organism-specific codon usage?

Thanks, Drew

rtobes commented 11 years ago

The template available at https://github.com/bg7/BG7/blob/master/executionsTemplate.xml includes the correct parameters for bacteria:

dif-span is a parameter related to the joining of different HSPs to build a gene from coherent fragments that probably belong to the same gene (see CDS Definition at http://bg7.ohnosequences.com/docs/how-it-works/). It sets the maximum difference allowed, when joining fragments, between the distance separating two HSPs in the reference protein and the distance separating the corresponding aligned fragments in the sequence under analysis.

<execution>
    <class_full_name>com.era7.bioinfo.annotation.PredictGenes</class_full_name>
    <arguments>
        <argument>XX_proteins_tBLASTn.xml</argument>
        <argument>XX_sequences_header_fixed.fna</argument>
        <argument>XX_PredictedGenes.xml</argument>
        <argument>400</argument>
        <argument>false</argument>
        <argument>30</argument>
    </arguments>
</execution>
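
To make the dif-span description above concrete, here is a minimal sketch of how such a check could look. It is not BG7's actual code: the class `DifSpanSketch`, the `Hsp` type, and the `canJoin` method are hypothetical names, the sketch only handles HSPs on the forward strand, and it assumes the nucleotide gap is converted to codons (divided by 3) before being compared with the protein gap. How BG7 really measures and compares the two distances may differ.

```java
// Hypothetical illustration of a dif-span style check; not taken from BG7.
final class DifSpanSketch {

    /** Minimal HSP: coordinates on the reference protein (amino acids)
     *  and on the sequence under analysis (nucleotides, forward strand). */
    static final class Hsp {
        final int proteinStart, proteinEnd, seqStart, seqEnd;

        Hsp(int proteinStart, int proteinEnd, int seqStart, int seqEnd) {
            this.proteinStart = proteinStart;
            this.proteinEnd = proteinEnd;
            this.seqStart = seqStart;
            this.seqEnd = seqEnd;
        }
    }

    /** Returns true when the gap between the two HSPs in the protein and the
     *  gap between them in the sequence differ by at most difSpan. */
    static boolean canJoin(Hsp first, Hsp second, int difSpan) {
        int proteinGap = second.proteinStart - first.proteinEnd;   // amino acids
        // Assumption: convert the nucleotide gap to codons so both gaps share units.
        int sequenceGap = (second.seqStart - first.seqEnd) / 3;
        return Math.abs(proteinGap - sequenceGap) <= difSpan;
    }

    public static void main(String[] args) {
        Hsp a = new Hsp(1, 100, 1000, 1299);
        Hsp near = new Hsp(111, 200, 1330, 1599);  // gaps: 11 aa vs 10 codons
        Hsp far = new Hsp(111, 200, 1900, 2169);   // gaps: 11 aa vs 200 codons
        System.out.println(canJoin(a, near, 30));  // true
        System.out.println(canJoin(a, far, 30));   // false
    }
}
```

With a dif-span of 30, the first pair would be treated as fragments of the same gene, while the second pair would not, because the fragments are much further apart in the sequence than in the reference protein.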

rtobes commented 11 years ago

These are the start codons that we use for bacteria (when the virus flag argument is false):

public static final String[] START_CODONS = {"ATG","CTG","GTG","TTG"};

We search for any of these codons upstream of the most upstream BLAST HSP detected with the protein responsible for the prediction of that gene. In many cases the codon selected as the start codon for the newly predicted gene is the first codon of the first HSP (considered in the orientation of the protein).
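
The upstream search described above can be sketched roughly as follows. This is an illustration only, not BG7's implementation: the class `StartCodonSketch` and the method `findStart` are hypothetical, the sketch works on the forward strand only, it picks the start codon nearest to the HSP, and it assumes the search gives up at an in-frame stop codon. Only the `START_CODONS` array comes from the code quoted above.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration of an in-frame upstream start codon search; not taken from BG7.
final class StartCodonSketch {

    // Start codons quoted in this thread for the non-virus case.
    static final String[] START_CODONS = {"ATG", "CTG", "GTG", "TTG"};

    private static final Set<String> STARTS = new HashSet<>(Arrays.asList(START_CODONS));
    // Assumption: the search stops when it meets an in-frame stop codon.
    private static final Set<String> STOPS = new HashSet<>(Arrays.asList("TAA", "TAG", "TGA"));

    /** Walks upstream, codon by codon, from the first codon of the most upstream HSP.
     *  Returns the 0-based index of the first start codon found, or -1 if none. */
    static int findStart(String seq, int hspStart) {
        for (int pos = hspStart; pos >= 0 && pos + 3 <= seq.length(); pos -= 3) {
            String codon = seq.substring(pos, pos + 3).toUpperCase();
            if (STARTS.contains(codon)) {
                return pos;   // may be the first codon of the HSP itself
            }
            if (STOPS.contains(codon)) {
                break;        // no start codon in this open reading frame
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(findStart("AAGTGCCCAAAGGG", 8)); // 2 (GTG upstream of the HSP)
        System.out.println(findStart("CCCATGAAA", 3));      // 3 (HSP starts on ATG)
        System.out.println(findStart("GTGTAACCC", 6));      // -1 (stop codon reached first)
    }
}
```

Because the search starts at the HSP and only moves upstream in frame, the start it finds depends directly on where the alignment with the reference protein begins.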

If you are losing some GTG starts, it could be related to the proteins used as reference.

We have in mind some improvements for gene start and end prediction for the next version of BG7.