bg7 / BG7

bacterial genome annotation system
bg7.ohnosequences.com

tBLASTn parameters and error with the PredictGenes program #35

Open ptreven opened 11 years ago

ptreven commented 11 years ago

Hi,

I must say you did great work with BG7. I am trying to use the BG7 pipeline to annotate our Lactobacillus gasseri strain (454 sequencing plus Sanger gap closures, resulting in large contigs). Firstly, I have a question about tBLASTn parameters. What parameters do you usually use for the tBLASTn comparison (e-value, word size, penalties, ...)?

Additionally, which Extension_threshold and Overlapping_threshold values do you recommend starting with for this organism?

Secondly, I encountered a strange error with the PredictGenes program. Here are the last few lines of the output:

hit = contig00004 length=117373 numreads=6125
Analyzing hsps hit, there are 1
Entering while...
hsps.size()=1
Iteration Q047C0 has 0hits
Iteration Q047E4 has 1hits
hit = contig00009 length=78808 numreads=7139
Analyzing hsps hit, there are 1
Entering while...
hsps.size()=1
Iterations finished !!! :)
java.lang.NullPointerException
at com.era7.bioinfo.annotation.PredictGenes.main(PredictGenes.java:446)
C:\Users\PTreven\Downloads\BG7\BG7-master\jars>

Interestingly, this error occurred only when I used XX_sequences_header_fixed.fna as input. When I used the original XX_sequences.fna, the program finished successfully. FixFastaHeaderQC.jar reported no problems.

By the way, I am using Windows 7 on an Intel Core i7 at 3.4 GHz with 8 GB of RAM.

Thank you in advance for all the answers!

PTreven

pablopareja commented 11 years ago

Hi ptreven,

Thanks for opening the issue. Could you confirm where you got the file BG7.jar from? I just checked the line where the exception is thrown and, at least in the latest version, that line is actually commented out... :confused:

Cheers,

Pablo

ptreven commented 11 years ago

Hi Pablo,

I downloaded the BG7.jar from https://github.com/bg7/bg7 by clicking "Download this repository as a zip file" on 20.2.2013.

Thanx!

Primoz

pablopareja commented 11 years ago

OK, another question then: when you say you ran the program FixFastaHeaders, you did that before generating your BLAST XML files, right?

Perhaps this point is not clearly explained in the documentation, but you are supposed to execute this program (when needed) as a preprocessing tool for the input FASTA files at the very beginning of the process, that is to say, even before launching any BLAST (@rtobes, @marina-manrique please correct me if I'm wrong about this).

rtobes commented 11 years ago

Yes, the program FixFastaHeaders has to be run before launching BLAST.
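A quick way to check that a BLAST XML file and a FASTA file belong together is to compare the hit names with the FASTA headers. The snippet below is only a minimal sketch, not part of BG7: it assumes the hit name is stored in the `Hit_def` element of the BLAST XML (as the `hit = contig00004 length=...` lines in the log suggest), and the shortened "fixed" headers in the example are made up for illustration, since the thread does not show what FixFastaHeaders actually writes.

```python
import xml.etree.ElementTree as ET


def fasta_headers(fasta_text):
    """Return the set of header lines (without the leading '>') in a FASTA string."""
    return {line[1:].strip() for line in fasta_text.splitlines() if line.startswith(">")}


def blast_hit_defs(blast_xml_text):
    """Return the set of <Hit_def> values found in a BLAST XML report."""
    root = ET.fromstring(blast_xml_text)
    return {el.text.strip() for el in root.iter("Hit_def") if el.text}


def unmatched_hits(fasta_text, blast_xml_text):
    """Hit names present in the BLAST XML but absent from the FASTA headers."""
    return sorted(blast_hit_defs(blast_xml_text) - fasta_headers(fasta_text))


# Hypothetical data: BLAST was run on the original FASTA, but the
# header-fixed FASTA (headers invented here) is given to the next step.
fixed_fasta = ">contig00004\nATGC\n>contig00009\nGGCC\n"
blast_xml = (
    "<BlastOutput>"
    "<Hit><Hit_def>contig00004 length=117373 numreads=6125</Hit_def></Hit>"
    "<Hit><Hit_def>contig00009</Hit_def></Hit>"
    "</BlastOutput>"
)
print(unmatched_hits(fixed_fasta, blast_xml))
```

Any name listed by `unmatched_hits` points to a BLAST result that was produced from a different version of the FASTA file, which would be consistent with the situation described above.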

ptreven commented 11 years ago

OK, now it works, thanx! Since launching that program was listed first in the executionsTamplate.xml, I assumed it was the program to run just before PredictGenes... sorry :)

Great, one mystery is solved, two more to go ;) Thanx again!

Primoz

pablopareja commented 11 years ago

No problem ;) Now that you say it, it's true that the documentation may be confusing in that sense. I will update it as soon as I have some time.

Regarding your other two questions:

"What are the usual parameters you use for the tBLASTn comparison (e-value, word size, penalties, ...)? Additionally, which Extension_threshold and Overlapping_threshold values do you recommend starting with for this organism?"

@rtobes , @marina-manrique Could you chime in on this?

Cheers,

Pablo

rtobes commented 11 years ago

It depends on the goals of your annotation, but you can try an e-value of 10E-20 and default values for the rest of the tBLASTn parameters. You can choose the Extension_threshold depending on the similarity you expect with your reference proteins, the sequencing technology used (different error rates), and which kind of gene misprediction you prefer to tolerate (predicted genes longer or shorter than the true ones). You must set Overlapping_threshold depending on the maximum expected size of overlapping fragments between genes in your bacterium, and to some extent it also depends on the Extension_threshold. If you set Extension_threshold to a high value, you probably need to be tolerant of overlaps and set Overlapping_threshold high as well.
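As a rough illustration of that starting point, the sketch below builds a tBLASTn command line with only the e-value changed from the defaults. It assumes the NCBI BLAST+ `tblastn` executable (not the legacy `blastall`), reads 10E-20 as 1e-20, and requests XML output (`-outfmt 5`) because the thread mentions BLAST XML files; the file and database names are placeholders.

```python
import shlex


def tblastn_command(proteins_fasta, contigs_db, out_xml, evalue="1e-20"):
    """Build a BLAST+ tblastn argument list; every other parameter stays at its default."""
    return [
        "tblastn",
        "-query", proteins_fasta,  # reference proteins (placeholder file name)
        "-db", contigs_db,         # nucleotide database built from the contigs
        "-evalue", evalue,         # assumption: 10E-20 is meant as 1e-20
        "-outfmt", "5",            # BLAST XML output
        "-out", out_xml,
    ]


cmd = tblastn_command("reference_proteins.fasta", "contigs_db", "blast_result.xml")
print(shlex.join(cmd))
```

The printed command can be pasted into a terminal, or the list can be passed to `subprocess.run(cmd, check=True)` once BLAST+ is installed and the contig database has been built from the header-fixed FASTA file.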

You can test the system with a fragment of an available, well-annotated genome and analyze the results obtained with different values for Extension_threshold and Overlapping_threshold.
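One possible way to analyze such test runs is to compare the predicted gene coordinates with the reference annotation and count how often predictions are exact, too long, or too short. The snippet below is only a sketch under stated assumptions: genes are given as hypothetical `(start, end, strand)` tuples with inclusive coordinates, the toy data is invented, and extracting the coordinates from the BG7 output and the reference annotation is left out.

```python
def overlap(a, b):
    """Number of shared positions between two inclusive (start, end, strand) intervals."""
    if a[2] != b[2]:
        return 0
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


def classify(reference, predicted):
    """For each reference gene, classify its best-overlapping prediction."""
    counts = {"exact": 0, "longer": 0, "shorter": 0, "shifted": 0, "missed": 0}
    for ref in reference:
        best = max(predicted, key=lambda p: overlap(ref, p), default=None)
        if best is None or overlap(ref, best) == 0:
            counts["missed"] += 1      # no prediction overlaps this gene
        elif best[:2] == ref[:2]:
            counts["exact"] += 1       # same start and end
        elif best[1] - best[0] > ref[1] - ref[0]:
            counts["longer"] += 1      # prediction longer than the true gene
        elif best[1] - best[0] < ref[1] - ref[0]:
            counts["shorter"] += 1     # prediction shorter than the true gene
        else:
            counts["shifted"] += 1     # same length, different position
    return counts


# Toy coordinates, made up for illustration only.
reference = [(100, 400, "+"), (500, 900, "-"), (1000, 1300, "+")]
predicted = [(100, 400, "+"), (450, 900, "-")]
print(classify(reference, predicted))
```

Running `classify` once per combination of Extension_threshold and Overlapping_threshold gives a small table of counts, which makes it easier to see whether a setting tends toward over-extended or truncated genes.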

Raquel

ptreven commented 11 years ago

Thanx for all the information! Now I just have to do some annotation :)

Bye!

Primoz