bg7 / BG7

bacterial genome annotation system
bg7.ohnosequences.com

Set up Bio4J with BG7 AND working AMI AND BLAST #34

Open jeffr100 opened 11 years ago

jeffr100 commented 11 years ago

Hi,

I have three questions that I am posting together.

  1. Can you please give clear instructions about how I can build a Bio4J db for my reference sequences and incorporate them into my BG7 run?
  2. Can you please fix up the AMI so that it will work out of the box? It took a lot of tweaking and I still could not get it to run properly.
  3. Can you please explain why you do a blast of the reference genes against the new genome when BLAST is optimized for the reverse of having a large database and smaller query sequences? You lose all of the benefits of BLAST indexing when the reference proteins are run as queries.

Thanks,

Jeff

eparejatobes commented 11 years ago

Hi Jeff,

I'll answer each point by number:

  1. bg7 - bio4j integration. This is not released yet; we expect to make it available in one or two months, so stay tuned!
  2. AMIs and all that. Sadly, we won't be supporting any kind of AMI for bg7. We need to update the docs on this. In the future, we plan to integrate bg7 into our EC2 deployment infrastructure, which we plan to release by the end of March. Anyway, I'd say that it's pretty simple to get this working: a jar + BLAST.
  3. why reference proteins as BLAST db. This is something fairly important for our approach, and it looks like the docs are lacking in this respect. We'll add something about this ASAP.

I'll close this once we get these docs improvements done.

jeffr100 commented 11 years ago

Hi Eduardo,

Thank you for the answers. I have tried running on EC2, but the program does not complete. tblastn runs and then the log stops at:

running tblastn: proteins vs genome sequence Thu Jan 3 03:34:02 UTC 2013

and nothing else runs.

When I try starting over where it dies by running:

java -d64 -Xmx20G -jar bg7.jar

I get the following errors:

Reading fna file... Done!! :)
Calculating complementary inverted sequences.... Done!
Parsing blastoutput XML file
java.lang.ArrayIndexOutOfBoundsException
    at java.lang.String.getChars(String.java:862)
    at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:408)
    at java.lang.StringBuilder.append(StringBuilder.java:136)
    at com.era7.bioinfo.annotation.PredictGenes.main(PredictGenes.java:189)
    at com.era7.bioinfo.annotation.PredictGenes.execute(PredictGenes.java:59)
    at com.era7.lib.bioinfo.bioinfoutil.ExecuteFromFile.main(ExecuteFromFile.java:66)
    at com.era7.bioinfo.annotation.BG7.main(BG7.java:32)
java.io.FileNotFoundException: 21172_PredictedGenes.xml (No such file or directory)
    at java.io.FileInputStream.open(Native Method)
    at java.io.FileInputStream.<init>(FileInputStream.java:137)
    at java.io.FileReader.<init>(FileReader.java:72)
    at com.era7.bioinfo.annotation.RemoveDuplicatedGenes.main(RemoveDuplicatedGenes.java:88)
    at com.era7.bioinfo.annotation.RemoveDuplicatedGenes.execute(RemoveDuplicatedGenes.java:47)
    at com.era7.lib.bioinfo.bioinfoutil.ExecuteFromFile.main(ExecuteFromFile.java:66)
    at com.era7.bioinfo.annotation.BG7.main(BG7.java:32)
java.io.FileNotFoundException: 21172_NoDuplicates.xml (No such file or directory)
    at java.io.FileInputStream.open(Native Method)
    at java.io.FileInputStream.<init>(FileInputStream.java:137)
    at java.io.FileReader.<init>(FileReader.java:72)
    at com.era7.bioinfo.annotation.SolveOverlappings.main(SolveOverlappings.java:90)
    at com.era7.bioinfo.annotation.SolveOverlappings.execute(SolveOverlappings.java:49)
    at com.era7.lib.bioinfo.bioinfoutil.ExecuteFromFile.main(ExecuteFromFile.java:66)
    at com.era7.bioinfo.annotation.BG7.main(BG7.java:32)
Jan 3, 2013 3:10:49 PM com.era7.bioinfo.annotation.GenerateFastaFiles main
SEVERE: null
java.io.FileNotFoundException: 21172_SolvedOverlaps.xml (No such file or directory)
    at java.io.FileInputStream.open(Native Method)
    at java.io.FileInputStream.<init>(FileInputStream.java:137)
    at java.io.FileReader.<init>(FileReader.java:72)
    at com.era7.bioinfo.annotation.GenerateFastaFiles.main(GenerateFastaFiles.java:86)
    at com.era7.bioinfo.annotation.GenerateFastaFiles.execute(GenerateFastaFiles.java:58)
    at com.era7.lib.bioinfo.bioinfoutil.ExecuteFromFile.main(ExecuteFromFile.java:66)
    at com.era7.bioinfo.annotation.BG7.main(BG7.java:32)
java.io.FileNotFoundException: 21172_SolvedOverlaps.xml (No such file or directory)
    at java.io.FileInputStream.open(Native Method)
    at java.io.FileInputStream.<init>(FileInputStream.java:137)
    at java.io.FileReader.<init>(FileReader.java:72)
    at com.era7.bioinfo.annotation.FillDataFromUniprot.main(FillDataFromUniprot.java:67)
    at com.era7.bioinfo.annotation.FillDataFromUniprot.execute(FillDataFromUniprot.java:48)
    at com.era7.lib.bioinfo.bioinfoutil.ExecuteFromFile.main(ExecuteFromFile.java:66)
    at com.era7.bioinfo.annotation.BG7.main(BG7.java:32)

I would really like to get this working to annotate a bunch of staph genomes so any help you can give me would be great. I have spent a lot of time trying to get this to work, so if it makes things easier to diagnose, I can give you a login onto the ec2 machine to check it.

Thanks,

Jeff

Jeffrey Rosenfeld, Ph. D Assistant Professor - New Jersey Medical School IST/High Performance and Research Computing University of Medicine and Dentistry of New Jersey (UMDNJ) 973-972-1004 (voice) 973-972-7412 (fax) MSB-C631 185 South Orange Avenue Newark, NJ 07101

pablopareja commented 11 years ago

Hi Jeff,

Could you confirm where you got that bg7.jar file? I ask because it doesn't look like the latest version, judging by the line at which you're getting the exception.

Cheers,

Pablo

jeffr100 commented 11 years ago

Hi Pablo,

I re-ran it with the new version from github and got the same error. Can you please pack up a full working executable so I can try it on my EC2 instance?

Thanks,

Jeff

logging state of the annotation to /media/ephemeral0/staph/21172.out/bg7.log
Reading fna file... Done!! :)
Calculating complementary inverted sequences.... Done!
Parsing blastoutput XML file
java.lang.ArrayIndexOutOfBoundsException
    at java.lang.String.getChars(String.java:862)
    at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:408)
    at java.lang.StringBuilder.append(StringBuilder.java:136)
    at com.era7.bioinfo.annotation.PredictGenes.main(PredictGenes.java:176)
    at com.era7.bioinfo.annotation.PredictGenes.execute(PredictGenes.java:46)
    at com.era7.lib.bioinfo.bioinfoutil.ExecuteFromFile.main(ExecuteFromFile.java:66)
    at com.era7.bioinfo.annotation.BG7.main(BG7.java:32)
java.io.FileNotFoundException: 21172_PredictedGenes.xml (No such file or directory)
    at java.io.FileInputStream.open(Native Method)
    at java.io.FileInputStream.<init>(FileInputStream.java:137)
    at java.io.FileReader.<init>(FileReader.java:72)
    at com.era7.bioinfo.annotation.RemoveDuplicatedGenes.main(RemoveDuplicatedGenes.java:80)
    at com.era7.bioinfo.annotation.RemoveDuplicatedGenes.execute(RemoveDuplicatedGenes.java:39)
    at com.era7.lib.bioinfo.bioinfoutil.ExecuteFromFile.main(ExecuteFromFile.java:66)
    at com.era7.bioinfo.annotation.BG7.main(BG7.java:32)
java.io.FileNotFoundException: 21172_NoDuplicates.xml (No such file or directory)
    at java.io.FileInputStream.open(Native Method)
    at java.io.FileInputStream.<init>(FileInputStream.java:137)
    at java.io.FileReader.<init>(FileReader.java:72)
    at com.era7.bioinfo.annotation.SolveOverlappings.main(SolveOverlappings.java:82)
    at com.era7.bioinfo.annotation.SolveOverlappings.execute(SolveOverlappings.java:41)
    at com.era7.lib.bioinfo.bioinfoutil.ExecuteFromFile.main(ExecuteFromFile.java:66)
    at com.era7.bioinfo.annotation.BG7.main(BG7.java:32)
Jan 7, 2013 3:29:01 PM com.era7.bioinfo.annotation.GenerateFastaFiles main
SEVERE: null
java.io.FileNotFoundException: 21172_SolvedOverlaps.xml (No such file or directory)
    at java.io.FileInputStream.open(Native Method)
    at java.io.FileInputStream.<init>(FileInputStream.java:137)
    at java.io.FileReader.<init>(FileReader.java:72)
    at com.era7.bioinfo.annotation.GenerateFastaFiles.main(GenerateFastaFiles.java:77)
    at com.era7.bioinfo.annotation.GenerateFastaFiles.execute(GenerateFastaFiles.java:49)
    at com.era7.lib.bioinfo.bioinfoutil.ExecuteFromFile.main(ExecuteFromFile.java:66)
    at com.era7.bioinfo.annotation.BG7.main(BG7.java:32)
java.io.FileNotFoundException: 21172_SolvedOverlaps.xml (No such file or directory)
    at java.io.FileInputStream.open(Native Method)
    at java.io.FileInputStream.<init>(FileInputStream.java:137)
    at java.io.FileReader.<init>(FileReader.java:72)
    at com.era7.bioinfo.annotation.FillDataFromUniprot.main(FillDataFromUniprot.java:63)
    at com.era7.bioinfo.annotation.FillDataFromUniprot.execute(FillDataFromUniprot.java:44)
    at com.era7.lib.bioinfo.bioinfoutil.ExecuteFromFile.main(ExecuteFromFile.java:66)
    at com.era7.bioinfo.annotation.BG7.main(BG7.java:32)


Jeffrey Rosenfeld, Ph. D IST/High Performance and Research Computing University of Medicine and Dentistry of New Jersey (UMDNJ) 973-972-1004 (voice) 973-972-7412 (fax) MSB-C631 185 South Orange Avenue Newark, NJ 07101

Sackler Institute for Comparative Genomics American Museum of Natural History

pablopareja commented 11 years ago

Hi Jeffrey,

Could you tell me what kind of instance you are using? Also, how much memory are you giving the Java process through the -Xmx option? And finally, how big is your BLAST XML file?

Cheers,

Pablo

jeffr100 commented 11 years ago

I am using an m2.2xlarge instance, -Xmx is set to 12G, and the XML file is 3GB in size.

Jeff

pablopareja commented 11 years ago

Could you try doubling the amount of memory available for Java? (-Xmx24G)

jeffr100 commented 11 years ago

Same problem with 32G of memory. Any more ideas? I am spending a lot on EC2 fees being your beta tester.

Jeff

eparejatobes commented 11 years ago

Hi again Jeff

We've run bg7 on hundreds of genomes, and we've never seen this error. It's not that you're beta testing anything. Now, if you want to get your annotations with bg7 as fast as possible, we can help in one of the following ways:

  1. If your data is open/publicly available, we're pretty happy to run bg7 on it and try to see hands-on what could be causing this issue.
  2. If not, we at era7 bioinformatics offer a (pretty cheap) bacterial annotation service, and we could have this done lightning fast :) Just drop us a line if you're interested.

Anyway, as I told you we've never seen this kind of error, and it doesn't look clear to me what could be causing it. The only two things that come to mind are

  1. In the past, we've seen some pretty strange error messages when running bg7 on top of some JVMs (old IcedTea/OpenJDK VMs) with pretty big reference protein sets. In the end, this was solved by raising the system limit on the number of open files. You can do this by adding * hard nofile <number> to /etc/security/limits.conf (you'd possibly need to reboot).
  2. BLAST sometimes crashes in unexpected (and silent) ways, particularly so when generating XML output. The error could be related in some way with an empty/corrupted BLAST xml output. Are you sure that the xml file generates just fine?
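
Both checks suggested above can be scripted before re-running bg7. The following is a minimal Python sketch, not part of bg7: the function names are made up for illustration, and it assumes a Unix machine and BLAST XML output that uses the standard `Iteration` and `Hit` element names. It reports the open-file limits of the current process and streams through the XML file to confirm that it is complete and well-formed.

```python
import os
import resource  # Unix only
import xml.etree.ElementTree as ET

# Upper bound on the length of any Java array (and so of a StringBuilder).
JAVA_MAX_ARRAY = 2**31 - 1


def open_file_limits():
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def summarize_blast_xml(path):
    """Stream through a BLAST XML file and count queries and hits.

    Raises xml.etree.ElementTree.ParseError if the file is empty,
    truncated, or otherwise not well-formed.
    """
    size = os.path.getsize(path)
    iterations = 0
    hits = 0
    for _event, elem in ET.iterparse(path, events=("end",)):
        if elem.tag == "Iteration":
            iterations += 1
            elem.clear()  # release the finished subtree
        elif elem.tag == "Hit":
            hits += 1
    return {
        "bytes": size,
        "iterations": iterations,
        "hits": hits,
        "fits_in_java_array": size <= JAVA_MAX_ARRAY,
    }
```

Calling `summarize_blast_xml("blast.xml")` (a placeholder file name) either returns the counts or raises a `ParseError` that points at the position where the file breaks. The `fits_in_java_array` field rests on an assumption that nothing in this thread confirms: if the parsing step buffers the whole file in a single `StringBuilder`, as the `StringBuilder.append` frames in the stack trace might suggest, a file larger than 2^31 - 1 characters could not fit no matter how large -Xmx is.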